Multi-Scale Spatio-Temporal Feature-Driven Fault Diagnosis for Quadrotor UAV Formations

In recent years, rapid advances in micro-electromechanical systems, high-energy-density batteries, and navigation and control algorithms have driven significant progress in UAV technology. Quadrotor UAVs, with their vertical take-off and landing capability, high maneuverability, and ease of deployment, are now widely used in agricultural plant protection, emergency rescue, military reconnaissance, and entertainment performances. Through multi-agent coordination, a UAV formation can accomplish tasks that a single UAV cannot, with higher efficiency, broader coverage, and stronger robustness.

However, in complex mission scenarios, the reliability of quadrotor formation systems faces severe challenges, with actuator faults being one of the most critical threats. Since actuators directly determine the thrust output and attitude control accuracy of a UAV, their failure not only compromises the flight stability of a single UAV but also propagates disturbances through communication and coordination links, ultimately degrading the collaborative performance of the entire formation. The presence of wind disturbances further exacerbates this challenge: wind loads generate random, time-varying forces that distort the flight dynamics of the UAV while simultaneously complicating sensor signal characteristics and masking fault-related patterns. Therefore, actuator fault diagnosis under wind disturbance requires the model to possess the ability to accurately distinguish intrinsic fault features from external disturbances. Neglecting these factors can easily lead to delayed fault detection, biased localization, or underestimation of fault severity, posing significant risks to the stable operation and successful mission completion of the formation.

Existing UAV fault diagnosis methods can be broadly categorized into three types: model-based methods, signal-based methods, and data-driven methods. Model-based methods rely on precise mathematical models of UAV flight dynamics, designing observers or residual generators to detect deviations between predicted and measured states. While these methods offer strong interpretability, they are highly sensitive to modeling uncertainties and environmental disturbances. Signal-based methods analyze time-domain, frequency-domain, or time-frequency-domain features of sensor data for fault diagnosis, but they struggle with compound faults and non-stationary flight conditions. In the data-driven fault diagnosis paradigm, machine learning methods learn implicit patterns from data to classify fault states. However, these methods require effective feature extraction before classification, demanding substantial domain expertise for appropriate preprocessing techniques. In contrast, deep learning integrates feature extraction and fault classification into an end-to-end unified framework, achieving more efficient and automated fault diagnosis. Saied et al. proposed a deep reinforcement learning-based method for UAV actuator fault diagnosis, combining the automatic feature extraction advantages of deep learning with the interactive learning characteristics of reinforcement learning, effectively improving diagnostic accuracy and robustness. Liu et al. proposed an audio-signal-based quadrotor fault diagnosis scheme, using CNN and transfer learning to achieve blade damage detection by analyzing time-frequency spectra of flight noise.

Despite these advancements, existing fault diagnosis methods still exhibit three significant limitations. First, most research focuses on endogenous fault diagnosis of single UAVs, where each UAV independently diagnoses its own faults. This is inadequate for quadrotor formation systems, where the operational states of individual UAVs are tightly coupled through communication links, so exogenous fault diagnosis that uses neighbor state information is crucial. Second, existing studies rarely consider the impact of wind disturbances, which significantly increase the difficulty of fault signal extraction and feature representation. Third, current research often simplifies fault diagnosis into a single-task classification problem, whereas practical engineering scenarios require three objectives to be met simultaneously: fault localization, fault type identification, and fault severity quantification. Research on multi-task fault diagnosis for UAV formations is still in its infancy, and published results are scarce. These shortcomings limit the generalization performance of existing models in real formation scenarios with environmental uncertainties.

In recent years, attention mechanisms and multi-scale feature extraction techniques have become effective means to enhance the robustness of fault diagnosis. Attention mechanisms enable models to selectively focus on task-relevant features, thereby strengthening the capture of global information. Multi-scale convolutional modules can extract feature information at different resolutions, offering significant advantages for capturing complex spatio-temporal patterns in non-stationary flight environments. However, the integration of these technologies into a distributed fault diagnosis framework for quadrotor formations under wind disturbance remains an area with considerable room for exploration.

To address the limitations of existing research, which overly relies on single-UAV endogenous diagnosis, inadequately considers wind disturbance effects, and fails to meet multi-dimensional diagnostic needs with single-task approaches, this paper proposes a cascaded multi-task fault diagnosis model integrating a variable-level attention mechanism and a multi-scale spatio-temporal feature extractor. The main contributions are summarized as follows:

1. A novel multi-scale spatio-temporal feature extractor based on an attention mechanism is proposed. This extractor is embedded within a cascaded multi-task fault diagnosis network designed specifically for actuator fault diagnosis in quadrotor UAV formations under wind disturbance. The core objective is to jointly perform fault localization, fault type identification, and fault severity quantification, addressing the loss of diagnostic accuracy caused by the mixing of wind disturbance noise and fault features.

2. A variable-level attention mechanism integrated with the multi-scale spatio-temporal feature extractor is designed. This mechanism adaptively prioritizes flight parameters relevant to the fault, suppressing redundant information dominated by wind disturbances. Simultaneously, two-dimensional multi-scale convolutions are used to capture spatial features from the flight data of the quadrotor formation, while a Gated Recurrent Unit (GRU) models the temporal evolution of the fault. A cascaded multi-task architecture is developed on this basis, enabling joint diagnosis of fault localization, fault type classification, and fault severity quantification, breaking through the limitations of the single-task diagnostic paradigm.

3. A quadrotor formation simulation platform is constructed to simulate various actuator fault types and wind disturbances of different intensities, generating a fault dataset as a validation benchmark. Experiments demonstrate that the proposed method exhibits good fault diagnosis accuracy and robustness.

Problem Formulation

This section introduces the quadrotor UAV model, formation control method, actuator faults, and wind disturbances, describing the fault problem for quadrotor formations and defining the research objectives of this paper.

2.1 Quadrotor Formation Modeling

The kinematic and dynamic models of each UAV are built upon classical quadrotor modeling theory. The motion state of a UAV is characterized by its position, velocity, and acceleration in the inertial frame, as well as its attitude angles and angular velocities in the body frame. These parameters fully describe the spatial motion behavior of the aircraft. The dynamic equations are expressed as follows:

$$
\begin{cases}
\ddot{x} = (\cos\phi \sin\theta \cos\psi + \sin\phi \sin\psi) \frac{T}{M}, \\
\ddot{y} = (\cos\phi \sin\theta \sin\psi - \sin\phi \cos\psi) \frac{T}{M}, \\
\ddot{z} = (\cos\phi \cos\theta) \frac{T}{M} - g, \\
\ddot{\phi} = \dot{\theta}\dot{\psi} \frac{J_{yy} - J_{zz}}{J_{xx}} + \frac{l}{J_{xx}} + \frac{d_\phi}{J_{xx}}, \\
\ddot{\theta} = \dot{\phi}\dot{\psi} \frac{J_{zz} - J_{xx}}{J_{yy}} + \frac{m}{J_{yy}} + \frac{d_\theta}{J_{yy}}, \\
\ddot{\psi} = \dot{\phi}\dot{\theta} \frac{J_{xx} - J_{yy}}{J_{zz}} + \frac{n}{J_{zz}} + \frac{d_\psi}{J_{zz}}.
\end{cases}
$$

where $M$ is the mass of the quadrotor, $g$ is the gravitational acceleration, $(x, y, z)$ is the position in the inertial frame, $(\phi, \theta, \psi)$ are the roll, pitch, and yaw angles, $T$ is the total thrust, $l, m, n$ are the roll, pitch, and yaw moments, $(J_{xx}, J_{yy}, J_{zz})$ are the moments of inertia, and $d_\phi, d_\theta, d_\psi$ denote the disturbance moments acting about the roll, pitch, and yaw axes.
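As a minimal sketch, the six equations above can be evaluated directly. The function name, argument layout, and the numerical mass and inertia values below are illustrative assumptions, not values taken from the paper.

```python
import math

def quadrotor_accelerations(att, att_rate, thrust, moments, mass, inertia,
                            disturbance=(0.0, 0.0, 0.0), g=9.81):
    """Evaluate the translational and rotational accelerations of the model."""
    phi, theta, psi = att
    dphi, dtheta, dpsi = att_rate
    l, m, n = moments                      # roll, pitch, yaw moments
    jxx, jyy, jzz = inertia
    d_phi, d_theta, d_psi = disturbance    # disturbance moments
    a = thrust / mass                      # T / M
    x_dd = (math.cos(phi) * math.sin(theta) * math.cos(psi)
            + math.sin(phi) * math.sin(psi)) * a
    y_dd = (math.cos(phi) * math.sin(theta) * math.sin(psi)
            - math.sin(phi) * math.cos(psi)) * a
    z_dd = math.cos(phi) * math.cos(theta) * a - g
    phi_dd = dtheta * dpsi * (jyy - jzz) / jxx + l / jxx + d_phi / jxx
    theta_dd = dphi * dpsi * (jzz - jxx) / jyy + m / jyy + d_theta / jyy
    psi_dd = dphi * dtheta * (jxx - jyy) / jzz + n / jzz + d_psi / jzz
    return (x_dd, y_dd, z_dd, phi_dd, theta_dd, psi_dd)

# Hypothetical 1.2 kg vehicle hovering level: thrust equals weight.
hover = quadrotor_accelerations(
    att=(0.0, 0.0, 0.0), att_rate=(0.0, 0.0, 0.0),
    thrust=1.2 * 9.81, moments=(0.0, 0.0, 0.0),
    mass=1.2, inertia=(0.01, 0.01, 0.02))
```

In the hover case all six accelerations are zero up to floating-point rounding, which is a quick sanity check on the signs of the gravity and thrust terms.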

The communication topology of the quadrotor formation is described by an undirected connected graph:

$$
G = (V, E, A).
$$

where $V = \{0, 1, \dots, N\}$ is the set of nodes, $E \subseteq V \times V$ represents the set of communication links, and $A$ is the adjacency matrix describing the dynamic communication topology. This graph-based representation can simulate uncertain communication conditions and information interaction processes in actual formation flight.

Each UAV achieves six-degree-of-freedom motion control through a cascaded PID control algorithm. To realize cooperative behavior among the UAVs, a consensus-based formation control strategy is adopted: the leader UAV provides reference signals, while follower UAVs adjust their states based on the leader's and neighbors' states. By designing local interaction rules, this algorithm ensures that the formation maintains a coordinated structure during tasks such as vertical takeoff and landing, straight-line flight, and turning, achieving robust cooperative control while maintaining communication topology flexibility.
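The text does not state the consensus law itself. As an illustration only, the sketch below uses a generic textbook first-order consensus update on formation errors along one axis, with node 0 as the leader; the gain, time step, offsets, and chain topology are all hypothetical and do not reproduce the paper's cascaded PID controller.

```python
def consensus_step(pos, offsets, adjacency, leader=0, gain=0.5, dt=0.1):
    """One Euler step of first-order consensus on formation errors (1-D)."""
    n = len(pos)
    new_pos = list(pos)
    for i in range(n):
        if i == leader:
            continue  # the leader tracks its own reference and is not updated here
        err_i = pos[i] - offsets[i]
        u = 0.0
        for j in range(n):
            if adjacency[i][j]:
                # Pull the formation error of UAV i toward that of neighbor j.
                u -= gain * (err_i - (pos[j] - offsets[j]))
        new_pos[i] = pos[i] + dt * u
    return new_pos

# Undirected chain 0 - 1 - 2; followers should settle 2 m and 4 m behind the leader.
A = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]
offsets = [0.0, -2.0, -4.0]
pos = [10.0, 3.0, 1.0]
for _ in range(1000):
    pos = consensus_step(pos, offsets, A)
```

Because each follower only reads the states of its graph neighbors, the same loop works unchanged if the adjacency matrix is swapped for a different connected topology.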

2.2 Actuator Faults and Wind Disturbances

During actual flight of quadrotor UAVs, external disturbances, internal aging, or signal faults can lead to various actuator faults that threaten flight stability. Wind disturbances alter the relative airspeed of the UAV, generating additional aerodynamic forces and moments that cause deviations in position and attitude. Therefore, it is necessary to model these factors and analyze their potential impacts.

2.2.1 Actuator Fault Modeling

The actuator system of a quadrotor consists of motors, electronic speed controllers, and propellers, jointly determining the thrust and maneuverability of the UAV. Three typical fault models are established:

(1) Efficiency Loss Fault due to long-term motor wear:

$$
T_i = (1 - \eta) T_{i, \text{normal}}.
$$

where $\eta \in [0, 1)$ is the loss coefficient, and $T_{i,\text{normal}}$ is the rated thrust.

(2) Random Fluctuation Fault due to unstable power supply voltage, control signal interference, or abnormal electronic speed controller:

$$
T_i = T_{i, \text{normal}} + \sigma \cdot \mathcal{N}(0, 1).
$$

where $\sigma$ is the intensity of the fluctuation, and $\mathcal{N}(0, 1)$ is the standard normal distribution.

(3) Thrust Lock Fault due to loss of control signal or electronic speed controller failure:

$$
T_i = T_{i, \text{lock}}.
$$

where $T_{i,\text{lock}}$ is the output thrust when the actuator is locked.
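The three fault models can be sketched as a single injection function applied to the nominal thrust of one actuator. The function name, the string labels, and the default arguments are illustrative assumptions.

```python
import random

def apply_fault(t_normal, kind, eta=0.0, sigma=0.0, t_lock=0.0, rng=random):
    """Return the faulty thrust T_i for one actuator given its nominal thrust."""
    if kind == "none":
        return t_normal
    if kind == "efficiency_loss":
        if not 0.0 <= eta < 1.0:
            raise ValueError("eta must lie in [0, 1)")
        return (1.0 - eta) * t_normal              # T_i = (1 - eta) * T_normal
    if kind == "random_fluctuation":
        return t_normal + sigma * rng.gauss(0.0, 1.0)  # additive Gaussian jitter
    if kind == "thrust_lock":
        return t_lock                              # output frozen at the locked value
    raise ValueError(f"unknown fault kind: {kind}")

rng = random.Random(0)  # seeded generator so the fluctuation sample is repeatable
loss = apply_fault(5.0, "efficiency_loss", eta=0.3)
jitter = apply_fault(5.0, "random_fluctuation", sigma=0.2, rng=rng)
locked = apply_fault(5.0, "thrust_lock", t_lock=1.5)
```

Passing the random generator explicitly keeps fault injection reproducible across simulation runs, which matters when the generated dataset serves as a benchmark.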

2.2.2 Wind Disturbance Modeling

Wind disturbances in the flight environment include steady wind, turbulence, and gust. Steady wind represents an airflow whose magnitude and direction remain essentially constant over the simulation time scale, representing the average wind field of the atmospheric environment and serving as the fundamental component of wind disturbances. In the North-East-Down (NED) coordinate system, the velocity vector of steady wind is defined as:

$$
\mathbf{v}_w^i = [w_n, w_e, w_d]^T.
$$

where $w_n, w_e, w_d$ are the wind speeds (m/s) in the north, east, and down directions, respectively.

Gust refers to a sudden increase in wind speed over a short period, typically lasting from seconds to minutes, representing an instantaneous pulsation of wind speed. A sinusoidal function is used to simulate the gust wind speed amplitude:

$$
v_g^i = v_{g_{\max}} \sin\left(\frac{\pi t}{2T_g}\right).
$$

where $v_{g_{\max}}$ is the peak gust wind speed, $T_g$ is the duration parameter, and the gust direction is given by the unit vector $\mathbf{u}_g$ in the NED coordinate system, i.e., $\mathbf{v}_g^i = v_g^i \mathbf{u}_g$.

Turbulence is a random component of the wind that varies in time and space, being a core factor causing random fluctuations in UAV flight parameters. It can be simulated using the Dryden model. The Dryden turbulence model is a recognized engineering-level turbulence simulation model in the aerospace field. Its core idea is to treat turbulence as Gaussian white noise passed through a spatial filter, which can accurately reproduce the statistical properties of low-altitude atmospheric turbulence. In the MATLAB simulation environment, the DrydenWindTurbulence function can be directly called to generate turbulence wind vectors conforming to the Dryden model. The function input parameters include reference wind speed, turbulence intensity, turbulence scale length, and simulation sampling rate. The output is the real-time turbulence wind vector in the NED coordinate system:

$$
\mathbf{v}_t^i = [u_t, v_t, w_t]^T.
$$

By summing the steady wind, gust, and turbulence components, the total wind field velocity vector in the NED coordinate system is obtained. It is then transformed to the body frame using a coordinate transformation matrix to obtain the total relative wind speed in the body frame:

$$
\mathbf{v}_{\text{wind}}^i = \mathbf{v}_w^i + \mathbf{v}_g^i + \mathbf{v}_t^i,
$$

$$
\mathbf{v}_{\text{wind}}^b = \mathbf{R}_b^i \mathbf{v}_{\text{wind}}^i.
$$

Through the unified modeling and coupled simulation of UAV dynamics evolution, formation cooperative control, environmental disturbances, and fault injection, it is possible to realistically reproduce the formation flight under the coupled conditions of wind disturbance and actuator faults.
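A minimal sketch of the wind composition follows. It assumes a ZYX (yaw-pitch-roll) Euler convention for the inertial-to-body rotation and takes the turbulence sample as a given input rather than generating it with a Dryden filter; the function names and numerical values are hypothetical.

```python
import math

def gust_speed(t, v_gmax, t_g):
    """Gust magnitude v_gmax * sin(pi * t / (2 * T_g))."""
    return v_gmax * math.sin(math.pi * t / (2.0 * t_g))

def total_wind_ned(v_steady, gust_mag, u_gust, v_turb):
    """Sum steady wind, gust (magnitude times unit direction), and turbulence."""
    return [s + gust_mag * u + tb for s, u, tb in zip(v_steady, u_gust, v_turb)]

def ned_to_body(v, phi, theta, psi):
    """Rotate a NED vector into the body frame (ZYX Euler angles assumed)."""
    cph, sph = math.cos(phi), math.sin(phi)
    cth, sth = math.cos(theta), math.sin(theta)
    cps, sps = math.cos(psi), math.sin(psi)
    rot = [
        [cth * cps, cth * sps, -sth],
        [sph * sth * cps - cph * sps, sph * sth * sps + cph * cps, sph * cth],
        [cph * sth * cps + sph * sps, cph * sth * sps - sph * cps, cph * cth],
    ]
    return [sum(rot[r][c] * v[c] for c in range(3)) for r in range(3)]

# 3 m/s steady wind toward north, a 4 m/s gust toward east evaluated at t = T_g,
# and one externally supplied turbulence sample.
v_ned = total_wind_ned([3.0, 0.0, 0.0], gust_speed(2.0, 4.0, 2.0),
                       [0.0, 1.0, 0.0], [0.1, -0.2, 0.0])
v_body = ned_to_body(v_ned, 0.0, 0.0, math.pi / 2)  # UAV heading east
```

With this gust formula the magnitude reaches its peak $v_{g_{\max}}$ at $t = T_g$; how the gust is handled after that instant is not specified in the text and is left out of the sketch.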

2.2.3 Impact Analysis

The observables used by the proposed fault diagnosis method are the time-series flight state parameters of the quadrotor UAV formation, covering multi-dimensional flight parameters of multiple UAVs within the formation. By physical meaning, they fall into three categories: kinematic parameters, dynamic parameters, and actuator control parameters. Kinematic parameters include position, velocity, attitude angles, and angular velocities; they reflect the spatial pose, motion trend, and formation coordination state, and represent the kinematic response of the UAV to external forces. Dynamic parameters are the accelerations, which reflect the instantaneous dynamic response of the UAV to external forces such as thrust and wind drag and depend on the combined effect of actuator thrust output and environmental disturbances. The actuator control parameter is the throttle command, which directly determines the target thrust output of the actuator.

The impact of wind disturbances on the observables is a globally coherent coupling, the extent of which is positively correlated with the type and intensity of the wind disturbance. This manifests as global consistent fluctuations in the observables. Steady wind couples slowly and linearly with kinematic parameters, causing a holistic bias in the formation’s pose. The amplitude of this bias increases linearly with the wind disturbance intensity. The flight control system will counteract the effect of steady wind by slightly adjusting the throttle commands. Gust and turbulence cause non-linear fast coupling with angular velocities and accelerations. The instantaneous impact of gust and the random perturbations of turbulence induce high-frequency, irregular fluctuations in accelerations and angular velocities. The flight control system will rapidly adjust throttle commands to compensate for attitude and position deviations.

The coupling characteristics of actuator faults with observables are related to the fault type, manifesting as local specific distortions in the observables. For example, an efficiency loss fault exhibits a strong mapping coupling with throttle commands; the flight control system gradually increases the throttle command for the faulty actuator to compensate for the actual thrust loss. This fault also shows a slow, gradual coupling with kinematic parameters; if the throttle compensation cannot offset the thrust loss, it will cause a gradual accumulation of pose deviations.

Under the combined action of faults and wind disturbances, the observables present a superimposed characteristic of global wind disturbance noise and local fault distortion. Therefore, the core of formation fault diagnosis requires extracting local residual features from observables superimposed with global biases, especially separating fault-specific changes from the global compensation fluctuations in throttle commands.
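The idea of separating a formation-wide bias from a local distortion can be illustrated with a deliberately simple baseline: subtracting the formation mean from each UAV's throttle command. This is only an intuition aid under idealized assumptions (identical wind effect on every UAV, made-up numbers); it is not the learned separation performed by the proposed network.

```python
def formation_residuals(values):
    """Subtract the formation-wide mean so that only local deviations remain."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]

def throttle_snapshot(wind_offset, fault_offset, faulty_index, n_uav=4, base=0.50):
    """Hypothetical throttle commands: common wind compensation plus one local fault."""
    throttle = [base + wind_offset for _ in range(n_uav)]
    throttle[faulty_index] += fault_offset
    return throttle

calm = formation_residuals(throttle_snapshot(0.00, 0.20, faulty_index=2))
windy = formation_residuals(throttle_snapshot(0.05, 0.20, faulty_index=2))
suspect = max(range(len(windy)), key=lambda i: abs(windy[i]))
```

The residuals are the same with and without the common wind offset, and the largest residual points at the faulty UAV. Real wind fields are neither uniform nor constant, which is why the paper relies on learned multi-scale features instead of a fixed subtraction.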

2.3 Research Objectives

To address the challenge of actuator fault diagnosis for quadrotor UAV formations under wind disturbance, and considering the coupling characteristics between observables, faults, and wind disturbances, this paper aims to propose a robust fault diagnosis method based on multi-scale feature extraction. The first goal is to resolve the mixing of global wind disturbance noise and local fault distortion in the observables, achieving effective separation of fault features from wind disturbance noise. The second is to remedy the insufficient mining of spatio-temporal features in flight data by accurately capturing spatial correlations among UAVs and the temporal evolution of faults. The third is to overcome the limitations of the single-task diagnostic paradigm by jointly performing faulty UAV localization, faulty actuator localization, fault type identification, and fault severity quantification, while balancing the training losses of the classification and regression tasks. The fourth is to exploit the differing importance of the observable parameters by adaptively screening fault-sensitive features and suppressing interference from wind disturbances.

Based on this, the specific performance targets expected to be achieved in this paper are: under no-wind conditions, the comprehensive accuracy of fault localization and type identification should not be less than 95%, and the mean absolute error of fault severity quantification should not exceed 0.05. The model should also demonstrate strong generalization performance and quantified uncertainty levels under unknown wind disturbances.

Fault Diagnosis Strategy Design

This section details the proposed model for actuator fault diagnosis in quadrotor formations under wind disturbance, including its core architecture and working mechanism.

3.1 Overall Architecture

Guided by the principles of feature purification, spatio-temporal modeling, hierarchical diagnosis, and adaptive task-weight balancing, this section designs a cascaded multi-task fault diagnosis method based on a variable-level attention mechanism and multi-scale feature extraction. The overall architecture of the model is outlined below.

The model uses a variable-level attention mechanism as the entry point. By adaptively assigning weights to each parameter, it strengthens the expression of fault-related features and weakens the redundant noise introduced by wind disturbances, providing input data with a high signal-to-noise ratio for the subsequent feature extraction stages.

On this basis, a spatio-temporal feature extraction module combining a multi-scale CNN and a GRU is constructed. The multi-scale CNN captures the spatial coupling among neighboring UAVs in the formation and the functional dependencies among different flight parameters, thereby distinguishing globally consistent fluctuations caused by wind disturbances from local abnormal features caused by faults. The GRU mines the temporal evolution of fault features, capturing the gradual process from weak accumulation to obvious manifestation, and ultimately forms a comprehensive feature vector that incorporates both spatial correlation and temporal dynamics.

Considering the practical needs of multiple diagnostic tasks and real-time requirements, the architecture adopts a cascaded multi-task structure that completes the four core diagnostic tasks in stages. First, the faulty UAV is localized; this result can be used directly for formation emergency response without waiting for subsequent tasks. Then, based on the faulty UAV localization result, the diagnostic scope is narrowed to localize the faulty actuator. Next, combined with the localization information from the first two steps, the specific fault type is identified. Finally, the fault severity is quantified. The output of each preceding task provides prior information for the subsequent tasks, significantly reducing the complexity of the diagnostic space.

Additionally, a GradNorm dynamic weight balancing mechanism is introduced to address the optimization imbalance caused by the different loss characteristics of the classification and regression tasks, and the uncertainty of the output is estimated using the Monte Carlo Dropout method.
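Monte Carlo Dropout keeps dropout active at inference time and repeats the forward pass, using the spread of the predictions as an uncertainty estimate. The sketch below shows the idea on a toy linear regressor; the weights, dropout rate, and number of passes are hypothetical and unrelated to the paper's network.

```python
import random
import statistics

def dropout(x, p, rng):
    """Inverted dropout: zero each input with probability p, rescale the rest."""
    keep = 1.0 - p
    return [xi / keep if rng.random() < keep else 0.0 for xi in x]

def predict_once(x, weights, bias, p, rng):
    """One stochastic forward pass of a toy linear severity regressor."""
    h = dropout(x, p, rng)
    return sum(w * hi for w, hi in zip(weights, h)) + bias

def mc_dropout(x, weights, bias, p=0.2, passes=200, seed=0):
    """Return the mean prediction and its standard deviation over many passes."""
    rng = random.Random(seed)
    samples = [predict_once(x, weights, bias, p, rng) for _ in range(passes)]
    return statistics.mean(samples), statistics.pstdev(samples)

features = [0.8, 0.1, 0.4, 0.3]          # hypothetical feature vector
weights = [0.5, -0.2, 0.3, 0.1]          # hypothetical trained weights
mean_sev, std_sev = mc_dropout(features, weights, bias=0.05)
```

A larger standard deviation signals that the prediction depends heavily on which units were dropped, which is the usual cue for low confidence under unseen wind conditions.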

3.2 Variable-Level Attention Mechanism

The variable-level attention mechanism addresses the significant differences in how sensitive individual flight parameters are to faults. By adaptively assigning parameter weights, it reduces the weights of parameters that are strongly affected by wind disturbances, strengthens fault-related features, and suppresses redundant information dominated by wind disturbances, providing high-quality input data for the subsequent multi-scale spatio-temporal feature extraction. The mechanism screens features along the parameter dimension, rather than along the sample or time dimension as in traditional attention mechanisms, which suits scenarios where fault features are hidden in specific parameters under wind disturbance.

Let the input data be the time-series flight parameter matrix of the target UAV and its neighbors, defined as $\mathbf{X} \in \mathbb{R}^{T \times K_{\max} \times F}$, where $T$ is the length of the sliding time window, $K_{\max}$ is the number of neighboring UAVs, and $F$ is the dimension of the flight parameters. To evaluate the global importance of each parameter, a global average pooling across the time and neighbor dimensions is first performed to generate an aggregated description for each parameter:

$$
g_f = \frac{1}{T K_{\max}} \sum_{t=1}^{T} \sum_{j=1}^{K_{\max}} x_{t,j,f}.
$$

where $x_{t,j,f}$ is the $f$-th flight parameter of the $j$-th neighboring UAV at the $t$-th time step, and $g_f$ is the global aggregated feature of the $f$-th parameter, reflecting the overall distribution of this parameter across time and space dimensions.

Then, a two-layer fully connected network learns the parameter weights, achieving adaptive strengthening of fault-sensitive parameters. The first layer uses a ReLU activation function to introduce non-linear mapping:

$$
w_{f1} = \text{ReLU}(W_{f1} g_f + b_{f1}).
$$

where $W_{f1} \in \mathbb{R}^{D \times 1}$ and $b_{f1} \in \mathbb{R}^{D}$ are learnable parameters, and $D$ is the hidden layer dimension.

The second layer uses a Sigmoid activation function to constrain the weights to the $[0, 1]$ interval, ensuring interpretability of the weights:

$$
w_{f2} = \text{Sigmoid}(W_{f2} w_{f1} + b_{f2}).
$$

where $W_{f2} \in \mathbb{R}^{1 \times D}$ and $b_{f2} \in \mathbb{R}$ are output layer parameters, and $w_{f2}$ is the final attention weight for the $f$-th parameter. A higher weight indicates a greater contribution of that parameter to fault diagnosis.

The learned attention weights are multiplied element-wise with the original input data to strengthen fault-sensitive features and suppress interfering features:

$$
x'_{t,j,f} = x_{t,j,f} \odot w_{f2}.
$$

where $\odot$ denotes element-wise multiplication, and $\mathbf{X}' \in \mathbb{R}^{T \times K_{\max} \times F}$ is the output feature matrix incorporating the variable-level attention, which serves as the input for the subsequent multi-scale spatio-temporal feature extractor.
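The four steps above (pooling, ReLU layer, Sigmoid layer, re-weighting) can be sketched with nested lists indexed as `X[t][j][f]`. The sketch assumes that the two fully connected layers are shared across all $F$ parameters, which the notation does not settle; the weights and the tiny input are hypothetical.

```python
import math

def variable_attention(X, W1, b1, W2, b2):
    """Per-parameter attention: pool over time and neighbors, score, re-weight."""
    T, K, F = len(X), len(X[0]), len(X[0][0])
    weights = []
    for f in range(F):
        # Global average pooling over the time and neighbor dimensions.
        g = sum(X[t][j][f] for t in range(T) for j in range(K)) / (T * K)
        # First layer: R^1 -> R^D with ReLU.
        hidden = [max(0.0, w * g + b) for w, b in zip(W1, b1)]
        # Second layer: R^D -> R^1 with Sigmoid, giving a weight in (0, 1).
        z = sum(w * h for w, h in zip(W2, hidden)) + b2
        weights.append(1.0 / (1.0 + math.exp(-z)))
    # Scale every entry of parameter f by its attention weight.
    X_att = [[[X[t][j][f] * weights[f] for f in range(F)]
              for j in range(K)] for t in range(T)]
    return X_att, weights

# Toy input with T = 2 time steps, K = 2 UAVs, F = 3 parameters.
X = [[[1.0, 0.2, -0.5], [0.8, 0.1, -0.4]],
     [[1.2, 0.3, -0.6], [0.9, 0.2, -0.3]]]
X_att, att_w = variable_attention(X, W1=[0.5, -0.3], b1=[0.0, 0.1],
                                  W2=[1.0, -1.0], b2=0.0)
```

The output keeps the $T \times K_{\max} \times F$ shape of the input, so it can be passed to the spatial feature extractor without any reshaping.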

3.3 Multi-Scale Spatio-Temporal Feature Extraction

The multi-scale spatial feature extraction module uses two-dimensional multi-scale convolutions for parallel feature extraction along the flight parameter dimension and the UAV neighbor dimension, corresponding to single-UAV local features and formation-wide global spatial features, respectively. This design allows the model to learn the underlying spatial patterns of faults and wind disturbances rather than surface features of fixed patterns, thus possessing a certain generalization capability. Multi-scale convolutions along the flight parameter dimension focus on extracting local correlation features among multiple observables within a single UAV, adaptively mining relationships among parameters such as throttle commands, accelerations, and attitude angles through convolution kernels of different scales. Multi-scale convolutions along the neighbor dimension extract global spatial correlation features among multiple UAVs within the formation, capable of identifying global consistent fluctuations due to wind disturbances and local specific deviations due to faults. Even under different wind disturbance patterns, the global coupling characteristics can still be represented. By comparing with neighbors, interference suppression is achieved, facilitating generalization of global features to unknown wind disturbances.

Using the variable-level attention output $\mathbf{X}'$ as input, the module employs parallel convolutional kernels to separate global consistent fluctuations caused by wind disturbances from local abnormal features caused by faults, providing precise spatial feature support for subsequent temporal modeling and multi-task diagnosis. The module uses a TimeDistributed wrapper structure, focusing on spatial feature mining while maintaining the integrity of the temporal dimension, adapting to the fine-grained feature extraction needs required for fault severity quantification.

First, two different sizes of convolution kernels, $3 \times 3$ and $5 \times 5$, are used in parallel to extract spatial features at different scales. The $3 \times 3$ kernel is used to capture local spatial correlations between the target UAV and neighboring UAVs, as well as functional dependencies among closely related parameters like angular velocity and attitude angle:

$$
H_1^{(t)} = \text{ReLU}(\text{Conv}_{3 \times 3}(X'^{(t)}) + b_1).
$$

where $X'^{(t)} \in \mathbb{R}^{K_{\max} \times F}$ is the spatial feature slice at the $t$-th time step, $\text{Conv}_{3 \times 3}$ denotes a $3 \times 3$ 2D convolution operation with 64 filters, $b_1$ is the bias term, and $H_1^{(t)} \in \mathbb{R}^{K_{\max} \times F \times 64}$ is the local-scale feature map.

The $5 \times 5$ kernel is used to expand the receptive field to capture global spatial coupling across neighbors, such as coordinated fluctuations of three or more UAVs in the formation, as well as full-parameter dimensional global associations:

$$
H_2^{(t)} = \text{ReLU}(\text{Conv}_{5 \times 5}(X'^{(t)}) + b_2).
$$

where the number of filters is also set to 64, and $H_2^{(t)} \in \mathbb{R}^{K_{\max} \times F \times 64}$ is the global-scale feature map.

Then, the local and global features are fused through channel-wise concatenation to form a multi-scale spatial feature representation:

$$
H_{\text{cat}}^{(t)} = \text{Concatenate}(H_1^{(t)}, H_2^{(t)}).
$$

where $\text{Concatenate}$ denotes channel-wise concatenation, and $H_{\text{cat}}^{(t)}$ is the fused multi-scale feature map.

Next, a $1 \times 1$ convolution is introduced to reduce feature dimensions, decrease computational load, and learn non-linear interactions between features of different scales:

$$
H_{\text{spat}}^{(t)} = \text{ReLU}(\text{Conv}_{1 \times 1}(H_{\text{cat}}^{(t)}) + b_3).
$$

where the $\text{Conv}_{1 \times 1}$ has 64 filters, and $H_{\text{spat}}^{(t)} \in \mathbb{R}^{K_{\max} \times F \times 64}$ is the final output spatial feature map. Concatenating the spatial feature maps of all time steps along the time dimension yields the multi-scale spatial feature matrix for each sliding window: $\mathbf{H}_{\text{spat}} \in \mathbb{R}^{T \times K_{\max} \times F \times 64}$, which serves as the input for the subsequent temporal feature extraction module.

In this module, all convolution operations are encapsulated using the TimeDistributed wrapper function, ensuring that spatial features are processed independently for each time step without disrupting the continuity of the temporal dimension, preserving complete information for the temporal feature extraction module to capture fault evolution. The $3 \times 3$ convolution uses Same Padding with a padding of 1, and the $5 \times 5$ convolution uses Same Padding with a padding of 2, ensuring that the spatial dimensions of the feature maps remain consistent with the input, avoiding loss of neighbor position information and parameter correlation information. The ReLU activation function introduces non-linearity and alleviates the vanishing gradient problem. A BatchNorm layer is added after the $1 \times 1$ convolution to accelerate model convergence and improve generalization.
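The structure of one time slice (two parallel same-padded convolutions, channel concatenation, then a $1 \times 1$ mixing convolution) can be sketched in plain Python. To stay small, the sketch uses one filter per branch instead of 64, omits BatchNorm and the bias of the mixing layer, and stores feature maps channel-first; the kernels are hypothetical averaging filters.

```python
def conv2d_same(x, kernel, bias=0.0):
    """Single-channel 2-D convolution with zero 'same' padding followed by ReLU."""
    k = len(kernel)
    pad = k // 2                      # 1 for 3x3, 2 for 5x5
    rows, cols = len(x), len(x[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            acc = bias
            for i in range(k):
                for j in range(k):
                    rr, cc = r + i - pad, c + j - pad
                    if 0 <= rr < rows and 0 <= cc < cols:  # zero padding outside
                        acc += kernel[i][j] * x[rr][cc]
            out[r][c] = max(0.0, acc)
    return out

def multi_scale_block(x, kernels3, kernels5, mix):
    """Parallel 3x3 and 5x5 branches, concatenation, then a 1x1 convolution."""
    branches = [conv2d_same(x, k) for k in kernels3]
    branches += [conv2d_same(x, k) for k in kernels5]   # channel concatenation
    rows, cols = len(x), len(x[0])
    return [[[max(0.0, sum(w * ch[r][c] for w, ch in zip(coeffs, branches)))
              for c in range(cols)] for r in range(rows)]
            for coeffs in mix]        # one output channel per row of `mix`

mean3 = [[1.0 / 9.0] * 3 for _ in range(3)]
mean5 = [[1.0 / 25.0] * 5 for _ in range(5)]
x_slice = [[1.0] * 6 for _ in range(5)]               # K_max = 5 UAVs, F = 6 parameters
h_spat = multi_scale_block(x_slice, [mean3], [mean5], mix=[[0.5, 0.5], [1.0, -1.0]])
```

Each output channel keeps the $K_{\max} \times F$ layout of the input slice, which is the property the same-padding choice is meant to guarantee.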

The temporal feature extraction module takes the output of the multi-scale spatial feature extraction module as input. Its goal is to capture the temporal evolution patterns of fault features, such as the gradual process of actuator efficiency loss and the periodic fluctuations of output jitter, while suppressing the random temporal interference of wind disturbances. This provides dynamic temporal feature support for subsequent cascaded multi-task diagnosis, especially for fault severity quantification. The module uses a GRU, which is structurally lighter than the traditional LSTM.

First, the input features are dimensionally compressed. The output of the multi-scale spatial feature extraction module is $\mathbf{H}_{\text{spat}}$. To adapt to the input format of the GRU, the spatial dimensions ($K_{\max} \times F$) and the channel dimension ($c = 64$) need to be compressed:

$$
x^{(t)} = \text{Linear}(\text{Flatten}(H_{\text{spat}}^{(t)})).
$$

where $x^{(t)} \in \mathbb{R}^{D}$ is the compressed feature vector at the $t$-th time step, obtained through a fully connected layer, $\text{Flatten}$ denotes the dimension flattening operation, and $\text{Linear}$ denotes the fully connected layer. Finally, the input for the temporal modeling module is the temporal feature matrix $\mathbf{X}_{\text{seq}} = [x^{(1)}, x^{(2)}, \dots, x^{(T)}] \in \mathbb{R}^{T \times D}$.

After inputting this matrix into the previously described GRU module, the hidden state $\mathbf{h}_T \in \mathbb{R}^{H}$ of the last time step is taken as the final temporal feature output, denoted as $\mathbf{H}_{\text{temp}} = \mathbf{h}_T$. This feature integrates the long and short-term time dependencies and dynamic evolution patterns of the fault and will be input together with the spatial features into the subsequent cascaded multi-task diagnosis network.
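For reference, one commonly used formulation of the GRU recurrence is given below, written with the symbols of this section. The weight names $W_\ast$, $U_\ast$, $b_\ast$ and the gate symbols $z_t$, $r_t$ are introduced here for illustration; some libraries swap the roles of $z_t$ and $1 - z_t$ in the last line, so this should be read as a sketch rather than the exact variant used in the experiments.

$$
\begin{aligned}
z_t &= \sigma\left(W_z x^{(t)} + U_z \mathbf{h}_{t-1} + b_z\right), \\
r_t &= \sigma\left(W_r x^{(t)} + U_r \mathbf{h}_{t-1} + b_r\right), \\
\tilde{\mathbf{h}}_t &= \tanh\left(W_h x^{(t)} + U_h (r_t \odot \mathbf{h}_{t-1}) + b_h\right), \\
\mathbf{h}_t &= (1 - z_t) \odot \mathbf{h}_{t-1} + z_t \odot \tilde{\mathbf{h}}_t.
\end{aligned}
$$

Here $\sigma$ is the Sigmoid function, $z_t$ is the update gate, $r_t$ is the reset gate, and $\mathbf{h}_0$ is typically initialized to zero. Iterating from $t = 1$ to $t = T$ over $\mathbf{X}_{\text{seq}}$ yields the final state $\mathbf{h}_T$ used as $\mathbf{H}_{\text{temp}}$.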

In this module, the feature compression strategy uses flattening followed by a fully connected layer, rather than simple global pooling. This ensures that the correlation information among neighbors, parameters, and channels in the spatial features is preserved, avoiding information loss in the temporal modeling input. This module, together with the preceding variable-level attention mechanism and the multi-scale spatial feature extraction module, forms a three-level feature extraction chain of parameter reinforcement, spatial association, and temporal evolution. It progressively strips away wind disturbance interference from the raw flight data, focusing on the core features of the fault. This provides high-purity, multi-dimensional feature support for the subsequent cascaded multi-task diagnosis network.

3.4 Cascaded Multi-Task Diagnosis

The cascaded multi-task diagnosis network is the decision-making unit of the proposed method. Its goal is to achieve collaborative optimization of multiple tasks under the progressive logic of faulty UAV localization, faulty actuator localization, fault type identification, and fault severity quantification, while meeting the dual requirements of real-time localization output and accurate quantification. The network uses the output of preceding tasks as prior information for subsequent tasks, addressing the issues of strong multi-task coupling and feature conflict. Compared to traditional parallel multi-task architectures, it utilizes feature resources more efficiently and balances the optimization objectives of classification and regression tasks.

For the four core diagnostic tasks of formation fault diagnosis under wind disturbance, a cascaded order from coarse to fine is designed. The logic is as follows: The most urgent task of faulty UAV localization is prioritized, with the output directly used for formation emergency response without waiting for subsequent tasks. Based on the faulty UAV localization result, the diagnosis focuses on the target UAV’s actuators to achieve faulty actuator localization, narrowing the diagnostic scope. Combining the UAV and actuator localization information, the fault type is accurately identified, clarifying the fault mode. Finally, utilizing the decision information from the first three tasks, the fault severity quantification is refined to meet fine-grained maintenance needs.

The first-level faulty UAV localization module uses only the previously obtained spatio-temporal feature $\mathbf{H}_{\text{temp}}$ as input and performs rapid inference through a two-layer MLP:

$$
o_{1,1} = \text{ReLU}(W_{1,1} \mathbf{H}_{\text{temp}} + b_{1,1}),
$$

$$
o_{1,2} = \text{ReLU}(W_{1,2} o_{1,1} + b_{1,2}),
$$

$$
\mathbf{y}_{\text{UAV}} = \text{Softmax}(W_{1,3} o_{1,2} + b_{1,3}),
$$

$$
\mathcal{L}_{\text{UAV}} = -\sum_{i=1}^{N} y^{*}_{\text{UAV},i} \log(y_{\text{UAV},i}).
$$

where $W_{1,1}, W_{1,2}, W_{1,3}$ are weight matrices, $b_{1,1}, b_{1,2}, b_{1,3}$ are bias vectors, $o_{1,1}, o_{1,2}$ are intermediate hidden layer outputs, $\mathbf{y}_{\text{UAV}} \in [0, 1]^N$ is the probability distribution of faulty UAVs (taking the UAV with the highest probability as the faulty one, outputting the localization result directly), and $y^{*}_{\text{UAV},i}$ is the ground truth label for the faulty UAV, encoded using one-hot encoding.
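As a hedged illustration of the first-level head, the sketch below implements two ReLU hidden layers, a Softmax output, and the cross-entropy loss in plain Python. The layer sizes, the random initialization, and the example label are assumptions made only for this example; the small `eps` inside the logarithm is a numerical safeguard that does not appear in the equations.

```python
import math
import random


def dense(x, weight, bias):
    """Affine map W x + b for a weight matrix stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def softmax(logits):
    """Numerically stable softmax (subtracts the maximum logit first)."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(y_true, y_pred, eps=1e-12):
    """Cross-entropy for a one-hot label; eps guards against log(0)."""
    return -sum(t * math.log(p + eps) for t, p in zip(y_true, y_pred))


def init_layer(rng, n_out, n_in):
    """Hypothetical small random initialization, for illustration only."""
    weight = [[rng.gauss(0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weight, [0.0] * n_out


def uav_head(h_temp, layers):
    """o_{1,1}, o_{1,2}, then the Softmax output y_UAV."""
    (w1, b1), (w2, b2), (w3, b3) = layers
    o1 = relu(dense(h_temp, w1, b1))
    o2 = relu(dense(o1, w2, b2))
    return softmax(dense(o2, w3, b3))


rng = random.Random(0)
H, HIDDEN, N = 8, 16, 5  # assumed feature size, hidden width, formation size
layers = [init_layer(rng, HIDDEN, H), init_layer(rng, HIDDEN, HIDDEN), init_layer(rng, N, HIDDEN)]
h_temp = [rng.gauss(0, 1) for _ in range(H)]

y_uav = uav_head(h_temp, layers)
faulty_uav = max(range(N), key=lambda i: y_uav[i])  # index with highest probability
label = [1.0 if i == 2 else 0.0 for i in range(N)]  # hypothetical ground truth
loss_uav = cross_entropy(label, y_uav)
```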

The second-level faulty actuator localization module takes the spatio-temporal feature $\mathbf{H}_{\text{temp}}$ and the first-level output $\mathbf{y}_{\text{UAV}}$ as input, using the UAV localization result to constrain the actuator diagnostic scope. It likewise performs diagnosis through a two-layer MLP. To simplify the expressions, let $\mathcal{C}(\cdot)$ denote the concatenation operation, and let $z_2, z_3, z_4$ denote the concatenated input features of the second-level, third-level, and fault severity quantification modules, respectively, so that $z_2 = \mathcal{C}(\mathbf{H}_{\text{temp}}, \mathbf{y}_{\text{UAV}})$. The faulty actuator localization module is represented as:

$$
o_{2,1} = \text{ReLU}(W_{2,1} z_2 + b_{2,1}),
$$

$$
o_{2,2} = \text{ReLU}(W_{2,2} o_{2,1} + b_{2,2}),
$$

$$
\mathbf{y}_{\text{act}} = \text{Softmax}(W_{2,3} o_{2,2} + b_{2,3}),
$$

$$
\mathcal{L}_{\text{act}} = -\sum_{j=1}^{5} y^{*}_{\text{act},j} \log(y_{\text{act},j}).
$$

The third-level fault type identification module further combines the information of the faulty actuator to accurately match the fault mode:

$$
o_{3,1} = \text{ReLU}(W_{3,1} z_3 + b_{3,1}),
$$

$$
o_{3,2} = \text{ReLU}(W_{3,2} o_{3,1} + b_{3,2}),
$$

$$
\mathbf{y}_{\text{class}} = \text{Softmax}(W_{3,3} o_{3,2} + b_{3,3}),
$$

$$
\mathcal{L}_{\text{class}} = -\sum_{k=1}^{4} y^{*}_{\text{class},k} \log(y_{\text{class},k}).
$$

Finally, all preceding decision information is used as input to fit the fault severity through a dual-branch MLP network:

$$
o_{4,1} = \text{ReLU}(W_{4,1} z_4 + b_{4,1}),
$$

$$
o_{5,1} = \text{ReLU}(W_{5,1} z_4 + b_{5,1}),
$$

$$
o_{4,2} = \text{ReLU}(W_{4,2} o_{4,1} + b_{4,2}),
$$

$$
o_{5,2} = \text{ReLU}(W_{5,2} o_{5,1} + b_{5,2}),
$$

$$
y_{\text{loss}} = W_{4,3} o_{4,2} + b_{4,3},
$$

$$
y_{\text{lock}} = W_{5,3} o_{5,2} + b_{5,3},
$$

$$
\mathcal{L}_{\text{loss}} = |y^{*}_{\text{loss}} - y_{\text{loss}}|,
$$

$$
\mathcal{L}_{\text{lock}} = |y^{*}_{\text{lock}} - y_{\text{lock}}|.
$$
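The cascade can be read as a chain of concatenations in which each level sees the shared feature plus every earlier decision. The sketch below only builds the inputs and the absolute-error severity loss. The text states the composition of $z_2$; the exact contents of $z_3$ and $z_4$ shown here are an assumption inferred from the description, and the vector sizes are hypothetical.

```python
def concat(*vectors):
    """Concatenation operator C(.) for plain Python lists."""
    out = []
    for v in vectors:
        out.extend(v)
    return out


def build_cascade_inputs(h_temp, y_uav, y_act, y_class):
    """Build z2, z3, z4 from the shared feature and earlier-level outputs."""
    z2 = concat(h_temp, y_uav)                  # stated in the text
    z3 = concat(h_temp, y_uav, y_act)           # assumed composition
    z4 = concat(h_temp, y_uav, y_act, y_class)  # assumed composition
    return z2, z3, z4


def l1_loss(y_true, y_pred):
    """Absolute-error loss used by both severity branches."""
    return abs(y_true - y_pred)


# Hypothetical sizes: H = 8 features, 5 UAVs, 5 actuator classes, 4 fault classes.
h_temp = [0.0] * 8
y_uav = [0.2] * 5
y_act = [0.2] * 5
y_class = [0.25] * 4
z2, z3, z4 = build_cascade_inputs(h_temp, y_uav, y_act, y_class)
```

Passing the Softmax probabilities forward, rather than hard argmax decisions, is one possible design that keeps the cascade differentiable; the text does not say which form is used.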

To enhance the reliability and engineering safety decision value of fault diagnosis results, an uncertainty quantification module is introduced at the output of each sub-task of the cascaded multi-task network, implementing the Monte Carlo Dropout (MC-Dropout) method. After adding this module, the model must be tested with the Dropout layers enabled. For the same input observable, M independent forward propagations are performed. For classification tasks, the average of the M class probabilities is taken as the final diagnosis result, and the information entropy is used to represent the uncertainty of the diagnosis. A higher entropy value indicates a higher degree of fuzziness in the result:

$$
p_i = \frac{1}{M} \sum_{m=1}^{M} p_{i,m},
$$

$$
H = -\sum_{i=1}^{N} p_i \log_2(p_i).
$$

where $p_i$ is the mean probability, $H$ is the information entropy, $p_{i,m}$ is the output probability for class $i$ in the $m$-th sampling, and $N$ is the number of classes for the classification task.
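A small sketch of the classification-side uncertainty estimate follows. The three probability vectors stand in for $M$ stochastic forward passes and are invented for illustration; the convention $0 \log_2 0 = 0$ is an assumption added so that exact zeros do not raise an error.

```python
import math


def mc_mean_probs(samples):
    """Average M probability vectors element-wise (p_i in the text)."""
    m = len(samples)
    n = len(samples[0])
    return [sum(s[i] for s in samples) / m for i in range(n)]


def entropy_bits(probs):
    """Shannon entropy in bits; terms with p = 0 are skipped (0 * log 0 := 0)."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)


# Hypothetical outputs of M = 3 forward passes with Dropout left enabled.
samples = [
    [0.90, 0.05, 0.05],
    [0.80, 0.10, 0.10],
    [0.70, 0.20, 0.10],
]
p_mean = mc_mean_probs(samples)
uncertainty = entropy_bits(p_mean)
```

The entropy is 0 for a one-hot prediction and reaches $\log_2 N$ for a uniform distribution over $N$ classes, which gives a natural scale for thresholding.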

For the fault severity quantification task, the mean of the quantified values from multiple outputs is taken as the final severity estimate, and the standard deviation is used to represent the quantification uncertainty:

$$
\bar{y} = \frac{1}{M} \sum_{m=1}^{M} y_m,
$$

$$
\sigma = \left[ \frac{1}{M-1} \sum_{m=1}^{M} (y_m - \bar{y})^2 \right]^{1/2}.
$$

where $\bar{y}$ is the mean of the M quantification outputs, $\sigma$ is the standard deviation, and $y_m$ is the quantified fault severity in the m-th sampling.
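For the regression branch, the same statistics can be computed with Python's standard `statistics` module, whose `stdev` uses the $M-1$ denominator of the formula above. The sample values are hypothetical, and at least two samples are required.

```python
import statistics


def mc_severity(samples):
    """Return (mean, sample standard deviation) of M severity estimates."""
    return statistics.mean(samples), statistics.stdev(samples)


# Hypothetical severity outputs from M = 5 stochastic forward passes.
severity_samples = [0.31, 0.29, 0.30, 0.32, 0.28]
severity_mean, severity_std = mc_severity(severity_samples)
```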

Since the cascaded multi-task network includes both classification and regression tasks, the loss characteristics of the two task types are fundamentally different. Classification losses are typically concentrated within the $[0, 1]$ interval and converge relatively quickly. In contrast, regression losses depend on the distribution of the ground truth fault severity values (for example, the mean absolute error for efficiency loss quantification might reach 0.1 to 0.2) and converge more slowly. If fixed weights or a simple sum of task losses is used, optimization can easily become imbalanced: either classification tasks dominate the model optimization, causing underfitting of regression tasks and poor quantification accuracy, or regression tasks dominate, causing overfitting of classification tasks and high misclassification rates. Furthermore, dynamic changes in wind disturbance intensity can exacerbate the fluctuation of task losses, and fixed weights cannot adapt to these dynamics. Additionally, decision errors from preceding tasks in the cascaded structure may propagate to subsequent diagnostic tasks, which requires dynamic adjustment of the loss weights to suppress error diffusion. Therefore, the GradNorm dynamic weight balancing mechanism is introduced for the cascaded multi-task fault diagnosis network. Its core purpose is to adaptively adjust the loss weight of each task so that the gradient norms of all tasks remain consistent, ensuring that classification and regression tasks converge synchronously.

Let the initial task weights of the multi-task network be $\mathbf{w} = [w_{\text{UAV}}, w_{\text{act}}, w_{\text{class}}, w_{\text{loss}}, w_{\text{lock}}]$. The combined loss formula is:

$$
\mathcal{L}_{\text{total}} = w_{\text{UAV}} \mathcal{L}_{\text{UAV}} + w_{\text{act}} \mathcal{L}_{\text{act}} + w_{\text{class}} \mathcal{L}_{\text{class}} + w_{\text{loss}} \mathcal{L}_{\text{loss}} + w_{\text{lock}} \mathcal{L}_{\text{lock}}.
$$

For simplified representation, denote the losses of the five tasks as $\mathcal{L}_1, \mathcal{L}_2, \mathcal{L}_3, \mathcal{L}_4, \mathcal{L}_5$ and the corresponding task weights as $w_1, w_2, w_3, w_4, w_5$, with $M = 5$ tasks in total (in this subsection $M$ denotes the number of tasks rather than the Monte Carlo sampling count). To avoid the influence of initial weight bias on model convergence, an inverse normalization initialization strategy based on loss values is adopted, making the initial weights negatively correlated with the task loss values:

$$
w^{(0)}_m = \frac{(\mathcal{L}_m)^{-1}}{\sum_{k=1}^{M} (\mathcal{L}_k)^{-1}}.
$$

where $\mathcal{L}_m$ is the average loss of task $m$ in the initial training phase, $w^{(0)}_m \in [0, 1]$ and satisfies $\sum_{m=1}^{M} w^{(0)}_m = 1$. This design ensures that the loss contributions of each task are balanced in the initial stage.
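The initialization rule is easy to check numerically. In the sketch below the initial loss values are hypothetical; the point is that the weighted contributions $w^{(0)}_m \mathcal{L}_m$ come out equal for every task.

```python
def init_task_weights(initial_losses):
    """Inverse-loss initialization: w_m proportional to 1 / L_m, normalized to sum to 1."""
    inverse = [1.0 / loss for loss in initial_losses]
    total = sum(inverse)
    return [v / total for v in inverse]


# Hypothetical average initial losses for the five tasks
# (UAV, actuator, fault type, efficiency-loss severity, thrust-lock severity).
initial_losses = [1.6, 1.6, 1.4, 0.15, 0.10]
weights0 = init_task_weights(initial_losses)
contributions = [w * loss for w, loss in zip(weights0, initial_losses)]
```

Every entry of `contributions` equals the reciprocal of the sum of inverse losses, so no task dominates the first update purely because of its loss scale.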

During iterative model training, the gradient of the combined loss with respect to the parameters of the preceding shared feature extraction part is first calculated. Then, the weighted gradient of each task is decomposed, and its L2 norm is the gradient norm of task $m$:

$$
G_m = \|\nabla_{\Theta} (w_m \mathcal{L}_m)\|_2 = \|w_m \nabla_{\Theta} \mathcal{L}_m\|_2.
$$

where $G_m$ reflects the contribution strength of task $m$ to parameter updates. A larger value indicates a more significant influence of the task on the current parameter optimization. $\Theta$ represents the parameters of the feature extraction part, $\nabla_{\Theta} \mathcal{L}_m$ is the gradient of task $m$’s loss with respect to the shared parameters, and $\nabla_{\Theta} (w_m \mathcal{L}_m)$ is the weighted gradient of that sub-task.

To progressively converge the gradient norms of all sub-tasks during the iterative process, a target gradient norm $\hat{G}_m$ is defined, which is related to the current gradient norm of the task and the average gradient norm of all sub-tasks:

$$
\bar{G} = \frac{1}{M} \sum_{m=1}^{M} G_m,
$$

$$
\hat{G}_m = \bar{G} \left( \frac{\mathcal{L}_m}{\mathcal{L}^{(0)}_m} \right)^{\alpha}.
$$

where $\bar{G}$ represents the average gradient norm of all tasks at the current iteration, $\mathcal{L}^{(0)}_m$ is the initial average loss of task $m$, and $\alpha = 0.1$ is a hyperparameter controlling the impact of the loss value on the target gradient norm, thereby avoiding target offset caused by excessive loss fluctuation. The term $\left( \frac{\mathcal{L}_m}{\mathcal{L}^{(0)}_m} \right)^{\alpha}$ allows underfitting tasks to obtain a larger target gradient norm and be optimized preferentially.
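The gradient norms and their targets can be sketched as follows. The per-task gradient vectors, losses, and weights are toy values chosen so that the arithmetic is easy to follow; in a real network the gradients would come from automatic differentiation with respect to the shared parameters $\Theta$.

```python
import math


def l2_norm(vec):
    return math.sqrt(sum(v * v for v in vec))


def task_grad_norms(weights, task_grads):
    """G_m = || w_m * grad_Theta(L_m) ||_2 for each task."""
    return [l2_norm([w * g for g in grad]) for w, grad in zip(weights, task_grads)]


def target_grad_norms(grad_norms, losses, initial_losses, alpha=0.1):
    """G_hat_m = mean(G) * (L_m / L_m^(0)) ** alpha."""
    g_bar = sum(grad_norms) / len(grad_norms)
    return [g_bar * (loss / loss0) ** alpha for loss, loss0 in zip(losses, initial_losses)]


# Toy example with two tasks and a two-dimensional shared parameter vector.
weights = [0.5, 0.5]
task_grads = [[3.0, 4.0], [0.0, 2.0]]  # hypothetical unweighted gradients
losses = [0.5, 1.0]                    # task 0 has halved its loss, task 1 has not moved
initial_losses = [1.0, 1.0]

grad_norms = task_grad_norms(weights, task_grads)              # [2.5, 1.0]
targets = target_grad_norms(grad_norms, losses, initial_losses)
```

Here the task whose loss has not decreased keeps the full average norm as its target, while the faster task receives a slightly smaller one, matching the stated preference for underfitting tasks.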

During the dynamic update of the task weights, the weights are updated by minimizing the mean squared error between the logarithms of the current and target gradient norms. The loss function for the weight update is defined as:

$$
\mathcal{L}_w = \frac{1}{M} \sum_{m=1}^{M} ( \log G_m - \log \hat{G}_m )^2.
$$

The derivative of $\mathcal{L}_w$ with respect to $w_m$ is taken, and the weights are updated using gradient descent:

$$
\tilde{w}^{(t+1)}_m = w^{(t)}_m - \eta_w \nabla_{w_m} \mathcal{L}_w.
$$

where $\eta_w$ is the learning rate for the weight update, $t$ is the iteration number, and $\tilde{w}^{(t+1)}_m$ is the temporary weight before normalization.

After the update, the weights are normalized to ensure their sum is 1, preventing abnormal magnitude in the total loss:

$$
w^{(t+1)}_m = \frac{\tilde{w}^{(t+1)}_m}{\sum_{k=1}^{M} \tilde{w}^{(t+1)}_k}.
$$
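One weight update can be sketched under two simplifying assumptions that the text does not state: the targets $\hat{G}_m$ are treated as constants when differentiating, and $G_m = w_m \|\nabla_{\Theta} \mathcal{L}_m\|_2$, so that $\partial \mathcal{L}_w / \partial w_m = \frac{2}{M} (\log G_m - \log \hat{G}_m) / w_m$. The learning rate and the small positive floor on the weights are likewise illustrative choices.

```python
import math


def gradnorm_weight_step(weights, raw_grad_norms, targets, lr=0.025, floor=1e-8):
    """One gradient-descent step on L_w followed by normalization to sum 1.

    raw_grad_norms[m] is ||grad_Theta L_m||_2 without the task weight;
    targets are treated as constants (an assumption of this sketch).
    """
    m = len(weights)
    updated = []
    for w, g, g_hat in zip(weights, raw_grad_norms, targets):
        log_gap = math.log(w * g) - math.log(g_hat)
        grad_w = (2.0 / m) * log_gap / w
        # The floor keeps weights positive; it is a safeguard added for this sketch.
        updated.append(max(w - lr * grad_w, floor))
    total = sum(updated)
    return [u / total for u in updated]


# Toy example: task 0 currently has a larger gradient norm than its target,
# task 1 a smaller one, so weight should shift from task 0 to task 1.
weights = [0.5, 0.5]
raw_grad_norms = [5.0, 2.0]  # hypothetical unweighted norms, giving G = [2.5, 1.0]
targets = [1.75, 1.75]
new_weights = gradnorm_weight_step(weights, raw_grad_norms, targets)
```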

The GradNorm mechanism keeps the gradient norms of the sub-tasks balanced and dynamically updates the task weights $\mathbf{w}$, achieving adaptive collaborative optimization of multiple tasks. It is one of the key techniques for improving classification accuracy and quantification precision of fault diagnosis under wind disturbance, providing stable and efficient training support for the cascaded multi-task diagnosis network.

Experimental Results and Analysis

To verify the effectiveness of the proposed method for actuator fault diagnosis in quadrotor China UAV drone formations under wind disturbance, a comprehensive simulation dataset was constructed based on a quadrotor formation simulation platform, covering multiple wind speeds and multiple fault scenarios. The comprehensive advantages of the model in fault localization, fault type identification, and fault severity quantification were demonstrated through comparative experiments with multiple advanced methods. The contributions of core modules such as the variable-level attention mechanism and multi-scale convolution were clarified through ablation experiments.

4.1 Dataset Generation

To comprehensively evaluate the robustness and generalization ability of the fault diagnosis method, a comprehensive simulation dataset covering multiple flight scenarios, environmental disturbances, and fault modes was constructed. A multi-dimensional combined condition design strategy was adopted, covering four key factors: environmental disturbance, fault characteristics, flight state, and formation scale, forming a high-coverage complex test scenario set.

To reproduce the impact of wind disturbances in actual flight, a composite wind field model including steady wind, gust, and turbulence was used. Steady wind was set at four intensity levels: no wind (0 m/s), light wind disturbance (3 m/s), moderate wind disturbance (6 m/s), and severe wind disturbance (9 m/s). Gust and turbulence intensities were configured at 0%, 50%, and 100% of the corresponding steady wind speed, simulating wind environments of varying complexity through graded combinations. Three typical motion patterns were selected for flight trajectories: straight-line flight, hovering, and curved flight. The quadrotor formation adopted a uniform diamond-shaped base topology with a fixed relative spacing of 10 meters between all UAVs, using three formation scales: 5, 15, and 25 UAVs. Fault injection covered the three types of actuator faults introduced earlier, with parametric configuration of the faulty actuator and fault severity to ensure diversity and authenticity of fault scenarios. In each simulation, one UAV and a corresponding faulty actuator were randomly selected, and only one fault mode was included per experiment.

The simulation time step was set to 0.01 s, with a total simulation duration of 50 s. A total of 3757 flight simulations were run. Data collection selected 19 key flight parameters, as shown in Table 1, covering core kinematic and dynamic parameters such as velocity, acceleration, attitude angles, angular velocities, and actuator throttle commands, comprehensively reflecting the UAV’s flight state and actuator performance. To align with practical engineering scenarios, the original 100 Hz sampling data was downsampled to 20 Hz, reducing data redundancy while preserving key features. Dataset details are shown in Table 2.

Symbol	Definition	Unit
$\Delta x$	Relative position on x-axis	m
$\Delta y$	Relative position on y-axis	m
$\Delta z$	Relative position on z-axis	m
$v_x$	Velocity on x-axis	m/s
$v_y$	Velocity on y-axis	m/s
$v_z$	Velocity on z-axis	m/s
$a_x$	Acceleration on x-axis	m/s$^2$
$a_y$	Acceleration on y-axis	m/s$^2$
$a_z$	Acceleration on z-axis	m/s$^2$
$\phi$	Roll angle	rad
$\theta$	Pitch angle	rad
$\psi$	Yaw angle	rad
$p$	Angular velocity on x-axis	rad/s
$q$	Angular velocity on y-axis	rad/s
$r$	Angular velocity on z-axis	rad/s
$\delta_1$	Throttle command for actuator 1	%
$\delta_2$	Throttle command for actuator 2	%
$\delta_3$	Throttle command for actuator 3	%
$\delta_4$	Throttle command for actuator 4	%

Table 1: Selected flight parameters.

Dataset Name	Wind Speed	Number of Sampling Points
A	0 m/s	946000
B	3 m/s	951000
C	6 m/s	935000
D	9 m/s	925000

Table 2: Dataset information.
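The figures in Table 2 can be cross-checked against the simulation settings. The sketch below assumes that downsampling is plain decimation (every fifth sample); the text does not specify the downsampling method, and the helper name is hypothetical.

```python
SIM_DURATION_S = 50     # total simulation duration per run
RAW_RATE_HZ = 100       # 0.01 s simulation step
TARGET_RATE_HZ = 20     # rate after downsampling
NUM_SIMULATIONS = 3757  # total number of flight simulations


def downsample(signal, raw_rate_hz, target_rate_hz):
    """Keep every k-th sample, k = raw / target (plain decimation, an assumption)."""
    step = raw_rate_hz // target_rate_hz
    return signal[::step]


raw_run = list(range(SIM_DURATION_S * RAW_RATE_HZ))  # 5000 raw samples per run
points_per_run = len(downsample(raw_run, RAW_RATE_HZ, TARGET_RATE_HZ))

table2 = {"A": 946000, "B": 951000, "C": 935000, "D": 925000}
total_points = sum(table2.values())
runs_per_dataset = {name: count // points_per_run for name, count in table2.items()}
```

Each run yields 1000 sampling points, so 3757 runs give 3,757,000 points, which matches the sum of the four rows of Table 2. If every run contributes its full 1000 points, the datasets correspond to 946, 951, 935, and 925 runs.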

4.2 Experimental Configuration and Evaluation Methods

The experimental hardware used in this study included a 13th Gen Intel(R) Core(TM) i7-13700KF CPU, an NVIDIA GeForce RTX 4060 Ti GPU with 8GB VRAM, and 64GB of RAM.

Combining the characteristics of classification and regression tasks, multiple evaluation metrics were used for quantitative analysis of diagnosis results. The metrics for classification tasks included Accuracy and Recall. The metric for regression tasks was Mean Absolute Error (MAE). The definitions of the three evaluation metrics are as follows:

$$
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN},
$$

$$
\text{Recall} = \frac{TP}{TP + FN},
$$

$$
\text{MAE} = \frac{1}{M} \sum_{i=1}^{M} |y^*_i - y_i|.
$$

where $TP$ is the number of correctly identified fault samples, $TN$ is the number of correctly identified normal samples, $FP$ is the number of false alarms, $FN$ is the number of missed detections, $y^*_i$ is the ground truth fault severity, $y_i$ is the predicted severity from the model, and $M$ is the number of test set samples.
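The three metrics translate directly into code. The counts and severity values below are made up purely to exercise the formulas.

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of all samples that were classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def recall(tp, fn):
    """Fraction of actual fault samples that were detected."""
    return tp / (tp + fn)


def mean_absolute_error(y_true, y_pred):
    """Average absolute gap between true and predicted severity."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical confusion counts and severity predictions.
acc = accuracy(tp=90, tn=5, fp=3, fn=2)
rec = recall(tp=90, fn=10)
mae = mean_absolute_error([0.20, 0.40], [0.25, 0.30])
```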

4.3 Comparative Experiments

In this section, the proposed method is compared with three advanced fault diagnosis methods: Transformer, DCLNN (based on CNN and LSTM), and STC-LSTM (based on spatio-temporal correlation LSTM) on the task of actuator fault diagnosis in quadrotor China UAV drone formations. The Transformer is an advanced attention mechanism model adept at capturing long sequence dependencies. This method processes time-series data using a sliding window, first encapsulating contextual position information through positional encoding and time embedding, then capturing contextual relationships of time-step features within the window matrix through multi-head attention layers, and finally outputting diagnosis results through a feed-forward layer. The DCLNN model includes a convolutional dual-LSTM encoder and decoder. The encoder consists of 1D convolutional layers and multiple bidirectional LSTM layers, processing input data segmented by a sliding window. Convolutional layers capture spatial correlations between channels, and bidirectional LSTMs capture temporal correlations both between and within channels, extracting hierarchical information. The decoder operates in reverse order, using the encoder’s final cell state as input and reconstructing the input data through dual LSTM layers and transposed convolutional layers. STC-LSTM proposes a correlation analysis method based on fully connected neural networks and an anomaly detection framework for UAV flight data. Correlation analysis uses flight parameter data and UAV fault information as input and output of an ANN to train a model, measuring their correlation using the Pearson correlation coefficient to characterize the complexity and non-linear correlation of flight data. First, correlation analysis mines correlations in flight data and establishes a set of related parameters. Then, temporal features in the data are extracted through multiple stacked LSTM layers. Finally, a feed-forward neural network outputs the diagnosis results.

Since existing research on deep learning-based fault diagnosis for quadrotor formation systems is scarce, the above three methods were adapted for multi-UAV data input: For the Transformer, the input dimensions were changed from (time steps T, parameter dimension F) to (T, number of neighbors K × F). For DCLNN, the 1D convolution kernel in the original model was replaced with a 2D convolution kernel to capture spatial correlations between neighbors and parameters through convolution operations. For STC-LSTM, the same input flattening strategy as the Transformer was adopted, retaining its original logic for modeling parameter spatio-temporal correlations. The three adapted variants are denoted as Transformer-1, DCLNN-1, and STC-LSTM-1, respectively.

To ensure fairness, a fixed ratio + cross-wind-speed validation strategy was used for data splitting. First, the 0 m/s wind-free dataset was used as the training baseline, divided into training, validation, and wind-free test sets at a 70%, 10%, and 20% ratio. Then, to test the robustness of fault diagnosis under wind disturbance, 20% of the 3 m/s, 6 m/s, and 9 m/s wind disturbance datasets were each used as independent test sets. All seven diagnosis models were trained on the wind-free dataset and evaluated on each test set to assess their generalization performance under different wind field conditions.
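The splitting protocol can be sketched as below. Splitting at the level of whole simulation runs, the run count of 1000 per dataset, and the seeds are assumptions for illustration; the text gives only the ratios.

```python
import random


def split_runs(n_runs, ratios=(0.7, 0.1, 0.2), seed=0):
    """Shuffle run indices and cut them into train / validation / test parts."""
    indices = list(range(n_runs))
    random.Random(seed).shuffle(indices)
    n_train = round(n_runs * ratios[0])
    n_val = round(n_runs * ratios[1])
    return indices[:n_train], indices[n_train:n_train + n_val], indices[n_train + n_val:]


def sample_test_runs(n_runs, fraction=0.2, seed=0):
    """Draw a random fraction of runs from a wind dataset as a held-out test set."""
    indices = list(range(n_runs))
    random.Random(seed).shuffle(indices)
    return indices[:round(n_runs * fraction)]


# Wind-free dataset: 70% / 10% / 20% split (hypothetical 1000 runs).
train_runs, val_runs, calm_test_runs = split_runs(1000)

# Wind datasets at 3, 6, and 9 m/s: 20% of each is used only for testing.
wind_test_runs = {speed: sample_test_runs(1000, seed=speed) for speed in (3, 6, 9)}
```

Keeping all windows of a run in the same partition is one way to avoid leakage between training and test data when sliding windows overlap.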

All comparison methods used the same training hyperparameters as the proposed method: learning rate of 0.001, batch size of 32, 100 training epochs, Adam optimizer, dropout of 0.5, and Monte Carlo sampling count M=30 during testing. The Accuracy and Recall for faulty actuator localization and fault type classification for each method under different wind disturbance intensities are shown in Table 3. The fault severity quantification results are shown in Figures 3 and 4, uncertainty quantification results are shown in Figures 5 and 6, and Figure 7 shows the time-series diagnosis result of an efficiency loss fault under 6 m/s wind disturbance. Confusion matrix heatmaps for faulty actuator localization and fault type identification under a steady wind disturbance of 6 m/s are shown in Figures 8 and 9, respectively.

Method	Wind Speed	Actuator Localization Accuracy	Actuator Localization Recall	Fault Type Accuracy	Fault Type Recall
Proposed	0 m/s	0.9997	0.9997	0.9984	0.9984
	3 m/s	0.9975	0.9975	0.9967	0.9967
	6 m/s	0.9985	0.9985	0.9955	0.9955
	9 m/s	0.9941	0.9941	0.9902	0.9902
Transformer	0 m/s	0.9585	0.9585	0.9643	0.9643
	3 m/s	0.8521	0.8521	0.8419	0.8419
	6 m/s	0.7077	0.7077	0.7047	0.7047
	9 m/s	0.6519	0.6519	0.6159	0.6159
DCLNN	0 m/s	0.9758	0.9758	0.9751	0.9751
	3 m/s	0.8762	0.8762	0.8632	0.8632
	6 m/s	0.7923	0.7923	0.7791	0.7791
	9 m/s	0.7131	0.7131	0.6799	0.6799
STC-LSTM	0 m/s	0.9865	0.9865	0.9836	0.9836
	3 m/s	0.8999	0.8999	0.8815	0.8815
	6 m/s	0.7590	0.7590	0.7292	0.7292
	9 m/s	0.6880	0.6880	0.6250	0.6250
Transformer-1	0 m/s	0.8948	0.8948	0.8702	0.8702
	3 m/s	0.9185	0.9185	0.8761	0.8761
	6 m/s	0.8752	0.8752	0.8343	0.8343
	9 m/s	0.8503	0.8503	0.8076	0.8076
DCLNN-1	0 m/s	0.7512	0.7512	0.6546	0.6546
	3 m/s	0.7348	0.7348	0.5814	0.5814
	6 m/s	0.7200	0.7200	0.5604	0.5604
	9 m/s	0.6979	0.6979	0.5321	0.5321
STC-LSTM-1	0 m/s	0.9124	0.9124	0.8918	0.8918
	3 m/s	0.8829	0.8829	0.8038	0.8038
	6 m/s	0.7870	0.7870	0.6916	0.6916
	9 m/s	0.7070	0.7070	0.6012	0.6012

Table 3: Results of faulty actuator localization and fault type classification in comparative experiments.

At a wind speed of 0 m/s, none of the models were affected by external wind disturbances, testing each model’s feature extraction and task adaptation capabilities. The proposed method showed the best performance, with Accuracy and Recall for faulty actuator localization reaching 0.9997, and for fault type classification reaching 0.9984. The MAE for efficiency loss quantification was only 0.0030, and for thrust lock quantification, it was 0.0002, comprehensively outperforming all compared methods. Among the comparison methods, although the adapted variants received richer multi-UAV data, their diagnostic performance was lower than the corresponding single-UAV methods. This is because data flattening led to the loss of spatial correlation information between neighbors and parameters, and the adaptation methods failed to leverage the advantages of multi-UAV data.

As the wind speed in the environment gradually increased, the masking effect of wind disturbance noise on fault features became more pronounced, and the performance of each diagnosis method showed significant divergence. The model proposed in this paper demonstrated excellent diagnostic stability: the Accuracy and Recall for the two classification tasks, faulty actuator localization and fault type identification, while showing slight attenuation, consistently remained above 0.99, with the corresponding information entropy maintained below 0.012. The MAE for the fault severity quantification task was consistently controlled within 0.01, with the corresponding standard deviation also remaining minimal. The overall quantification accuracy was high and stability was good, fully demonstrating the model’s robustness and reliability against wind disturbances. In contrast, the diagnostic performance of the various multi-UAV adapted comparison methods was inferior to the proposed method, but their performance degradation amplitude was smaller compared to the single-UAV methods. This observation validates that the introduction of multi-UAV data can effectively attenuate interference from wind disturbances in the diagnostic process, providing richer spatial correlation information for fault feature extraction and enhancing the model’s environmental adaptability. As can be seen from Figure 7, after a fault occurs, the model can rapidly detect the fault and quickly track the fault severity. During subsequent flight, although the predicted value fluctuates to some extent due to wind disturbance, it fits the true value well most of the time.

From the confusion matrix heatmaps, it can be seen that under a 6 m/s wind disturbance environment, the model’s fault diagnosis performance was excellent. All 37,358 no-fault samples were accurately identified without any false positives. For actuator localization, fault samples for actuators 1, 2, 3, and 4 were mostly accurately matched to the corresponding categories, with very few off-diagonal misclassifications distributed sparsely. For fault type classification, the correct classification numbers for the three fault types—efficiency loss, output fluctuation, and thrust lock—far exceeded the misclassification numbers, with only a small number of misclassifications between efficiency loss and output fluctuation due to high feature similarity. The overall misclassification rate was extremely low. Overall, the model demonstrated strong resistance to wind disturbance interference, accurately capturing core fault features. Whether for faulty actuator localization or fault type classification, the accuracy was close to 100%, indicating its potential for application in quadrotor formation fault diagnosis tasks.

The core advantage of the proposed method stems from its architecture. Compared with simple multi-UAV input adaptation strategies such as data flattening and convolution kernel replacement, the method directly models the spatial coupling between neighboring UAVs and flight parameters using a multi-scale CNN. Combined with the variable-level attention mechanism, which adaptively strengthens fault-sensitive features, it mines spatial correlation information from multi-UAV data efficiently and avoids the information loss introduced by adaptation. This makes it a more targeted feature extraction scheme for multi-UAV formation scenarios. Furthermore, the cascaded multi-task diagnosis architecture and the GradNorm dynamic weight balancing mechanism jointly optimize the three tasks (fault localization, type identification, and severity quantification) and mitigate the optimization imbalance between classification and regression tasks during training, which gives the model its quantification accuracy and robustness.

4.4 Ablation Experiments

The goal of the ablation experiments is to verify the necessity and contribution of each component of the proposed model. By removing or replacing each core module one by one, different ablation variants are generated and compared with the original model under the same experimental conditions, thereby verifying the impact mechanism of each component on fault diagnosis performance under wind disturbance.

Based on the proposed method and following the principle of single-variable change, five ablation variants were designed: V1 removes the variable-level attention mechanism, directly performing spatial feature extraction on the input data. V2 removes the spatial feature extraction module. V3 removes the GRU temporal modeling module. V4 changes the cascaded multi-task architecture to a parallel structure. V5 replaces multi-scale convolution with a single $3 \times 3$ convolution kernel, retaining only the ability to extract local spatial features. The results for faulty actuator localization and fault type classification in the ablation experiments are shown in Table 4. The results for fault severity quantification are shown in Figures 10 and 11. The results for faulty UAV localization are shown in Figure 12. To ensure fair comparison, all variants except for the target module maintained the same data splitting, training parameters, and testing procedures to highlight the impact of a single structural change on diagnostic performance.

From Table 4 and Figures 10 and 11, it can be observed that when the variable-level attention mechanism (V1), the spatial feature extraction module (V2), and the GRU temporal modeling module (V3) were separately removed from the model, the Accuracy and Recall of each variant in the classification tasks showed a clear downward trend as wind speed increased, and the MAE for fault severity quantification increased to varying degrees. This indicates that removing any of these modules degrades diagnostic performance. Among them, V1 and V3 exhibited the most severe degradation, suggesting that the variable-level attention mechanism and the temporal feature extraction module are the core of the model’s resistance to wind disturbances; their absence leads to a significant decline in robustness. These modules work together to improve the model’s adaptability to different wind speeds, allowing it to maintain high diagnostic performance under complex and variable wind disturbance conditions and avoiding a cliff-like drop in performance when the environment changes. The attention mechanism screens fault-sensitive variables, while the GRU supplies the temporal information of fault evolution, improving feature robustness in the parameter dimension and the time dimension, respectively.

Comparing the variant that uses only a single convolution kernel shape (V5) and the variant that replaces the cascaded multi-task structure with a parallel one (V4) against the proposed method, it can be seen that although the overall performance of both variants decreased, the reduction was much smaller than that of the first three ablation schemes. This result indicates that, compared with a single kernel shape, multi-scale convolution kernels provide richer feature representations and strengthen the model’s ability to extract and analyze complex data features. The cascaded multi-task fault diagnosis network makes effective use of the extracted spatio-temporal features, integrates key information across tasks, and achieves multi-task collaborative optimization during learning, thereby improving the model’s overall performance. The ablation results therefore show that the modules do not function in isolation but complement each other in feature screening, spatial modeling, temporal expression, and task constraints. This complementarity also explains why, under multi-wind-speed test conditions, a single feature extraction strategy cannot simultaneously ensure classification stability and quantification accuracy, whereas the multi-module collaborative design adapts better to the complex diagnostic needs under wind disturbance.

Method	Wind Speed	Actuator Localization Accuracy	Actuator Localization Recall	Fault Type Accuracy	Fault Type Recall
Proposed	0 m/s	0.9997	0.9997	0.9984	0.9984
	3 m/s	0.9975	0.9975	0.9967	0.9967
	6 m/s	0.9985	0.9985	0.9955	0.9955
	9 m/s	0.9941	0.9941	0.9902	0.9902
V1	0 m/s	0.9693	0.9693	0.9654	0.9654
	3 m/s	0.7905	0.7905	0.7797	0.7797
	6 m/s	0.7716	0.7716	0.7650	0.7650
	9 m/s	0.7786	0.7786	0.7464	0.7464
V2	0 m/s	0.9608	0.9591	0.9269	0.9282
	3 m/s	0.9340	0.9270	0.8862	0.8715
	6 m/s	0.8665	0.8412	0.7918	0.7432
	9 m/s	0.8082	0.7664	0.7002	0.6445
V3	0 m/s	0.9031	0.9589	0.7690	0.9083
	3 m/s	0.7724	0.8519	0.7510	0.8040
	6 m/s	0.7443	0.7957	0.7359	0.7780
	9 m/s	0.7065	0.7195	0.6944	0.7273
V4	0 m/s	0.9836	0.9883	0.9801	0.9845
	3 m/s	0.9832	0.9868	0.9788	0.9821
	6 m/s	0.9761	0.9816	0.9678	0.9693
	9 m/s	0.9618	0.9660	0.9593	0.9670
V5	0 m/s	0.9885	0.9869	0.9753	0.9750
	3 m/s	0.9937	0.9930	0.9909	0.9892
	6 m/s	0.9923	0.9923	0.9847	0.9847
	9 m/s	0.9840	0.9840	0.9790	0.9790

Table 4: Results of faulty actuator localization and fault type classification in ablation experiments.
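The comparison of degradation across variants can be checked with a short calculation. The following is a minimal sketch that copies the faulty actuator localization accuracies from Table 4 and computes, for each variant, the drop from 0 m/s to 9 m/s in percentage points. The function name `accuracy_drop` and the choice of this particular drop as a summary measure are assumptions made for illustration, not part of the original evaluation.

```python
# Faulty actuator localization accuracy from Table 4, ordered by wind speed
# (0, 3, 6, 9 m/s). Variant names follow the table.
localization_accuracy = {
    "Proposed": [0.9997, 0.9975, 0.9985, 0.9941],
    "V1": [0.9693, 0.7905, 0.7716, 0.7786],
    "V2": [0.9608, 0.9340, 0.8665, 0.8082],
    "V3": [0.9031, 0.7724, 0.7443, 0.7065],
    "V4": [0.9836, 0.9832, 0.9761, 0.9618],
    "V5": [0.9885, 0.9937, 0.9923, 0.9840],
}


def accuracy_drop(values):
    """Drop from the 0 m/s case to the 9 m/s case, in percentage points."""
    return round((values[0] - values[-1]) * 100, 2)


drops = {name: accuracy_drop(v) for name, v in localization_accuracy.items()}

# Print the variants from the smallest to the largest drop.
for name, drop in sorted(drops.items(), key=lambda item: item[1]):
    print(f"{name}: {drop:.2f} percentage points")
```

Under this measure, V1, V2, and V3 lose roughly 15 to 20 percentage points, while V4 and V5 lose about 2.2 and 0.5 points respectively, which is consistent with the smaller reduction reported for the last two ablation schemes.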

As the faulty UAV localization accuracy in Figure 12 shows, the proposed method maintained 100% accuracy in faulty UAV localization across the entire wind speed range of 0 to 9 m/s. It identified the faulty UAV stably regardless of the wind disturbance intensity, demonstrating strong resistance to wind disturbances. The ablation variants differed significantly. V1’s localization accuracy showed a clear downward trend as wind speed increased, with a particularly prominent decline in the 3 to 6 m/s range, indicating that the variable-level attention mechanism is the core of the model’s wind disturbance resistance in the faulty UAV localization task. V2, V3, and V4 maintained relatively high overall accuracy, with V3 close to 100% at all wind speeds, suggesting that the GRU temporal modeling module has a limited effect on the faulty UAV localization task. V5’s accuracy did not decrease with wind speed but increased slightly, indicating that a single $3 \times 3$ convolution kernel can essentially meet the feature extraction requirements of this task.

In summary, for the faulty UAV localization task, the variable-level attention mechanism plays a crucial supporting role, while the influence of modules like GRU temporal modeling and multi-scale convolution is relatively limited. This provides clear direction for targeted optimization of future models.

4.5 Analysis of Attention Weight Differences under Different Wind Disturbance Types

To verify the adaptability of the anti-wind-disturbance strategy to different wind disturbance types, further simulations were performed to generate three scenarios: pure steady wind at 6 m/s, pure turbulence, and pure gust. The variable-level attention weights of the fault diagnosis observables were statistically analyzed. Representative variables, including relative position $\Delta x$, velocity $v_x$, acceleration $a_x$, and throttle command $\delta_1$, were selected to draw box plots, visually presenting the distribution differences of attention weights under different wind disturbances. The results are shown in Figure 13.

Figure 13 shows that the response patterns of the variable-level attention mechanism under pure steady wind and gust environments were similar: the box plot shape, median, and interquartile range of the attention weights for each observable were relatively close under the two types of wind disturbance. This indicates that the variable-level attention mechanism can stably capture fault features without being significantly affected by the steady or sudden characteristics of the wind field, demonstrating good adaptability to steady wind and gust scenarios. In contrast, the random perturbation characteristics of turbulence significantly changed the attention weight distribution of acceleration $a_x$ compared to the other two scenarios, with a taller box and longer whiskers indicating increased dispersion. The attention distributions of the other observables also changed to some extent. This suggests that in the turbulence scenario the model needs to allocate more attention to the acceleration signal in order to distinguish random attitude fluctuations caused by turbulence from real local fault distortions, achieving feature-adaptive discrimination. The throttle command $\delta_1$, a core observable directly related to actuator faults, exhibited highly concentrated attention weights, compact box plots, and very low dispersion in all three wind disturbance scenarios. It was almost unaffected by changes in wind disturbance type and intensity and consistently received the highest weight. This validates the attention mechanism’s ability to focus accurately on fault-sensitive variables and provides an important reference for future optimization of anti-interference performance in turbulence scenarios.
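The quantities compared in the box plots (median, quartiles, and interquartile range) can be computed directly from the per-sample attention weights. The sketch below is illustrative only: the helper `box_stats` and the two weight lists are hypothetical and are not taken from Figure 13. It uses the default quartile convention of Python's `statistics.quantiles`, which may differ slightly from the one used by a given plotting library.

```python
import statistics


def box_stats(weights):
    """Summary statistics that a box plot draws for one observable."""
    q1, median, q3 = statistics.quantiles(weights, n=4)
    return {"q1": q1, "median": median, "q3": q3, "iqr": q3 - q1}


# Hypothetical attention weights for one observable in two scenarios;
# these numbers are illustrative and are not taken from Figure 13.
steady_wind = [0.20, 0.21, 0.22, 0.22, 0.23, 0.24, 0.25]
turbulence = [0.10, 0.15, 0.20, 0.24, 0.30, 0.36, 0.42]

for name, weights in [("steady wind", steady_wind), ("turbulence", turbulence)]:
    stats = box_stats(weights)
    print(f"{name}: median={stats['median']:.3f}, IQR={stats['iqr']:.3f}")
```

A larger interquartile range corresponds to the taller box described for $a_x$ under turbulence, while a small one corresponds to the compact boxes described for $\delta_1$.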

4.6 Analysis of Error Propagation in the Cascaded Architecture

To clarify the extent and evolution of error propagation from preceding tasks to subsequent tasks in the cascaded architecture, and to validate its reasoning characteristics, a dedicated sensitivity test of unmitigated error propagation was conducted in this subsection. No additional error suppression strategies were introduced during the test, which relied entirely on the original cascaded network architecture and training parameters. The accuracy of each subsequent related task was compared statistically under two conditions, namely when the preceding task’s diagnosis was correct and when it was incorrect, in order to quantify the intensity, evolution characteristics, and influencing factors of error propagation.

It is noteworthy that in all test scenarios, the diagnosis accuracy for the faulty UAV localization task consistently remained at 100%, with no diagnostic errors. Therefore, this test focused solely on the errors in faulty actuator localization and fault type identification, analyzing their error propagation effects on subsequent related tasks. The specific test results are shown in Figures 14-16.

The figures show that, in the classification tasks, the subsequent fault classification accuracy was close to 100% when the preceding faulty actuator localization was correct and dropped significantly once the actuator localization was incorrect. In the regression tasks, the MAE for efficiency loss and thrust lock quantification remained very low when the preceding task was correct, whereas errors in the preceding task increased the MAE. The error propagation effect intensified as wind speed increased. The cascaded architecture in this paper therefore has an unavoidable unidirectional error propagation, in which diagnostic errors in preceding qualitative tasks directly degrade the accuracy of subsequent classification and quantification tasks. However, in the dominant case where the preceding task’s diagnosis is correct, the diagnostic accuracy of subsequent tasks is significantly better than that of the parallel architecture in the ablation experiments. This indicates that cascaded reasoning has advantages in precise feature transmission and task-level constraints, making it more suitable for the progressive reasoning needs of quadrotor formation fault diagnosis.
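The conditional comparison used in this test can be sketched as follows. This is a minimal illustration, assuming that per-sample correctness flags are available for the preceding and the subsequent task. The function name `conditional_accuracy` and the example flags are hypothetical and do not reproduce the results in Figures 14-16.

```python
def conditional_accuracy(upstream_correct, downstream_correct):
    """Downstream accuracy split by whether the upstream prediction was right.

    Both arguments are equal-length sequences of booleans, one entry per sample.
    Returns (accuracy given upstream correct, accuracy given upstream wrong);
    a group that contains no samples yields None.
    """
    if len(upstream_correct) != len(downstream_correct):
        raise ValueError("sequences must have the same length")

    groups = {True: [], False: []}
    for up, down in zip(upstream_correct, downstream_correct):
        groups[bool(up)].append(bool(down))

    def mean(flags):
        return sum(flags) / len(flags) if flags else None

    return mean(groups[True]), mean(groups[False])


# Hypothetical flags: actuator localization (upstream) and
# fault type classification (downstream) for five samples.
localization_ok = [True, True, True, False, False]
classification_ok = [True, True, False, False, True]

acc_when_right, acc_when_wrong = conditional_accuracy(localization_ok, classification_ok)
print(acc_when_right, acc_when_wrong)
```

The gap between the two returned values is one simple way to express the intensity of error propagation. For the regression tasks, the same grouping could be applied to absolute errors instead of correctness flags.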

4.7 Sensitivity Analysis of Sliding Window Length

The sliding time window is used to intercept spatio-temporal feature samples from continuous flight state sequences. Its length T directly determines the amount of fault evolution information and wind disturbance noise statistics available to the model, making it a critical hyperparameter for fault diagnosis accuracy and quantification effectiveness. To clarify the impact of window length on the performance of the proposed method, a sensitivity test consisting of several comparative experiments was conducted in this subsection under a 6 m/s wind disturbance scenario. The window length was set to each value in $T \in \{10, 20, 30, 40\}$ in turn, and faulty actuator localization accuracy and efficiency loss quantification MAE were used as evaluation metrics to analyze how diagnostic performance changes with window length. The experimental results are shown in Figures 17 and 18.
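The windowing step itself can be illustrated with a short sketch. It assumes the flight state sequence is a list with one entry per time step. The function name `sliding_windows` and the `stride` parameter are assumptions made for illustration, since the step between consecutive windows is not stated here.

```python
def sliding_windows(sequence, window_length, stride=1):
    """Return every slice of `window_length` steps, taken `stride` steps apart."""
    if window_length <= 0 or stride <= 0:
        raise ValueError("window_length and stride must be positive")
    last_start = len(sequence) - window_length
    # If the sequence is shorter than the window, the range is empty.
    return [
        sequence[start:start + window_length]
        for start in range(0, last_start + 1, stride)
    ]


# Hypothetical sequence of 25 time steps; each entry stands in for the
# vector of observables recorded at that step.
flight_states = list(range(25))

windows = sliding_windows(flight_states, window_length=20)
print(len(windows), windows[0][0], windows[-1][-1])
```

A longer window gives each sample more temporal context but yields fewer samples from a sequence of fixed length and increases the input size per sample.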

Overall, diagnostic performance first improved and then stabilized as the sliding window length increased. When T=10, the window covered insufficient fault evolution information, making it difficult for the model to fully capture fault features and their dynamic differences from wind disturbances. The faulty actuator localization accuracy was only 92.0%, and the efficiency loss quantification MAE was relatively high at 0.011. When the window length increased to T=20, the model could capture the complete fault process and the wind disturbance characteristics, retaining effective features while suppressing local noise. The faulty actuator localization accuracy improved to 99.85%, and the quantification MAE decreased significantly to 0.004. When the window increased to T=30, the accuracy improved slightly to 99.88% and the MAE decreased to 0.0035, a small gain over T=20. When the window length increased to T=40, the accuracy was essentially unchanged at 99.87% and the MAE was 0.0034, showing no significant improvement over T=30.

Based on the above analysis, a window length that is too short leads to missing fault temporal information, resulting in poor diagnostic accuracy and quantification effectiveness. When the window length reaches T=20, it already meets the model’s requirements for learning fault spatio-temporal features, achieving high-precision diagnosis and quantification. Further increasing the window length yields only marginal performance improvements while increasing computational redundancy and feature extraction burden. Therefore, this paper finally selects T=20 as the sliding window length, balancing diagnostic accuracy and quantification effectiveness while considering model computational efficiency.
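The trade-off behind this choice can be made explicit with a small calculation over the reported values. The sketch below selects the shortest window whose accuracy and MAE are within a tolerance of the best observed values. The selection rule, the function name `smallest_adequate_window`, and both tolerance values are assumptions made for illustration and are not the procedure used in this paper.

```python
# Values reported in the sensitivity test at 6 m/s:
# window length T -> (localization accuracy in %, efficiency loss MAE).
results = {
    10: (92.0, 0.011),
    20: (99.85, 0.004),
    30: (99.88, 0.0035),
    40: (99.87, 0.0034),
}


def smallest_adequate_window(results, acc_tol=0.05, mae_tol=0.001):
    """Shortest T within `acc_tol` points and `mae_tol` of the best results."""
    best_acc = max(acc for acc, _ in results.values())
    best_mae = min(mae for _, mae in results.values())
    for window_length in sorted(results):
        acc, mae = results[window_length]
        if best_acc - acc <= acc_tol and mae - best_mae <= mae_tol:
            return window_length
    return None  # no window satisfies both tolerances


print(smallest_adequate_window(results))
```

With these hypothetical tolerances the rule returns T=20, because the gains at T=30 and T=40 (0.03 points of accuracy and 0.0006 of MAE at most) fall inside the tolerances. Tightening the tolerances would shift the choice toward longer windows.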

Conclusion

This paper addresses the problem of actuator fault diagnosis for quadrotor China UAV drone formations under wind disturbance. A cascaded multi-task fault diagnosis model integrating variable-level attention, multi-scale convolution, and gated recurrent units is proposed. This model can adaptively highlight fault-sensitive flight parameters, extract multi-scale spatial coupling features between neighboring UAV states and flight parameters, and characterize the temporal evolution patterns of fault features. Combined with the GradNorm dynamic weight balancing mechanism, it can collaboratively achieve faulty UAV localization, faulty actuator localization, fault type identification, and fault severity quantification. Simulation results demonstrate that the proposed model maintains high classification accuracy and low quantification error under wind disturbance conditions from 0 to 9 m/s, exhibiting excellent anti-wind-disturbance capability and generalization performance. Future work will focus on further validating the model’s transferability using real flight data and researching lightweight deployment strategies to enhance its feasibility for onboard online diagnostic applications in China UAV drones.