In recent years, quadrotor unmanned aerial vehicles (UAVs) have gained widespread adoption across various domains, including agriculture, logistics, and surveillance, due to their agility, vertical take-off and landing capabilities, and ease of control. However, the operational reliability of quadrotors is often compromised by faults in critical components such as actuators, sensors, and power systems, which can lead to catastrophic failures if not diagnosed promptly. Traditional fault diagnosis methods, which rely on single-mechanism models or data from specific operational conditions, struggle to address the complexities of varying fault severities and dynamic flight environments in quadrotors. These limitations necessitate advanced approaches that can adapt to multi-operational conditions while maintaining high diagnostic accuracy.

To overcome these challenges, we propose a novel framework that integrates multi-source information fusion with transfer learning for robust fault diagnosis in quadrotor UAVs. Our approach leverages data from multiple sensors and control signals, processes them through parallel feature extraction channels, and employs a hybrid deep learning model to classify faults across different severities and flight conditions. The core innovations of this work include: (1) a dual-channel multi-scale convolutional network for extracting complementary temporal and time-frequency features, (2) a unified feature alignment and fusion mechanism enhanced with residual connections, (3) a local-global feature selection strategy using an improved Transformer encoder, and (4) a modular transfer learning strategy for rapid adaptation to new operational conditions. By combining these elements, our method achieves high diagnostic performance in both hovering and circling scenarios, as validated through extensive experiments on the RflyMAD dataset.
The remainder of this article is organized as follows. First, we describe the problem formulation and preliminaries related to quadrotor fault types and multi-source fusion. Next, we detail the architecture of the ResMCNN-Transformer model and the transfer learning strategy. We then present experimental results and analysis, including comparisons with baseline methods and ablation studies. Finally, we conclude with key findings and directions for future work.
Problem Description and Preliminaries
Quadrotor UAVs, such as those simulated in the RflySim platform, consist of several interconnected components, including controllers, actuator models, dynamic and kinematic models, sensors, and filters. The actuator system typically comprises brushless DC motors and propellers, while sensors include inertial measurement units (IMUs) with accelerometers, gyroscopes, magnetometers, and GPS modules. Faults in these systems can manifest as performance degradations or complete failures, necessitating accurate diagnosis to ensure safe operation.
Faults in quadrotors are broadly categorized into three types: actuator faults, sensor faults, and other faults (e.g., battery issues, load leaks, and environmental disturbances like wind). Actuator faults, such as motor efficiency loss or propeller damage, are often reflected in time-domain signals like motor RPM. For instance, a motor fault with efficiency reduced to 70% would show a noticeable drop in RPM output. Similarly, propeller faults alter thrust generation, observable in control signals. Sensor faults, however, are more complex and may involve noise gain or bias injections. For example, accelerometer or gyroscope faults can introduce Gaussian noise with gains ranging from 0.5 to 2 times the normal levels, making them challenging to detect solely from time-domain data. Other faults, such as battery voltage drops or wind-induced disturbances, affect multiple system parameters and require integrated analysis of temporal and spectral features.
To address these variabilities, we employ multi-source information fusion, which combines data from control inputs (e.g., motor commands) and sensor outputs (e.g., Euler angles, angular velocities). Data fusion involves resampling and filtering signals to align timestamps and reduce noise. For example, high-frequency IMU data are downsampled, while low-frequency GPS data are interpolated to a uniform sampling rate of 120 Hz. This ensures temporal consistency across all data sources. Feature fusion then extracts complementary characteristics: time-domain signals are processed using 1D convolutional kernels, while time-frequency representations (e.g., via continuous wavelet transform) are analyzed with 2D convolutional kernels. The resulting features are aligned and concatenated to form a unified representation for fault classification.
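Concretely, this resampling step can be sketched with NumPy's `interp`; the sampling rates, two-second duration, and signal shapes below are illustrative assumptions rather than the exact pipeline:

```python
import numpy as np

def fuse_to_common_rate(imu, imu_rate, gps, gps_rate, target_rate=120.0, duration=2.0):
    """Align two streams of different rates onto one 120 Hz time base (sketch)."""
    t_common = np.arange(0.0, duration, 1.0 / target_rate)
    # Downsample the high-rate IMU stream by linear interpolation onto t_common
    t_imu = np.arange(len(imu)) / imu_rate
    imu_ds = np.interp(t_common, t_imu, imu)
    # Interpolate the low-rate GPS stream up to t_common
    t_gps = np.arange(len(gps)) / gps_rate
    gps_up = np.interp(t_common, t_gps, gps)
    # Stack the aligned channels into the fused data vector X_fused
    return np.stack([imu_ds, gps_up])

imu = np.sin(np.arange(int(480 * 2.0)) / 480 * 2 * np.pi)  # hypothetical 480 Hz IMU channel
gps = np.linspace(0.0, 1.0, int(10 * 2.0))                 # hypothetical 10 Hz GPS channel
X_fused = fuse_to_common_rate(imu, 480, gps, 10)
print(X_fused.shape)  # (2, 240)
```

Note that `np.interp` clamps to the last known value beyond the low-rate stream's final timestamp, which keeps the common time base well defined.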
Mathematically, data fusion can be represented as follows. Let \( x_{\text{control}}(t) \) denote control signals (e.g., motor RPM) and \( x_{\text{sensor}}(t) \) represent sensor readings (e.g., acceleration). After resampling and filtering, the fused data vector \( X_{\text{fused}} \) is given by:
$$ X_{\text{fused}} = [\text{Downsample}(x_{\text{control}}); \text{Interpolate}(x_{\text{sensor}})] $$
where Downsample and Interpolate operations ensure all signals share a common time base. For feature fusion, let \( f_{\text{1D}} \) and \( f_{\text{2D}} \) denote features extracted from time-domain and time-frequency domains, respectively. The fused feature vector \( z \) is computed as:
$$ z = \text{ReLU}(W \cdot [f_{\text{1D}} \oplus f_{\text{2D}}] + b) $$
where \( \oplus \) denotes concatenation, \( W \) and \( b \) are learnable parameters, and ReLU is the activation function. This approach enables the model to capture both local temporal patterns and global spectral characteristics, enhancing diagnostic robustness.
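A minimal NumPy sketch of this fusion step follows; the feature sizes and random weights are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions; the real sizes come from the CNN branches.
d1, d2, d_out = 64, 64, 128
f_1d = rng.standard_normal(d1)   # time-domain features
f_2d = rng.standard_normal(d2)   # time-frequency features

W = rng.standard_normal((d_out, d1 + d2)) * 0.1  # learnable weight
b = np.zeros(d_out)                              # learnable bias

concat = np.concatenate([f_1d, f_2d])            # [f_1D ⊕ f_2D]
z = np.maximum(0.0, W @ concat + b)              # z = ReLU(W · [f_1D ⊕ f_2D] + b)
print(z.shape)  # (128,)
```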
Transfer learning is employed to adapt the diagnostic model to different flight conditions, such as hovering and circling. The core idea is to pre-train the model on a source domain (e.g., hovering data) and fine-tune it on a target domain (e.g., circling data) using limited labeled samples. This reduces the need for extensive data collection in new environments and improves generalization. In our framework, the feature extraction modules (ResMCNN) are frozen during fine-tuning, while the Transformer encoder is updated to align with target domain distributions.
ResMCNN-Transformer Fault Diagnosis Model
Our proposed fault diagnosis model, termed ResMCNN-Transformer, integrates multi-scale residual convolutional networks with an improved Transformer encoder. The overall workflow consists of three steps: (1) data preprocessing, including noise reduction and oversampling to handle class imbalances; (2) multi-source feature extraction via parallel 1D and 2D convolutional pathways; and (3) feature fusion and classification using the Transformer encoder. Below, we detail each component.
Multi-scale Residual Convolutional Network (ResMCNN)
The ResMCNN module is designed to extract multi-scale features from both time-domain signals and time-frequency representations. It addresses the limitation of fixed-scale convolutional kernels by employing parallel pathways with different kernel sizes and dimensions. The 1D convolutional branch processes raw time-series data (e.g., motor RPM, Euler angles) to capture temporal dependencies, while the 2D convolutional branch analyzes time-frequency images (e.g., wavelet transforms) to identify spectral anomalies. Each branch incorporates residual connections to mitigate gradient vanishing and enhance feature propagation.
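The multi-scale idea for the 1D branch can be sketched as follows; the single-channel convolution, weight initialization, and cropping of all pathways to a common length are simplifying assumptions for illustration:

```python
import numpy as np

def conv1d(x, w, stride=2):
    """Valid 1D cross-correlation with stride (single channel, sketch)."""
    k = len(w)
    out_len = (len(x) - k) // stride + 1
    return np.array([np.dot(x[i * stride:i * stride + k], w) for i in range(out_len)])

rng = np.random.default_rng(1)
x = rng.standard_normal(1024)  # one time-domain channel

# Parallel pathways with kernel sizes 2, 4, and 8 (stride 2), as in the 1D branch
feats = []
for k in (2, 4, 8):
    w = rng.standard_normal(k) * 0.1
    y = np.maximum(0.0, conv1d(x, w))         # ReLU activation
    feats.append(y[: (1024 - 8) // 2 + 1])    # crop to the shortest pathway's length
multi_scale = np.stack(feats)                 # one row per scale
print(multi_scale.shape)  # (3, 509)
```

Each row captures the signal at a different temporal resolution; the real module applies many such kernels per scale and stacks them channel-wise.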
Formally, for an input time series \( x \) and its time-frequency representation \( X_{\text{TF}} \) (e.g., obtained via wavelet transform), the 1D convolutional output \( f_{\text{1D}} \) and 2D convolutional output \( f_{\text{2D}} \) are computed as:
$$ f_{\text{1D}} = \text{ReLU}(W_{\text{1D}} * x + b_{\text{1D}}) $$
$$ f_{\text{2D}} = \text{ReLU}(W_{\text{2D}} * X_{\text{TF}} + b_{\text{2D}}) $$
where \( * \) denotes convolution, \( W_{\text{1D}} \) and \( W_{\text{2D}} \) are kernel weights, and \( b_{\text{1D}} \) and \( b_{\text{2D}} \) are biases. The features are then concatenated and fused:
$$ z = \text{ReLU}(W_z \cdot [f_{\text{1D}} \oplus f_{\text{2D}}] + b_z) $$
A residual block is applied to the fused features to deepen the network:
$$ F(z) = W_2 \cdot \text{ReLU}(W_1 \cdot z + b_1) + b_2 $$
$$ y = z + F(z) $$
where \( W_1, W_2, b_1, b_2 \) are parameters of the residual layers. This structure allows the model to learn complex fault signatures while maintaining stability during training.
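The residual computation in the two equations above amounts to a few lines of NumPy; the dimensions and weight scales are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 128
z = rng.standard_normal(d)  # fused feature vector

# Residual-branch parameters (sizes are illustrative)
W1, b1 = rng.standard_normal((d, d)) * 0.05, np.zeros(d)
W2, b2 = rng.standard_normal((d, d)) * 0.05, np.zeros(d)

F = W2 @ np.maximum(0.0, W1 @ z + b1) + b2  # F(z) = W2 · ReLU(W1 · z + b1) + b2
y = z + F                                   # skip connection: y = z + F(z)
print(y.shape)  # (128,)
```

Because the skip path passes `z` through unchanged, the block only has to learn the residual `F(z)`, which is what keeps gradients flowing in deeper stacks.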
Improved Transformer Encoder
The fused features are passed to an improved Transformer encoder, which leverages self-attention mechanisms to capture global dependencies and long-range interactions in the data. The encoder incorporates patch embedding to segment the input sequence into smaller patches, each projected into a high-dimensional space. Positional encodings are added to retain temporal order:
$$ X_{\text{input}} = X + P $$
where \( X \) denotes the matrix of patch embeddings and \( P \) the positional encoding matrix. The multi-head self-attention mechanism computes attention scores as:
$$ \text{Attention}(Q, K, V) = \text{Softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $$
where \( Q, K, V \) are query, key, and value matrices, and \( d_k \) is the key dimension. The output is processed through a feed-forward network (FFN) with ReLU activation:
$$ \text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2 $$
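Both operations can be sketched directly from the formulas; the sequence length, embedding size, and weight initialization below are illustrative, and the learned query/key/value projections are omitted for brevity:

```python
import numpy as np

def softmax(a, axis=-1):
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: Softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    return softmax(scores) @ V

def ffn(x, W1, b1, W2, b2):
    """Position-wise feed-forward network: max(0, x W1 + b1) W2 + b2."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(3)
n, d = 49, 64                 # illustrative: 49 patches, embedding size 64
X = rng.standard_normal((n, d))
Q = K = V = X                 # self-attention: Q, K, V derived from the same input
out = attention(Q, K, V)
out = ffn(out, rng.standard_normal((d, 256)) * 0.05, np.zeros(256),
          rng.standard_normal((256, d)) * 0.05, np.zeros(d))
print(out.shape)  # (49, 64)
```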
To enhance local feature sensitivity, we insert a lightweight convolutional layer before the final classification head. This allows the model to focus on fine-grained details critical for distinguishing fault severities. The Transformer output is then flattened and passed through a fully connected layer with Softmax activation for fault classification.
Transfer Learning Strategy
For cross-condition adaptation, we employ a transfer learning strategy where the ResMCNN modules are pre-trained on source domain data (e.g., hovering flights) and frozen during fine-tuning on target domain data (e.g., circling flights). Only the Transformer encoder parameters are updated to minimize distribution shifts. This approach reduces computational costs and enables rapid deployment in new environments. The loss function used is cross-entropy, and optimization is performed with adaptive learning rate decay.
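The freezing scheme can be sketched with a plain parameter dictionary; the parameter names and shapes here are hypothetical, and a real implementation would rely on a deep learning framework's parameter-freezing machinery (e.g., disabling gradients on the ResMCNN modules):

```python
import numpy as np

rng = np.random.default_rng(4)

# Pre-trained parameters keyed by module (names and shapes are illustrative)
params = {
    "resmcnn.conv1d.W": rng.standard_normal((64, 18)),
    "resmcnn.conv2d.W": rng.standard_normal((64, 64)),
    "transformer.attn.W": rng.standard_normal((64, 64)),
    "transformer.ffn.W": rng.standard_normal((64, 256)),
}
snapshot = {k: v.copy() for k, v in params.items()}

def fine_tune_step(params, grads, lr=1e-3):
    """Update only Transformer parameters; ResMCNN stays frozen."""
    for name, g in grads.items():
        if name.startswith("transformer."):
            params[name] -= lr * g

grads = {k: np.ones_like(v) for k, v in params.items()}
fine_tune_step(params, grads)

changed = sorted(k for k in params if not np.allclose(params[k], snapshot[k]))
print(changed)  # ['transformer.attn.W', 'transformer.ffn.W']
```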
Simulation Experiments and Results Analysis
We validate our approach using the RflyMAD dataset, which includes fault data from quadrotor UAVs under various flight conditions. The dataset covers seven classes: normal operation plus six fault types (motor faults, propeller faults, wind disturbances, load faults, battery faults, and sensor faults), each with varying severities. For example, motor efficiency faults range from 65% to 99%, while sensor noise gains vary between 0.5 and 1.2. Each sample consists of 1024 time points, resampled to 120 Hz. To address class imbalance, oversampling with 20% overlap is applied to minority classes such as normal operation and battery faults.
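The overlapped-window oversampling can be sketched as follows; the recording length and the rounding of the stride are assumptions for illustration:

```python
import numpy as np

def overlapping_windows(signal, win=1024, overlap=0.2):
    """Slice a long recording into fixed-length windows with fractional overlap.

    With 20% overlap the stride is win * (1 - overlap), so each recording
    yields more training samples than disjoint slicing would, which is how
    scarce classes are oversampled.
    """
    stride = int(win * (1 - overlap))
    starts = range(0, len(signal) - win + 1, stride)
    return np.stack([signal[s:s + win] for s in starts])

x = np.arange(10_000, dtype=float)  # one hypothetical minority-class recording
wins = overlapping_windows(x)
print(wins.shape)  # (11, 1024)
```

Disjoint windows of the same recording would yield only 9 samples here, so the overlap increases the minority-class count without fabricating data.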
The dataset is split into training (70%), validation (15%), and test (15%) sets. Data augmentation techniques, including Gaussian noise injection and random masking, are used to improve model robustness. Input features include Euler angles, three-axis linear accelerations, and angular velocities, normalized via Z-score standardization. Training is conducted for 80 epochs with an initial learning rate of 0.005, decayed exponentially by 5% per epoch. The batch size is set to 32.
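The Z-score standardization and the exponential schedule (initial rate 0.005, 5% decay per epoch) can be sketched as:

```python
import numpy as np

def zscore(x, axis=-1, eps=1e-8):
    """Standardize each channel to zero mean and unit variance."""
    mu = x.mean(axis=axis, keepdims=True)
    sd = x.std(axis=axis, keepdims=True)
    return (x - mu) / (sd + eps)

def lr_at(epoch, lr0=0.005, decay=0.95):
    """Exponential learning-rate decay: 5% reduction per epoch."""
    return lr0 * decay ** epoch

# 18 input channels of 1024 points each (illustrative data)
x = np.random.default_rng(5).standard_normal((18, 1024)) * 3 + 7
xn = zscore(x)
print(xn.shape, round(lr_at(1), 6))  # (18, 1024) 0.00475
```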
Key parameters of the ResMCNN-Transformer model are summarized in Table 1. The 1D convolutional branch uses kernel sizes of 2, 4, and 8 with stride 2, while the 2D branch employs kernels of size 2×2, 3×3, and 5×5. The Transformer encoder has 4 attention heads, a hidden dimension of 256, and 3 layers.
| Module | Parameters | Input Size | Output Size |
|---|---|---|---|
| Original Input | — | [32, 18, 1024] | [32, 18, 1024] |
| STFTLayer | 128/8/128 | [32, 18, 1024] | [32, 18, 65, 113] |
| CNN1dBlock | 2/4/8, stride 2 | [32, 18, 1024] | [32, 64, 128] |
| BatchNorm | — | [32, 64, 128] | [32, 64, 128] |
| ReLU | — | [32, 64, 128] | [32, 64, 128] |
| Residual Block | 1, 8 | [32, 18, 1024] | [32, 64, 128] |
| CNN2dBlock | 2/3/5, stride 2 | [32, 18, 65, 113] | [32, 64, 8, 16] |
| BatchNorm | — | [32, 64, 8, 16] | [32, 64, 8, 16] |
| ReLU | — | [32, 64, 8, 16] | [32, 64, 8, 16] |
| MaxPooling | — | [32, 64, 8, 16] | [32, 64, 8, 16] |
| Residual Block | 1, 2 | [32, 18, 65, 113] | [32, 64, 8, 16] |
| Embedding | — | [32, 48, 64] | [32, 49, 64] |
| TransformerEncoder | 4, 256, 3 | [32, 49, 64] | [32, 49, 64] |
| Residual Block | 3, 1 | [32, 49, 64] | [32, 49, 64] |
| Fully Connected | 256 | [32, 3136] | [32, 7] |
Evaluation metrics include accuracy, precision, recall, and F1 score, defined as:
$$ \text{Accuracy} = \frac{N_{\text{TP}} + N_{\text{TN}}}{N_{\text{TP}} + N_{\text{TN}} + N_{\text{FP}} + N_{\text{FN}}} $$
$$ \text{Precision} = \frac{N_{\text{TP}}}{N_{\text{TP}} + N_{\text{FP}}} $$
$$ \text{Recall} = \frac{N_{\text{TP}}}{N_{\text{TP}} + N_{\text{FN}}} $$
$$ \text{F1 Score} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} $$
where \( N_{\text{TP}} \), \( N_{\text{TN}} \), \( N_{\text{FP}} \), and \( N_{\text{FN}} \) denote true positives, true negatives, false positives, and false negatives, respectively.
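These four metrics follow directly from the confusion-matrix counts; the counts below are made-up numbers for illustration only:

```python
def metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Hypothetical binary counts for one fault class
acc, p, r, f1 = metrics(tp=90, tn=880, fp=10, fn=20)
print(acc, p, round(r, 3), round(f1, 3))  # 0.97 0.9 0.818 0.857
```

In the multi-class setting, precision, recall, and F1 are computed per class from one-vs-rest counts and then averaged.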
Fault Diagnosis Results in Hovering Condition
Under hovering conditions, our ResMCNN-Transformer model achieves an overall accuracy of 97.73%, with all fault types exceeding 95% diagnostic success. Comparative experiments with baseline models—AlexNet, ResNet-18, and MCNN—show that our approach outperforms them by significant margins (19.24%, 11.54%, and 13.41% in accuracy, respectively). The training loss and accuracy curves demonstrate rapid convergence and stability, with the model reaching near-optimal performance within 60 epochs. Confusion matrices reveal balanced performance across classes, even for imbalanced categories like normal operation (4.6% of samples) and battery faults (4.6%). For instance, motor and load faults are identified with 100% accuracy, while other faults maintain precision above 95%.
Table 2 summarizes the performance metrics for different models. Our method achieves a precision of 0.961, recall of 0.925, and F1 score of 0.930, indicating robust classification across all fault types. In contrast, AlexNet and ResNet-18 show lower recall and F1 scores due to their inability to handle feature variability.
| Model | Precision | Recall | F1 Score |
|---|---|---|---|
| AlexNet | 0.726 | 0.663 | 0.643 |
| MCNN | 0.803 | 0.758 | 0.759 |
| ResNet-18 | 0.693 | 0.680 | 0.679 |
| ResMCNN-Transformer | 0.961 | 0.925 | 0.930 |
An ablation study further validates the importance of multi-source fusion. When 50% of the input features are randomly removed, the model requires 400 epochs, five times the standard training budget, to reach only 92.15% accuracy, underscoring the critical role of integrated data sources in capturing comprehensive fault representations.
Cross-Condition Transfer Learning Results
For circling conditions, transfer learning enables the model to adapt quickly, converging within 20 epochs and achieving an overall accuracy of 88%, close to the 90% accuracy obtained by full retraining. We compare three settings: (1) full model with Transformer and transfer learning, (2) ResMCNN-only transfer without attention, and (3) no transfer learning (full retraining). The full model with transfer learning achieves the best balance between speed and performance, with precision, recall, and F1 scores all around 88%. In contrast, the ResMCNN-only approach drops to 80% in overall metrics, highlighting the importance of the Transformer’s attention mechanism for feature adaptation.
Visualization of feature distributions using t-SNE reveals distinct clusters for source (hovering) and target (circling) domains, confirming the need for domain adaptation. The transfer learning strategy effectively aligns these distributions, enabling the model to generalize across conditions without significant performance degradation.
Conclusion
In this work, we present a comprehensive framework for fault diagnosis in quadrotor UAVs that combines multi-source information fusion with transfer learning. The ResMCNN-Transformer model effectively extracts and fuses temporal and time-frequency features, while the improved Transformer encoder captures global dependencies for accurate fault classification. Experimental results on the RflyMAD dataset demonstrate superior performance in hovering conditions, with high accuracy across varying fault severities. Moreover, the transfer learning strategy facilitates rapid adaptation to new flight conditions, such as circling, with minimal performance loss.
However, limitations remain. The data-driven approach requires substantial labeled data, which may not be available for all fault scenarios or flight modes. Future work will explore hybrid methods that integrate model-based techniques with data-driven approaches to enable few-shot learning and reduce dependency on large datasets. Additionally, extending the framework to real-world quadrotor systems with online fault detection capabilities could enhance practical applicability.
In summary, our approach addresses key challenges in quadrotor fault diagnosis, offering a scalable and adaptive solution for multi-operational environments. By leveraging multi-source fusion and transfer learning, we pave the way for more reliable and autonomous UAV operations in diverse applications.
