In recent years, the widespread adoption of unmanned aerial vehicles (UAVs), particularly quadrotor drones, has transformed sectors including agriculture, logistics, and surveillance. As a drone manufacturer, ensuring the reliability and safety of these systems under diverse operational conditions is paramount. Traditional fault diagnosis methods, which often rely on single-mechanism models or data from specific scenarios, struggle to address the complexities of varying fault severities and dynamic flight environments. This paper proposes an integrated approach combining multi-source information fusion and transfer learning to achieve adaptive fault diagnosis for quadrotor drones across multiple operational conditions. By fusing data from control signals and attitude information, extracting features through parallel 1D and 2D convolutional channels, and leveraging an enhanced Transformer encoder with transfer learning, the method enables rapid and high-precision diagnosis. Experimental validation on the RflyMAD dataset demonstrates strong performance, with 97.73% accuracy in hovering conditions and effective adaptation to circling conditions via transfer learning, reaching 88% accuracy within roughly 20 fine-tuning epochs. This approach addresses critical challenges faced by drone manufacturers in maintaining operational integrity and reducing downtime.

The core of this methodology lies in multi-source information fusion, which integrates data from diverse sensors and control systems. For a drone manufacturer, this involves harmonizing inputs such as motor RPM, Euler angles, and inertial measurement unit (IMU) data. Data fusion begins with down-sampling high-frequency signals and applying interpolation to align timestamps, followed by filtering to remove noise. This process ensures temporal consistency and enhances data quality. The fused data is then subjected to feature extraction, where time-domain signals are processed using 1D convolutional networks, and time-frequency representations, generated via the short-time Fourier transform (STFT, consistent with the STFTLayer in Table 2), are analyzed with 2D convolutional networks. The extracted features are aligned and concatenated to form a unified representation, incorporating residual connections to preserve low-level details. This feature fusion enables the model to capture both local and global patterns essential for diagnosing faults of varying severities.
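The alignment step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: channel names, sampling rates, and the linear-interpolation choice are assumptions, and only interpolation onto a common grid plus Z-score normalization is shown.

```python
import numpy as np

def align_and_fuse(signals, target_rate=120.0):
    """Resample heterogeneous sensor streams onto a common time base.

    `signals` maps a channel name to (timestamps, values); sampling
    rates may differ per channel. Each stream is linearly interpolated
    onto a shared uniform grid, then Z-score normalized.
    """
    t_start = max(t[0] for t, _ in signals.values())
    t_end = min(t[-1] for t, _ in signals.values())
    n = int(np.floor((t_end - t_start) * target_rate))
    grid = t_start + np.arange(n) / target_rate
    fused = []
    for name, (t, v) in sorted(signals.items()):
        resampled = np.interp(grid, t, v)                  # align timestamps
        resampled = resampled - resampled.mean()           # center
        resampled = resampled / (resampled.std() + 1e-8)   # Z-score normalize
        fused.append(resampled)
    return np.stack(fused)  # shape: (channels, samples)

# Hypothetical example: a 400 Hz motor-RPM stream and a 100 Hz
# pitch-angle stream, both 2 s long, fused onto a 120 Hz grid.
rpm = (np.linspace(0.0, 2.0, 800), np.random.randn(800))
pitch = (np.linspace(0.0, 2.0, 200), np.random.randn(200))
x = align_and_fuse({"pitch": pitch, "rpm": rpm})
# x.shape == (2, 240): two channels on a common 120 Hz time base
```

In practice a low-pass or anti-aliasing filter would precede the down-sampling; it is omitted here to keep the sketch short.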
To handle the complexities of different operational conditions, transfer learning is employed. Initially, the model is pre-trained on data from a source domain, such as hovering flights, where labeled data is abundant. The learned features, which include generic fault characteristics, are then transferred to a target domain, like circling flights, by fine-tuning the Transformer encoder while freezing the feature extraction modules. This strategy allows a drone manufacturer to quickly adapt the diagnostic system to new environments without extensive retraining, significantly reducing development costs and improving scalability. The integration of multi-head attention mechanisms in the Transformer facilitates adaptive weighting of features based on operational contexts, further enhancing generalization.
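The freeze-and-fine-tune strategy amounts to excluding the feature-extraction parameters from gradient updates. The toy sketch below illustrates this with a plain dictionary of parameters and a manual SGD step; the module names (`resmcnn.*`, `transformer.*`) are illustrative, not the paper's actual identifiers, and a real implementation would rely on a deep-learning framework's parameter-freezing facilities.

```python
import numpy as np

# Toy parameter store mirroring the two-stage model: pre-trained
# feature-extractor weights ("resmcnn.*") and Transformer weights
# ("transformer.*"). Names are illustrative assumptions.
rng = np.random.default_rng(0)
params = {
    "resmcnn.conv1d.W": rng.standard_normal((8, 3)),
    "resmcnn.conv2d.W": rng.standard_normal((8, 3, 3)),
    "transformer.attn.Wq": rng.standard_normal((64, 64)),
    "transformer.ffn.W1": rng.standard_normal((64, 256)),
}
FROZEN_PREFIX = "resmcnn."  # pre-trained on hovering data; kept fixed

def sgd_step(params, grads, lr=1e-3):
    """Update only unfrozen parameters (i.e., fine-tune the Transformer)."""
    for name, g in grads.items():
        if name.startswith(FROZEN_PREFIX):
            continue  # frozen: skip the update
        params[name] = params[name] - lr * g
    return params

before = {k: v.copy() for k, v in params.items()}
grads = {k: np.ones_like(v) for k, v in params.items()}  # dummy gradients
params = sgd_step(params, grads)
# resmcnn.* weights are unchanged; transformer.* weights moved by -lr
```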
The fault diagnosis model, termed ResMCNN-Transformer, comprises two main components: a multi-scale residual convolutional network (ResMCNN) and an improved Transformer encoder. The ResMCNN module employs parallel convolutional paths with kernels of different sizes (e.g., 1D kernels for temporal features and 2D kernels for time-frequency features) to capture multi-scale characteristics. The outputs are fused through concatenation and linear projection, with residual blocks added to mitigate gradient vanishing and stabilize training. The fused features are then passed to the Transformer encoder, which utilizes self-attention mechanisms to model long-range dependencies and a feed-forward network for deep analysis. A lightweight convolutional layer is incorporated before the final classification head to emphasize local details, ensuring robust fault classification. The overall process can be summarized mathematically as follows:
Let $x$ represent the input multi-source data, which includes control signals and sensor readings. After data fusion, the time-domain signals $x_{t}$ and time-frequency representations $x_{tf}$ are processed in parallel. For the 1D convolutional branch:
$$f_{1d} = \text{ReLU}(W_{1d} * x_{t} + b_{1d})$$
and for the 2D convolutional branch:
$$f_{2d} = \text{ReLU}(W_{2d} * x_{tf} + b_{2d})$$
where $W_{1d}$ and $W_{2d}$ are convolution kernels, $*$ denotes the convolution operation, and $b_{1d}$ and $b_{2d}$ are bias terms. The features are then concatenated and fused:
$$z = \text{ReLU}(W \cdot [f_{1d} \oplus f_{2d}] + b)$$
where $\oplus$ represents channel-wise concatenation, and $W$ and $b$ are learnable parameters for fusion. A residual function $F(z)$ is applied:
$$F(z) = W_2 \cdot \text{ReLU}(W_1 \cdot z + b_1) + b_2$$
and the output is $y = z + F(z)$. The features are flattened and input to the Transformer encoder, which computes self-attention as:
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
where $Q$, $K$, and $V$ are query, key, and value matrices derived from the input, and $d_k$ is the dimension of the keys. The feed-forward network applies:
$$\text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$$
Finally, the output is classified using a softmax layer. This architecture enables the model to handle imbalanced data and varying fault types effectively.
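The self-attention and feed-forward equations above can be realized directly in NumPy. The sketch below uses the token count (49) and model dimension (64) from Table 2 and a feed-forward width of 256; the random weights and single attention head are illustrative simplifications of the 4-head encoder.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = K.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

def ffn(x, W1, b1, W2, b2):
    """Position-wise feed-forward network: max(0, x W1 + b1) W2 + b2."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
n_tokens, d_model, d_ff = 49, 64, 256  # token/feature sizes from Table 2
X = rng.standard_normal((n_tokens, d_model))
Wq, Wk, Wv = (rng.standard_normal((d_model, d_model)) for _ in range(3))
W1, b1 = rng.standard_normal((d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.standard_normal((d_ff, d_model)), np.zeros(d_model)
out = ffn(self_attention(X, Wq, Wk, Wv), W1, b1, W2, b2)
# out.shape == (49, 64): the sequence shape is preserved, as in Table 2
```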
Experiments were conducted using the RflyMAD dataset, which includes data from hovering and circling flights covering seven classes: normal operation plus six fault types (motor, propeller, wind, load, sensor, and battery faults). The dataset comprises 1,538 samples, with labels and distributions as shown in Table 1. Data preprocessing involved resampling to 120 Hz, noise injection with Gaussian jitter, and Z-score normalization to standardize inputs. The model was trained for 80 epochs with an initial learning rate of 0.005, decaying exponentially by 5% per epoch, and cross-entropy loss was used for optimization.
| Fault Type | Sample Count | Label |
|---|---|---|
| Normal | 70 | 0 |
| Motor | 470 | 1 |
| Propeller | 238 | 2 |
| Wind | 200 | 3 |
| Load | 200 | 4 |
| Sensor | 290 | 5 |
| Battery | 70 | 6 |
The model parameters are detailed in Table 2. The ResMCNN module uses convolutional kernels of sizes 2, 4, and 8 for 1D processing and 2×2, 3×3, and 5×5 for 2D processing, with residual blocks to enhance feature flow. The Transformer encoder has 4 attention heads and a hidden dimension of 256. Evaluation metrics include accuracy, precision, recall, and F1 score, defined as:
$$\text{Accuracy} = \frac{N_{TP} + N_{TN}}{N_{TP} + N_{TN} + N_{FP} + N_{FN}}$$
$$\text{Precision} = \frac{N_{TP}}{N_{TP} + N_{FP}}$$
$$\text{Recall} = \frac{N_{TP}}{N_{TP} + N_{FN}}$$
$$\text{F1 Score} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
where $N_{TP}$, $N_{TN}$, $N_{FP}$, and $N_{FN}$ represent true positives, true negatives, false positives, and false negatives, respectively.
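These four metrics follow directly from the confusion counts, as the short sketch below shows (the example counts are hypothetical, not taken from the experiments):

```python
def metrics_from_counts(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1 from one-vs-rest confusion
    counts, matching the definitions above."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Hypothetical counts for a single fault class out of 1,000 samples
acc, p, r, f1 = metrics_from_counts(tp=90, tn=880, fp=10, fn=20)
# acc = 0.97, p = 0.90, r = 90/110 ≈ 0.818, f1 = 6/7 ≈ 0.857
```

For the seven-class problem, per-class scores computed this way are typically averaged (macro-averaged) to yield the single values reported in Table 3.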
| Module | Parameters | Input Size | Output Size |
|---|---|---|---|
| Original Input | — | [32, 18, 1024] | [32, 18, 1024] |
| STFTLayer | 128/8/128 | [32, 18, 1024] | [32, 18, 65, 113] |
| CNN1dBlock | 2/4/8, 2 | [32, 18, 1024] | [32, 64, 128] |
| BatchNorm | — | [32, 64, 128] | [32, 64, 128] |
| ReLU | — | [32, 64, 128] | [32, 64, 128] |
| Residual Block | 1, 8 | [32, 18, 1024] | [32, 64, 128] |
| CNN2dBlock | 2/3/5, 2 | [32, 18, 65, 113] | [32, 64, 8, 16] |
| BatchNorm | — | [32, 64, 8, 16] | [32, 64, 8, 16] |
| ReLU | — | [32, 64, 8, 16] | [32, 64, 8, 16] |
| MaxPooling | — | [32, 64, 8, 16] | [32, 64, 8, 16] |
| Residual Block | 1, 2 | [32, 18, 65, 113] | [32, 64, 8, 16] |
| Embedding | — | [32, 48, 64] | [32, 49, 64] |
| TransformerEncoder | 4, 256, 3 | [32, 49, 64] | [32, 49, 64] |
| Residual Block | 3, 1 | [32, 49, 64] | [32, 49, 64] |
| Fully Connected | 256 | [32, 3136] | [32, 7] |
In hovering conditions, the proposed ResMCNN-Transformer model achieved an accuracy of 97.73%, outperforming baseline models such as AlexNet (78.28%), ResNet-18 (84.32%), and MCNN (85.69%). The confusion matrix showed perfect classification for motor and load faults, with all fault types exceeding 95% accuracy. Precision, recall, and F1 scores were consistently above 92.5%, as summarized in Table 3. Ablation studies confirmed the necessity of multi-source fusion: a variant with 50% of the input features randomly removed required 400 training epochs to reach only 92.15% accuracy, highlighting the importance of comprehensive data integration for a drone manufacturer seeking reliable diagnostics.
| Model | Precision | Recall | F1 Score |
|---|---|---|---|
| AlexNet | 0.726 | 0.663 | 0.643 |
| MCNN | 0.803 | 0.758 | 0.759 |
| ResNet-18 | 0.693 | 0.680 | 0.679 |
| ResMCNN-Transformer | 0.961 | 0.925 | 0.930 |
For cross-condition adaptation, transfer learning was applied to circling flights using the pre-trained hovering model. By freezing the ResMCNN parameters and fine-tuning the Transformer, the model converged in approximately 20 epochs, achieving 88% accuracy across metrics, close to the 90% achieved by full retraining. In contrast, models without attention mechanisms or transfer learning showed reduced performance (around 80%), emphasizing the role of adaptive feature weighting. This approach enables a drone manufacturer to deploy fault diagnosis systems that quickly adapt to new operational scenarios, minimizing downtime and maintenance costs.
In conclusion, the integration of multi-source information fusion and transfer learning provides a robust solution for quadrotor drone fault diagnosis across varying conditions. The ResMCNN-Transformer model effectively classifies faults of different severities and adapts to new environments with minimal retraining. For a drone manufacturer, this translates to enhanced operational safety and efficiency. Future work will focus on incorporating model-based methods to address data scarcity and further improve generalization in diverse real-world scenarios.
