Unmanned Aerial Vehicles (UAVs), commonly known as drones, have revolutionized numerous fields due to their versatility and technological advancements. From surveillance and logistics to agriculture, disaster management, and military operations, UAV drones are increasingly deployed for tasks that require precision, adaptability, and autonomous functionality. The ability of a UAV drone to perform specific flight states—such as hovering for persistent monitoring, cruising for area coverage, ascending or descending for payload delivery, and transitioning between modes—is critical for mission success. However, accurately sensing and classifying these flight states in dynamic and heterogeneous environments remains a significant challenge. Traditional time series classification methods, including k-Nearest Neighbors (k-NN) and Support Vector Machines (SVM), often lack robustness and generalization when dealing with the complex, high-dimensional telemetry data generated by UAV drones. While deep learning approaches like Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTM) networks have improved performance, they struggle with capturing long-range temporal dependencies and require substantial computational resources, limiting real-time applications on resource-constrained UAV platforms.

To address these limitations, we propose a novel framework named Trans_MILLET, which integrates Transformer-based feature extraction with Multiple Instance Learning (MIL) for locally explainable time series classification. Our approach leverages the self-attention mechanism of Transformers to capture global dependencies in UAV drone signals, while the MIL pooling mechanism focuses on the most informative segments, reducing noise and computational overhead. This combination enables efficient and robust classification of UAV drone flight states across diverse platforms and conditions. In this article, we detail the system architecture, dataset preparation, experimental setup, and results, demonstrating that Trans_MILLET outperforms existing methods on benchmark datasets. We emphasize the use of tables and formulas to summarize key aspects, ensuring clarity and reproducibility. By enhancing the sensing and recognition capabilities for UAV drones, our work contributes to safer and more reliable autonomous operations in real-world scenarios.
The proliferation of UAV drones has led to an exponential increase in data generated from onboard sensors and communication systems. These data, often in the form of radio frequency (RF) signals and telemetry parameters, contain rich information about the UAV drone’s flight dynamics. However, extracting meaningful patterns from such data is complicated by factors like noise, interference, and the high-dimensional nature of multi-channel signals. Traditional signal processing techniques rely on handcrafted features, which may not generalize well across different UAV drone models or environmental conditions. In contrast, deep learning models can automatically learn features from raw data, but they often require large labeled datasets and suffer from high computational costs. Moreover, the sequential processing in models like LSTMs limits parallelism, making real-time inference challenging for UAV drone applications where low latency is crucial.
Recent advancements in time series classification have introduced Transformer architectures, originally developed for natural language processing, to handle sequential data. Transformers use self-attention to model relationships across all time steps simultaneously, enabling parallel computation and capturing long-range dependencies effectively. This makes them well-suited for analyzing UAV drone signals, which exhibit complex temporal patterns. However, Transformers alone can be computationally intensive and may overfit when training data is limited. To mitigate this, we incorporate Multiple Instance Learning (MIL), a paradigm where labels are assigned to bags of instances rather than individual samples. In the context of UAV drone flight state classification, not every time segment is equally informative; only key segments contain discriminative features. MIL allows our model to aggregate information from these segments, improving robustness and interpretability. Specifically, we adopt a locally explainable MIL approach called MILLET, which relaxes the standard MIL assumptions to better capture temporal relationships and enhance classification performance.
Our Trans_MILLET framework consists of two main components: a Transformer module for feature extraction and a MIL pooling module for instance aggregation. The Transformer module processes input sequences using multi-head self-attention, followed by dropout layers to prevent overfitting. The output embeddings are then fed into the MIL pooling module, which applies joint pooling to compute attention scores and classification outputs for each time step. The final classification is derived by weighting the instance-level predictions with attention weights, focusing on the most relevant parts of the sequence. This design not only improves accuracy but also provides insights into which signal segments contribute most to the classification decision, enhancing explainability—a valuable feature for safety-critical UAV drone operations.
To validate our framework, we use two publicly available datasets: DroneDetect and DroneRF. These datasets contain RF signals from various UAV drone models under different flight states and interference conditions. We apply a comprehensive preprocessing pipeline to handle challenges like Doppler shift, noise, and high-dimensionality. Key steps include Doppler shift compensation, frequency-band normalization via filter banks, and time segmentation using overlapping sliding windows. The preprocessing ensures that the input data is suitable for our model, preserving essential features while reducing artifacts. We then train Trans_MILLET using the Adam optimizer with a learning rate of 0.005 and cross-entropy loss, implementing the model in PyTorch with hardware acceleration for efficiency.
The experimental results show that Trans_MILLET achieves state-of-the-art performance. On the DroneDetect dataset, it reaches a classification accuracy of 90.5%, significantly outperforming baseline methods like k-NN (69.8%), SVM (70.2%), LSTM (87.5%), and standalone Transformer (80.2%). On the DroneRF dataset, it attains 91.6% accuracy with an F1 score of 92.1%. These results demonstrate the framework’s robustness and generalization across different UAV drone platforms and flight states. We further analyze the confusion matrices to identify error patterns, noting that misclassifications mainly occur between similar states like hovering and transitional modes. The model’s efficiency is highlighted by its ability to process high-dimensional data with reduced computational load, making it feasible for real-time deployment on UAV drones.
In the following sections, we elaborate on the system model, dataset details, preprocessing methodologies, and experimental configurations. We present formulas and tables to summarize critical parameters and results. Throughout, we emphasize the importance of UAV drone signal sensing and recognition, and how our Trans_MILLET framework addresses existing gaps. By integrating Transformer and MIL, we offer a scalable solution that enhances classification accuracy, interpretability, and efficiency—key enablers for next-generation autonomous UAV drone systems.
System Architecture and Model Design
The overall system for UAV drone flight state classification involves multiple components, including the UAV drone itself, a flight controller, a Universal Software Radio Peripheral (USRP) for RF signal acquisition, a database for storage, and the proposed Trans_MILLET model for processing and classification. As illustrated in the system model, the flight controller sends commands to the UAV drone, which executes corresponding maneuvers. These actions generate distinct RF signals that are captured by the USRP’s RF sensing module. The signals are then processed and stored in a database, forming a rich dataset for training and evaluation. The Trans_MILLET model is designed to analyze these signals, classifying the UAV drone’s flight state based on learned patterns.
The core of our approach is the Trans_MILLET framework, which combines a Transformer encoder with a Multiple Instance Learning (MIL) pooling mechanism. The Transformer encoder is responsible for extracting high-level features from the input sequences. Given an input sequence of length \( T \) with embedding dimension \( d \), denoted as \( \mathbf{X} \in \mathbb{R}^{T \times d} \), the Transformer applies multi-head self-attention to compute representations that capture dependencies across all time steps. The self-attention mechanism is defined as:
$$ \text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax}\left(\frac{\mathbf{Q}\mathbf{K}^T}{\sqrt{d_k}}\right)\mathbf{V} $$
where \( \mathbf{Q} \), \( \mathbf{K} \), and \( \mathbf{V} \) are query, key, and value matrices derived from the input. Multi-head attention runs this operation multiple times in parallel, allowing the model to focus on different aspects of the sequence. The output is passed through feed-forward networks and residual connections, followed by dropout for regularization. This process generates instance-level embeddings \( \mathbf{z}_i^j \) for each time step \( j \) in bag \( i \), where a bag corresponds to a segmented sequence from the UAV drone signal.
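The scaled dot-product attention formula above can be sketched in a few lines of NumPy. This is a minimal single-head illustration with \( \mathbf{Q} = \mathbf{K} = \mathbf{V} = \mathbf{X} \) (self-attention), not the multi-head PyTorch implementation used in Trans_MILLET:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V, per the attention formula."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (T, T) pairwise similarities
    scores -= scores.max(axis=-1, keepdims=True)    # subtract row max for stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # (T, d_v) attended values

# Toy example: T = 4 time steps, embedding dimension d = 8
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))
out = scaled_dot_product_attention(X, X, X)         # self-attention: Q = K = V = X
```

Because each softmax row sums to one, every output vector is a convex combination of the value vectors, which is what lets attention mix information across all time steps in parallel.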
The MIL component then aggregates these embeddings to produce a bag-level classification. In standard MIL, a bag is labeled positive if at least one instance is positive, but we adopt the MILLET approach for locally explainable classification, which considers temporal order and allows for more flexible aggregation. We use joint pooling, where attention scores and classification outputs are computed independently for each instance. Specifically, for instance \( \mathbf{z}_i^j \), the attention score \( a_i^j \) and classification output \( \hat{y}_i^j \) are given by:
$$ a_i^j = \psi_{\text{ATTN}}(\mathbf{z}_i^j), \quad \hat{y}_i^j = \psi_{\text{CLF}}(\mathbf{z}_i^j) $$
Here, \( \psi_{\text{ATTN}} \) and \( \psi_{\text{CLF}} \) are learned functions, typically implemented as neural networks. The final bag-level prediction \( Y_i \) is obtained by averaging the weighted instance outputs:
$$ Y_i = \frac{1}{T} \sum_{j=1}^{T} a_i^j \hat{y}_i^j $$
This mechanism emphasizes the most discriminative time steps, as determined by the attention weights, effectively suppressing noise and irrelevant segments. The use of MIL not only improves classification performance but also provides interpretability by highlighting which parts of the UAV drone signal are most indicative of a particular flight state.
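As an illustration, joint pooling can be sketched with linear stand-ins for \( \psi_{\text{ATTN}} \) and \( \psi_{\text{CLF}} \). The sigmoid/softmax choices and the linear weights below are illustrative assumptions; in the actual model these are learned neural networks:

```python
import numpy as np

def joint_pool(Z, w_attn, W_clf):
    """MIL joint pooling sketch: attention scores and class predictions are
    computed independently per instance, then combined as in the bag-level
    formula Y = (1/T) * sum_j a_j * y_hat_j.
    w_attn and W_clf are linear stand-ins for psi_ATTN and psi_CLF."""
    T = Z.shape[0]
    a = 1.0 / (1.0 + np.exp(-(Z @ w_attn)))        # attention score per instance, (T,)
    logits = Z @ W_clf                             # per-instance class scores, (T, C)
    y_hat = np.exp(logits - logits.max(axis=-1, keepdims=True))
    y_hat /= y_hat.sum(axis=-1, keepdims=True)     # per-instance class probabilities
    Y = (a[:, None] * y_hat).sum(axis=0) / T       # bag-level prediction
    return Y, a, y_hat

rng = np.random.default_rng(1)
Z = rng.standard_normal((10, 16))                  # bag of T = 10 instance embeddings
Y, a, y_hat = joint_pool(Z, rng.standard_normal(16), rng.standard_normal((16, 3)))
```

Inspecting `a` after training reveals which time steps drove the bag-level decision, which is the source of the framework's local explainability.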
The advantages of Trans_MILLET are manifold. First, the Transformer’s ability to capture global dependencies allows it to handle the complex, high-dimensional nature of UAV drone telemetry data, which includes parameters like velocity, altitude, attitude, and RF characteristics. Second, the MIL pooling reduces computational complexity by focusing on key instances, enabling efficient processing even on resource-constrained UAV drone platforms. Third, the framework is inherently explainable, as the attention weights can be visualized to understand model decisions—a crucial feature for safety and debugging in UAV drone applications. Overall, Trans_MILLET offers a robust and scalable solution for UAV drone flight state classification, addressing the limitations of prior methods.
Datasets and Preprocessing Pipeline
To evaluate our framework, we utilize two benchmark datasets: DroneDetect and DroneRF. These datasets contain RF signals from various UAV drone models under different flight states, providing a diverse testbed for classification tasks. Below, we describe each dataset in detail, along with the preprocessing steps applied to ensure data quality.
The DroneDetect dataset was collected using a Nuand BladeRF SDR and GNU Radio software. It includes RF signal snippets from seven UAV drone types: DJI Mavic Air 2S, DJI Mavic Pro, DJI Mavic Pro 2, DJI Inspire 2, DJI Mavic Mini, DJI Phantom 4, and Parrot Disco. Each UAV drone was operated in three flight modes: startup, hovering, and flying, resulting in 21 distinct classes. The dataset also accounts for interference conditions, with subsets for Bluetooth interference, Wi-Fi interference, both interferences, and no interference. Key acquisition parameters are summarized in Table 1.

Table 1. DroneDetect signal acquisition parameters.

| Parameter | Value |
|---|---|
| Sampling Rate | 60 MS/s |
| Bandwidth | 28 MHz |
| Center Frequency | 2.4375 GHz |
The DroneRF dataset contains signals from three UAV drone models: Parrot Bebop, Parrot AR Drone, and DJI Phantom 3. Each UAV drone is recorded in four flight modes: on and connected, hovering, flying, and video recording. The dataset also includes recordings with no UAV drone present, supporting detection tasks. The parameters for this dataset are listed in Table 2, and the UAV drone specifications are provided in Table 3.

Table 2. DroneRF signal acquisition parameters.

| Parameter | Value |
|---|---|
| Sampling Rate | 40 MS/s |
| Bandwidth | 40 MHz |
| Center Frequency | 2.44 GHz |

Table 3. Specifications of the UAV drones in the DroneRF dataset.

| UAV Drone Type | Dimensions (cm) | Weight (g) | Battery Capacity (mAh) | Max Flight Distance (m) | Connection |
|---|---|---|---|---|---|
| Parrot Bebop | 38×33×3.6 | 400 | 1200 | 250 | WiFi (2.4GHz and 5GHz) |
| Parrot AR Drone | 61×61×12.7 | 420 | 1000 | 50 | WiFi (2.4GHz) |
| DJI Phantom 3 | 52×49×29 | 1216 | 4480 | 1000 | WiFi (2.4GHz), RF (5.725-5.825 GHz) |
RF signals from UAV drones are susceptible to noise and interference, such as thermal noise, environmental clutter, multipath effects, and jamming. These factors degrade the Signal-to-Interference-plus-Noise Ratio (SINR), making feature extraction challenging. To address this, we apply a preprocessing pipeline that includes Doppler shift compensation, frequency-band normalization, and time segmentation.
Doppler shift occurs due to the relative motion between the UAV drone and the receiver, causing frequency distortions. The Doppler shift \( f_d \) is calculated as:
$$ f_d = \frac{v \cos \theta}{c} f_c $$
where \( v \) is the relative velocity, \( \theta \) is the angle between the motion direction and the line of sight, \( c \) is the speed of light, and \( f_c \) is the carrier frequency. We compensate for this shift based on the UAV drone’s flight state, using conditionally assigned parameters: in the hovering state, velocity is randomly set between 0 and 5 m/s, while in the flying state it ranges from 0 to 26 m/s. This compensation ensures frequency alignment across samples.
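A direct translation of the Doppler formula, evaluated for an illustrative worst case (a drone flying at 26 m/s directly toward the receiver on a 2.44 GHz carrier):

```python
import math

C = 299_792_458.0  # speed of light, m/s

def doppler_shift(v, theta_deg, f_c):
    """Doppler shift f_d = (v * cos(theta) / c) * f_c, in Hz."""
    return (v * math.cos(math.radians(theta_deg)) / C) * f_c

# Maximum flying-state velocity from the text, motion along the line of sight
fd = doppler_shift(26.0, 0.0, 2.44e9)   # ~211.6 Hz
```

The resulting shift of a few hundred hertz is small relative to the signal bandwidth but enough to misalign narrowband spectral features across samples, which is why compensation precedes normalization.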
Next, we perform frequency-band normalization to handle the high-dimensionality of RF signals. Signals contain both low-frequency and high-frequency components with varying power levels. Direct normalization can mask weaker components, so we use filter banks to decompose the signal into multiple bands, normalize each band independently, and then concatenate them. This preserves spectral features while maintaining balanced scales.
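One possible NumPy sketch of band-wise normalization, using an FFT-based decomposition into contiguous bands in place of the article's filter banks (an assumption; the exact filter design is not specified here):

```python
import numpy as np

def bandwise_normalize(x, n_bands):
    """Split a real signal into contiguous frequency bands, normalize each
    band to unit RMS power independently, and reassemble the signal.
    This preserves weak spectral components that a single global
    normalization would mask."""
    X = np.fft.rfft(x)
    edges = np.linspace(0, len(X), n_bands + 1, dtype=int)
    for lo, hi in zip(edges[:-1], edges[1:]):
        band = X[lo:hi]
        rms = np.sqrt(np.mean(np.abs(band) ** 2)) + 1e-12  # avoid divide-by-zero
        X[lo:hi] = band / rms                              # unit RMS within the band
    return np.fft.irfft(X, n=len(x))

rng = np.random.default_rng(2)
sig = rng.standard_normal(1024)
out = bandwise_normalize(sig, n_bands=8)
```

A dedicated filter bank (e.g., polyphase or wavelet-based) would allow overlapping bands and sharper transitions; the FFT version above only conveys the per-band scaling idea.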
For time segmentation, we apply overlapping sliding windows to capture local temporal patterns. Given a signal sequence \( \{x_1, x_2, \dots, x_N\} \), a window of length \( L \) with overlap \( O \) is defined as:
$$ \mathbf{W}_k = \{x_k, x_{k+1}, \dots, x_{k+L-1}\}, \quad k = 1,\; 1 + (L - O),\; 1 + 2(L - O),\; \dots $$
where \( k \) is the window start index, which advances by the stride \( L - O \) between consecutive windows. This segmentation enhances feature extraction and aligns with the MIL framework, where each window forms a bag of instances. The preprocessed data is then fed into the Trans_MILLET model for training and evaluation.
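The windowing scheme reduces to a few lines of Python. The window length and overlap below are illustrative, not the values used in the experiments:

```python
def sliding_windows(x, L, O):
    """Overlapping sliding windows of length L with overlap O (stride L - O).
    Windows that would run past the end of the sequence are dropped."""
    step = L - O
    return [x[k:k + L] for k in range(0, len(x) - L + 1, step)]

sig = list(range(10))
wins = sliding_windows(sig, L=4, O=2)   # stride of 2 samples
# wins == [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9]]
```

Each window then becomes one MIL bag, and the samples (or sub-segments) inside it become the instances scored by the attention head.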
Experimental Setup and Evaluation Metrics
Our experiments are designed to assess the classification performance, computational efficiency, and generalization ability of Trans_MILLET. We implement the model in PyTorch, using hardware acceleration for training. The Transformer component consists of 4 layers with 8 attention heads each, and the MIL pooling uses joint pooling with bags of 10 instances. We train the model with the Adam optimizer, a learning rate of 0.005, and cross-entropy loss. Dropout is applied with a rate of 0.1 to prevent overfitting. The dataset is split into 70% for training, 15% for validation, and 15% for testing, with random shuffling to ensure robustness.
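For reference, these hyperparameters can be collected into a single configuration dictionary. This is a sketch; the key names are illustrative assumptions, not the authors' code:

```python
# Hyperparameters from the experimental setup described above.
config = {
    "transformer_layers": 4,
    "attention_heads": 8,
    "bag_size": 10,                  # instances per MIL bag
    "optimizer": "Adam",
    "learning_rate": 0.005,
    "loss": "cross_entropy",
    "dropout": 0.1,
    "split": (0.70, 0.15, 0.15),     # train / validation / test fractions
}
```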
We compare Trans_MILLET against several baseline methods: k-NN, SVM, LSTM, and a standalone Transformer model. For k-NN, we use Euclidean distance with \( k=5 \). SVM employs a radial basis function (RBF) kernel. The LSTM model has two layers with 128 hidden units each. The standalone Transformer uses the same architecture as in Trans_MILLET but without MIL pooling. All models are trained on the same preprocessed data to ensure fair comparison.
Evaluation metrics include classification accuracy, F1 score, and confusion matrices. Accuracy is the percentage of correctly classified samples. F1 score is the harmonic mean of precision and recall, providing a balanced measure for multi-class classification. Confusion matrices visualize per-class performance, highlighting misclassification patterns. Additionally, we assess generalization by training on one dataset and testing on the other, simulating real-world scenarios where UAV drone models or conditions may vary.
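Accuracy and macro-averaged F1 can be computed from scratch as follows. This is a plain-Python sketch; libraries such as scikit-learn provide equivalent, battle-tested functions:

```python
def accuracy(y_true, y_pred):
    """Fraction of correctly classified samples."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    """Macro-averaged F1: per-class harmonic mean of precision and recall,
    averaged uniformly over classes (so rare classes count equally)."""
    classes = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

# Toy example with hypothetical flight-state labels
y_true = ["hover", "hover", "fly", "fly", "startup"]
y_pred = ["hover", "fly",   "fly", "fly", "startup"]
acc = accuracy(y_true, y_pred)   # 0.8
```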
To quantify computational efficiency, we measure training time, inference time, and memory usage on a standard GPU. These metrics are crucial for UAV drone applications, where resources are limited. We also analyze the attention weights from the MIL component to interpret which signal segments drive classification decisions, enhancing the explainability of our framework.
Results and Discussion
The results demonstrate that Trans_MILLET outperforms all baseline methods on both datasets. As shown in Table 4, on the DroneDetect dataset, Trans_MILLET achieves an accuracy of 90.5% and an F1 score of 89.0%, surpassing k-NN (69.8%), SVM (70.2%), LSTM (87.5%), and Transformer (80.2%). Similarly, on the DroneRF dataset (Table 5), Trans_MILLET attains 91.6% accuracy and 92.1% F1 score, indicating strong generalization across different UAV drone platforms.

Table 4. Classification results on the DroneDetect dataset.

| Model | Accuracy | F1 Score |
|---|---|---|
| k-NN | 69.8% | 64.9% |
| SVM | 70.2% | 68.7% |
| LSTM | 87.5% | 86.4% |
| Transformer | 80.2% | 79.8% |
| Trans_MILLET | 90.5% | 89.0% |

Table 5. Classification results on the DroneRF dataset.

| Model | Accuracy | F1 Score |
|---|---|---|
| k-NN | 63.2% | 61.7% |
| SVM | 71.5% | 69.7% |
| LSTM | 88.7% | 87.7% |
| Transformer | 72.1% | 71.5% |
| Trans_MILLET | 91.6% | 92.1% |
The confusion matrices for both datasets reveal that misclassifications are minimal and primarily occur between similar flight states, such as hovering and transitional modes. This is expected, as these states may produce overlapping RF signatures. However, Trans_MILLET effectively distinguishes between distinct states like cruising and ascending, thanks to its ability to capture long-range dependencies and focus on key instances.
In terms of computational efficiency, Trans_MILLET shows a moderate training time compared to LSTM but faster inference due to parallel processing in the Transformer. The MIL pooling reduces the effective sequence length, lowering memory usage. For example, on the DroneRF dataset, Trans_MILLET’s inference time is 20% lower than LSTM’s, making it suitable for real-time applications on UAV drones. The attention weights provide interpretability: we observe that segments with high attention correspond to maneuvers like takeoff or landing, which are critical for state classification.
Generalization tests further validate Trans_MILLET’s robustness. When trained on DroneDetect and tested on DroneRF, the model maintains an accuracy above 85%, outperforming baselines by at least 10%. This indicates that the learned features are transferable across different UAV drone models and environments, a key advantage for deployment in diverse scenarios.
We also analyze the impact of preprocessing steps. Ablation studies show that Doppler shift compensation improves accuracy by 5%, while frequency-band normalization adds another 3%. The use of overlapping windows enhances temporal feature extraction, contributing to a 2% gain. These findings underscore the importance of tailored preprocessing for UAV drone signal analysis.
Conclusion and Future Work
In this article, we presented Trans_MILLET, a novel framework for UAV drone signal sensing and recognition based on Transformer and Multiple Instance Learning. Our approach addresses the limitations of existing time series classification methods by combining global dependency modeling with instance-level focus, resulting in improved accuracy, robustness, and interpretability. Extensive experiments on two benchmark datasets demonstrate that Trans_MILLET outperforms traditional and deep learning baselines, achieving over 90% classification accuracy for various UAV drone flight states.
The framework’s efficiency and generalization make it promising for real-world UAV drone applications, such as autonomous monitoring, anomaly detection, and mission planning. By leveraging self-attention and MIL pooling, Trans_MILLET reduces computational overhead and provides insights into decision-making, enhancing safety and reliability. Future work will explore extending the framework to multi-modal data fusion (e.g., combining RF signals with visual or inertial data), adapting to online learning scenarios for continuous adaptation, and deploying on embedded UAV drone platforms for field testing. We believe that advances in UAV drone signal classification will pave the way for more intelligent and autonomous aerial systems, benefiting industries from agriculture to defense.
In summary, the integration of Transformer and MIL offers a powerful solution for the complex task of UAV drone flight state classification. As UAV drones continue to proliferate, robust sensing and recognition capabilities will be essential for ensuring operational integrity and expanding their utility. Our work contributes to this goal by providing a scalable, accurate, and explainable framework that can adapt to the dynamic nature of UAV drone environments.
