With the rapid advancement of drone technology, Unmanned Aerial Vehicles (UAVs) have become increasingly prevalent in various applications, including aerial photography, logistics, agriculture, and rescue operations. However, the widespread use of drone technology also introduces significant security risks and regulatory challenges, such as unauthorized incursions into restricted areas, privacy violations, and potential cyber-attacks. As a critical component of counter-UAV systems, drone detection has garnered substantial attention. Traditional detection methods rely on single-modal data, such as visual, audio, radar, and radio frequency signals, but these approaches often provide limited information in complex scenarios. Recent progress in deep learning has significantly improved small object detection, and multi-modal fusion techniques have further enhanced the accuracy and robustness of target recognition. This article reviews the state-of-the-art in drone detection, with a focus on multi-modal fusion research. Additionally, it summarizes relevant evaluation metrics and public datasets, analyzes existing limitations, and suggests future research directions to improve detection precision and resilience.

Drone technology has evolved rapidly, enabling Unmanned Aerial Vehicles to perform tasks that were previously challenging or impossible. However, the proliferation of drone technology necessitates effective detection systems to mitigate security threats. Early detection methods primarily utilized single-modal approaches, which, while effective in controlled environments, often falter in real-world conditions due to factors like occlusion, noise, and environmental variability. The integration of multiple modalities, such as combining visual data with infrared, audio, or radar signals, offers a promising solution to these challenges. This review delves into the evolution of drone detection, from traditional techniques to advanced deep learning-based methods, and emphasizes the transformative potential of multi-modal fusion in enhancing the performance of Unmanned Aerial Vehicle recognition systems.
Single-Modal Drone Detection Methods
Single-modal detection methods form the foundation of early drone recognition systems. These approaches leverage data from one type of sensor, such as cameras, microphones, or radar, to identify and track Unmanned Aerial Vehicles. Despite their simplicity, they face limitations in complex scenarios, prompting the development of more sophisticated techniques.
Early Traditional Drone Detection Methods
Traditional drone detection methods typically follow a three-step process: region selection, feature extraction, and classification or regression. For instance, in visual detection, algorithms like the Histogram of Oriented Gradients (HOG) extract gradient-based features to outline targets, reducing dimensionality and mitigating environmental influences like illumination. However, HOG is sensitive to noise and may not perform well with small objects like drones. In audio-based detection, features such as Mel-Frequency Cepstral Coefficients (MFCCs) or power spectral density are used to analyze the unique acoustic signatures of drones, but these methods struggle in noisy environments. Radar-based detection relies on characteristics like Radar Cross-Section (RCS) and Doppler shift to estimate distance, speed, and trajectory, offering robustness in adverse weather but being susceptible to electromagnetic interference. Similarly, radio frequency (RF) signals provide an alternative by analyzing spectrum and modulation features, enabling detection beyond visual range, though they can be affected by signal clutter and interference.
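As a concrete illustration of this hand-crafted feature extraction stage, the sketch below computes HOG descriptors for a video frame and MFCCs for an audio clip using scikit-image and librosa; the file names and parameter choices are illustrative assumptions rather than settings from any cited system.

```python
# Hand-crafted feature extraction for traditional detection pipelines:
# HOG descriptors for one visual frame and MFCCs for one audio clip,
# which a downstream SVM or boosted classifier would consume.
# File paths and parameter values are illustrative only.
import numpy as np
import librosa
from skimage import io, color
from skimage.feature import hog

def extract_hog_features(image_path: str) -> np.ndarray:
    """Gradient-orientation histogram descriptor for one grayscale frame."""
    img = io.imread(image_path)
    gray = color.rgb2gray(img) if img.ndim == 3 else img
    return hog(
        gray,
        orientations=9,            # number of gradient orientation bins
        pixels_per_cell=(8, 8),    # local cell size
        cells_per_block=(2, 2),    # block normalization mitigates illumination
        feature_vector=True,
    )

def extract_mfcc_features(audio_path: str, n_mfcc: int = 13) -> np.ndarray:
    """Mean MFCC vector summarizing the acoustic signature of a clip."""
    y, sr = librosa.load(audio_path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return mfcc.mean(axis=1)       # average over time frames

if __name__ == "__main__":
    visual_feat = extract_hog_features("frame_0001.png")   # hypothetical files
    audio_feat = extract_mfcc_features("clip_0001.wav")
    print(visual_feat.shape, audio_feat.shape)
```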
The general workflow for traditional radar detection can be summarized as follows: signal preprocessing involves de-interleaving and sorting pulse streams, while main processing focuses on drone detection and identification using extracted features. Despite some success, these methods often lack the adaptability required for dynamic environments, leading to the adoption of deep learning techniques.
Deep Learning-Based Drone Detection Methods
Deep learning has revolutionized drone detection by automating feature extraction and improving accuracy, particularly for small objects like Unmanned Aerial Vehicles. In the visual domain, these methods are commonly categorized into two-stage and one-stage detectors, each with distinct advantages and drawbacks.
Visual Detection of Drones
In visual detection, two-stage algorithms, such as the R-CNN series, first generate region proposals and then classify and regress these regions. For example, Faster R-CNN uses a Region Proposal Network (RPN) to efficiently propose candidate areas, while Mask R-CNN extends this by adding a branch for pixel-level segmentation, improving precision in dynamic scenarios. However, these methods can be computationally intensive, making them less suitable for real-time applications. In contrast, one-stage detectors like YOLO (You Only Look Once) and SSD (Single Shot MultiBox Detector) perform detection in a single pass, offering higher speed at the cost of potential accuracy loss. Recent variants, such as YOLOX and YOLOv8, have enhanced performance through architectural innovations, making them more effective for drone detection in resource-constrained settings.
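To make this concrete, the following sketch applies torchvision's off-the-shelf two-stage Faster R-CNN detector to a single frame; note that the COCO weights used here contain no dedicated drone class, so in practice the detection head would be fine-tuned on a drone dataset, and the image path and score threshold are illustrative.

```python
# Applying a pretrained two-stage detector (Faster R-CNN) to one frame.
# A sketch only: COCO weights lack a dedicated drone class, so in practice
# the head would be fine-tuned on a drone dataset such as DUT Anti-UAV.
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

def detect(image_path: str, score_thresh: float = 0.5):
    img = to_tensor(Image.open(image_path).convert("RGB"))
    with torch.no_grad():
        out = model([img])[0]            # dict with boxes, labels, scores
    keep = out["scores"] >= score_thresh
    return out["boxes"][keep], out["labels"][keep], out["scores"][keep]

boxes, labels, scores = detect("frame_0001.jpg")   # hypothetical frame
print(f"{len(boxes)} detections above threshold")
```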
To illustrate the performance of different models, consider the following table summarizing results on the DUT Anti-UAV dataset with various backbone networks:
| Model | Backbone Network | Accuracy |
|---|---|---|
| Faster R-CNN | ResNet50 | 0.653 |
| Faster R-CNN | ResNet18 | 0.605 |
| Faster R-CNN | VGG16 | 0.633 |
| Cascade R-CNN | ResNet50 | 0.683 |
| Cascade R-CNN | ResNet18 | 0.652 |
| Cascade R-CNN | VGG16 | 0.667 |
| YOLOX | ResNet50 | 0.427 |
| YOLOX | ResNet18 | 0.400 |
| YOLOX | VGG16 | 0.551 |
These results highlight the trade-offs between accuracy and efficiency, with two-stage methods generally achieving higher precision but requiring more computational resources.
Audio-Based Drone Detection
Audio signals provide complementary information for drone recognition, as Unmanned Aerial Vehicles emit distinct sounds during flight. Deep learning models, such as Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs), have been employed to extract high-level features from audio data. For instance, CNNs can process spectrograms to identify drone-specific patterns, while RNNs capture temporal dynamics for improved detection in real-time scenarios. However, audio-based methods are highly sensitive to environmental noise, limiting their effectiveness in urban or noisy settings.
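The sketch below illustrates the spectrogram-plus-CNN approach described above, assuming PyTorch and librosa; the network architecture, input shapes, and file name are illustrative and do not reproduce any of the cited models.

```python
# Spectrogram-based audio classification sketch (drone vs. background).
# Illustrative architecture only; not a reproduction of any cited model.
import librosa
import torch
import torch.nn as nn

def log_mel_spectrogram(path: str, sr: int = 16000, n_mels: int = 64) -> torch.Tensor:
    y, _ = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    logmel = librosa.power_to_db(mel)                      # (n_mels, frames)
    return torch.from_numpy(logmel).float().unsqueeze(0)   # add channel dim

class AudioDroneCNN(nn.Module):
    """Small 2D CNN over log-mel spectrograms with two output classes."""
    def __init__(self, n_classes: int = 2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1),        # handles variable clip lengths
        )
        self.classifier = nn.Linear(32, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x).flatten(1))

model = AudioDroneCNN()
spec = log_mel_spectrogram("clip_0001.wav").unsqueeze(0)   # batch of one
logits = model(spec)
print(logits.shape)  # (1, 2): drone vs. background scores
```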
The general pipeline for audio event detection involves signal acquisition, feature extraction (e.g., using MFCCs or Fourier transforms), and classification via deep learning models. The following table summarizes key studies in audio-based drone detection:
| Reference | Network | Dataset | Results |
|---|---|---|---|
| Kim et al. | CNN | Custom | Accuracy: 61% |
| Jeon et al. | CNN, RNN | DARES-G1, Litis Rouen | F-scores: 0.5232, 0.6415, 0.8009 |
| Qayyum et al. | CNN | DREGON | Superior AUC compared to benchmarks |
| Wang et al. | DNN | AS, AVQ, DREGON | Improved source localization with noise suppression |
| Luo et al. | CNN | Custom | Average recognition accuracy: 97.88% |
These approaches demonstrate the potential of audio modalities, especially when integrated with other sensors.
Radar-Based Drone Detection
Radar systems emit electromagnetic waves and analyze reflected signals to detect moving targets, providing reliable performance in various weather conditions. Deep learning techniques, such as CNNs applied to range-Doppler maps or spectrograms, have enhanced the classification and detection of drones. For example, CNNs can distinguish drones from other objects based on unique radar signatures, even in low signal-to-noise ratio (SNR) environments. However, radar data may suffer from multipath effects and limited resolution, necessitating fusion with other modalities for comprehensive coverage.
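The sketch below shows how a range-Doppler map of the kind fed to such CNNs can be formed from a block of coherent radar pulses with a two-dimensional FFT; the array shapes, windowing, and synthetic target are illustrative assumptions.

```python
# Forming a range-Doppler map from raw radar returns with a 2-D FFT.
# Illustrative shapes: `pulses` holds one coherent processing interval,
# rows = pulses (slow time), columns = range samples (fast time).
import numpy as np

def range_doppler_map(pulses: np.ndarray) -> np.ndarray:
    """pulses: complex array of shape (n_pulses, n_range_bins)."""
    n_pulses, n_bins = pulses.shape
    # Window both dimensions to suppress sidelobes before the FFTs.
    win = np.hanning(n_bins)[None, :] * np.hanning(n_pulses)[:, None]
    # Range FFT over fast time, then Doppler FFT over slow time.
    rd = np.fft.fft(pulses * win, axis=1)
    rd = np.fft.fftshift(np.fft.fft(rd, axis=0), axes=0)
    return 20 * np.log10(np.abs(rd) + 1e-12)   # magnitude in dB

# Synthetic example: a single target with a constant Doppler shift.
n_pulses, n_bins = 128, 256
t = np.arange(n_bins)
pulses = np.exp(2j * np.pi * 0.1 * t)[None, :] * \
         np.exp(2j * np.pi * 0.05 * np.arange(n_pulses))[:, None]
rd_map = range_doppler_map(pulses.astype(np.complex64))
print(rd_map.shape)  # (128, 256) range-Doppler image, ready for a CNN
```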
The table below outlines notable radar-based detection studies:
| Reference | Network | Dataset | Results |
|---|---|---|---|
| El Housseini et al. | CNN | MSTAR | Accuracy: 93% |
| Kim et al. | CNN | Custom | Classification accuracy: 98.89% |
| Wang et al. | DNN | Custom | Better performance at high SNR |
| Raval et al. | ResNet | Custom | F-score: 0.816 at 10 dB SNR |
| Dale et al. | CNN | Custom | Overall accuracy: 98.89% |
| Wang et al. | CNN | Custom | Accuracy: 98.5% at -20 dB SNR |
These advancements underscore the role of radar in all-weather drone detection, though challenges remain in handling small, fast-moving targets.
Challenges and Future Directions in Single-Modal Detection
Single-modal methods face several limitations: visual detection is impaired by poor lighting or weather; audio detection requires quiet environments; and radar may produce false alarms due to interference. The following table compares the characteristics of different single-modal approaches:
| Method | Advantages | Challenges | Applicable Scenarios |
|---|---|---|---|
| Visual | Rich information content | Poor performance in darkness or fog | Daylight or well-lit conditions |
| Audio | Low cost, easy deployment | Susceptible to noise, limited range | Quiet environments with budget constraints |
| Radar | Robust to environmental changes | Lower resolution | Adverse weather or low-light situations |
Future research should focus on lightweight model designs, real-time processing, and the integration of reinforcement learning or self-supervised techniques to enhance adaptability. Multi-modal fusion emerges as a key direction to address these challenges by combining complementary data sources.
Multi-Modal Drone Detection Methods
Multi-modal fusion leverages data from multiple sensors to improve detection accuracy and robustness. Common combinations include RGB and thermal infrared (RGBT), visual and audio, and visual and radar fusion, each offering unique benefits for drone technology applications.
RGBT Fusion
RGBT fusion combines visible light (RGB) and thermal infrared (TIR) images to overcome limitations in varying lighting conditions. While RGB provides detailed visual information, TIR excels in darkness or fog, enabling all-weather detection. Deep learning approaches, such as generative adversarial networks (GANs) for image fusion or attention mechanisms in YOLO-based models, have been developed to enhance feature extraction. For example, modified YOLOv5 architectures with multi-scale differential attention modules have markedly improved precision and mean Average Precision (mAP) on custom RGBT drone data, while cross-modal fusion networks such as LRAF-Net have reached 97.9% mAP at an IoU threshold of 0.5 on the LLVIP dataset. Similarly, adaptive fusion networks assign weights to different modalities, optimizing performance in dynamic environments. These methods demonstrate the synergy between modalities, but they require careful handling of data synchronization and computational complexity.
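The following sketch illustrates the adaptive, modality-weighted fusion idea in its simplest form, assuming spatially aligned RGB and TIR feature maps produced by separate backbones; the gating design is a generic example, not a reconstruction of any specific published architecture.

```python
# Adaptive weighting of RGB and thermal (TIR) feature maps before a shared
# detection head. Illustrative gating design, not a specific published model.
import torch
import torch.nn as nn

class AdaptiveRGBTFusion(nn.Module):
    """Predicts a scalar weight per modality from globally pooled features."""
    def __init__(self, channels: int):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(2 * channels, channels),
            nn.ReLU(),
            nn.Linear(channels, 2),   # one weight per modality
            nn.Softmax(dim=-1),
        )

    def forward(self, rgb_feat: torch.Tensor, tir_feat: torch.Tensor) -> torch.Tensor:
        # rgb_feat, tir_feat: (B, C, H, W) feature maps, spatially aligned.
        pooled = torch.cat(
            [rgb_feat.mean(dim=(2, 3)), tir_feat.mean(dim=(2, 3))], dim=1
        )
        w = self.gate(pooled)                       # (B, 2), sums to 1
        w_rgb = w[:, 0].view(-1, 1, 1, 1)
        w_tir = w[:, 1].view(-1, 1, 1, 1)
        return w_rgb * rgb_feat + w_tir * tir_feat  # fused map for the head

fusion = AdaptiveRGBTFusion(channels=256)
rgb = torch.randn(2, 256, 40, 40)
tir = torch.randn(2, 256, 40, 40)
print(fusion(rgb, tir).shape)  # (2, 256, 40, 40)
```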
Visual and Audio Fusion
Integrating visual and audio data capitalizes on the spatial information from images and the temporal dynamics from sound, leading to more reliable drone identification. Systems like Deeplomatics use microphone arrays and cameras with deep learning models, such as BeamLearning networks, to achieve real-time localization and tracking with high accuracy (e.g., a 3D angular localization error below 7 degrees and detection precision over 90%). Other approaches employ CRNNs for audio feature extraction and ResNet50 for visual analysis, fused via cross-attention mechanisms to improve classification. In ideal conditions, such fusion can achieve accuracy rates up to 99.6%, as demonstrated on the MMAUD dataset. However, audio-visual fusion is vulnerable to ambient noise and requires robust algorithms to handle data disparities.
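A minimal sketch of cross-attention fusion between audio and visual embeddings is given below, assuming sequence-shaped features such as per-frame CRNN outputs and flattened ResNet50 spatial features; all dimensions and the classifier head are illustrative.

```python
# Cross-attention fusion of audio and visual embeddings. Illustrative
# dimensions; visual tokens attend to audio time steps (query = visual).
import torch
import torch.nn as nn

class AudioVisualCrossAttention(nn.Module):
    def __init__(self, vis_dim: int = 2048, aud_dim: int = 256,
                 d_model: int = 256, n_heads: int = 4, n_classes: int = 2):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, d_model)
        self.aud_proj = nn.Linear(aud_dim, d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.classifier = nn.Linear(d_model, n_classes)

    def forward(self, vis_tokens: torch.Tensor, aud_frames: torch.Tensor) -> torch.Tensor:
        # vis_tokens: (B, Nv, vis_dim), e.g. flattened ResNet50 spatial features
        # aud_frames: (B, Na, aud_dim), e.g. CRNN outputs per audio frame
        q = self.vis_proj(vis_tokens)
        kv = self.aud_proj(aud_frames)
        fused, _ = self.attn(q, kv, kv)            # visual queries attend to audio
        return self.classifier(fused.mean(dim=1))  # pooled joint representation

model = AudioVisualCrossAttention()
logits = model(torch.randn(2, 49, 2048), torch.randn(2, 100, 256))
print(logits.shape)  # (2, 2)
```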
Visual and Radar Fusion
Combining visual and radar data merges high-resolution imagery with precise motion tracking, enhancing detection in complex scenarios. Deep semantic association networks, for instance, correlate image detections with radar points to improve tracking and prediction for small Unmanned Aerial Vehicles. In the UG2+ challenge, top-performing methods used LSTM networks and multi-target tracking with radar-camera fusion, achieving a validation accuracy of 0.9998 and recall of 0.9184 on the MMAUD dataset. Lightweight CNN approaches have also been proposed, reducing computational load while maintaining high detection rates. These systems are particularly valuable for long-range or all-weather operations, though they involve higher integration costs and complexity.
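The sketch below illustrates a basic radar-camera association step of the kind such systems build on: radar points are projected through the camera model and matched to the nearest detection box; the projection matrix and gating distance are illustrative placeholders for calibrated values.

```python
# Associating radar point detections with image bounding boxes.
# The projection matrix (camera intrinsics/extrinsics) and the gating
# distance are illustrative; real systems calibrate these per sensor rig.
import numpy as np

def project_radar_points(points_xyz: np.ndarray, P: np.ndarray) -> np.ndarray:
    """Project Nx3 radar points into pixel coordinates with a 3x4 matrix P."""
    homo = np.hstack([points_xyz, np.ones((len(points_xyz), 1))])
    uvw = homo @ P.T
    return uvw[:, :2] / uvw[:, 2:3]          # (N, 2) pixel coordinates

def associate(radar_uv: np.ndarray, boxes: np.ndarray, max_dist: float = 50.0):
    """Greedy nearest-center matching between radar points and boxes.

    boxes: (M, 4) as [x1, y1, x2, y2]. Returns a list of (radar_idx, box_idx).
    """
    centers = np.stack([(boxes[:, 0] + boxes[:, 2]) / 2,
                        (boxes[:, 1] + boxes[:, 3]) / 2], axis=1)
    matches = []
    for i, uv in enumerate(radar_uv):
        d = np.linalg.norm(centers - uv, axis=1)
        j = int(np.argmin(d))
        if d[j] <= max_dist:                  # gate out implausible pairs
            matches.append((i, j))
    return matches

# Toy example: radar frame assumed coincident with the camera frame.
K = np.array([[800.0, 0.0, 640.0],
              [0.0, 800.0, 360.0],
              [0.0, 0.0, 1.0]])
P = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
radar_uv = project_radar_points(np.array([[1.0, 0.5, 20.0], [-2.0, 0.0, 35.0]]), P)
boxes = np.array([[650, 370, 690, 400], [580, 340, 620, 370]], dtype=float)
print(associate(radar_uv, boxes))   # [(0, 0), (1, 1)]
```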
Other Modal Fusion Methods
Beyond the common combinations, other modalities like RF signals and 3D coordinates offer additional perspectives. RF-based detection analyzes communication signals for drone identification, often integrated with acoustic and visual data in ensemble deep learning frameworks. For 3D localization, systems incorporate laser rangefinders or depth sensors to map 2D images to 3D space, achieving errors below 2.5 meters. Radar and audio fusion has been explored in low-cost setups, using multilayer perceptron (MLP) classifiers for detection within limited ranges (e.g., 50 meters). Comprehensive multi-sensor networks, including infrared, acoustic arrays, and lidar, provide modular solutions for urban environments, with average localization errors around 6 meters. These diverse approaches highlight the flexibility of multi-modal systems in addressing specific operational needs.
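As a simple illustration of the 2D-to-3D mapping mentioned above, the sketch below back-projects an image detection to a 3D position using camera intrinsics and a line-of-sight range measurement; the intrinsic parameters and measured values are illustrative.

```python
# Lifting a 2-D image detection to a 3-D position using camera intrinsics
# and a range measurement (e.g. from a laser rangefinder). Intrinsic values
# below are illustrative; a real system would use calibrated parameters.
import numpy as np

def pixel_to_3d(u: float, v: float, range_m: float, K: np.ndarray) -> np.ndarray:
    """Back-project pixel (u, v) to a 3-D point at the measured range.

    K is the 3x3 intrinsic matrix; the result is in the camera frame.
    """
    ray = np.linalg.inv(K) @ np.array([u, v, 1.0])   # ray direction at z = 1
    ray /= np.linalg.norm(ray)                       # unit-length bearing
    return ray * range_m                             # scale by measured range

K = np.array([[1000.0, 0.0, 960.0],
              [0.0, 1000.0, 540.0],
              [0.0, 0.0, 1.0]])
target = pixel_to_3d(u=1010.0, v=500.0, range_m=42.0, K=K)
print(target)   # approximate (x, y, z) of the drone in the camera frame
```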
The table below summarizes the performance of various multi-modal algorithms:
| Modality | Reference | Neural Network | Dataset | Results |
|---|---|---|---|---|
| RGBT | Li et al. | Improved YOLOv5 | Custom | Precision: 96.2%, mAP: 96.3% |
| RGBT | Guo et al. | LRAF-Net | LLVIP | mAP: 97.9% |
| RGBT | Wang et al. | MDConvNNet | Anti-UAV | AP@[0.5:0.95]: 66.1% |
| RGBT | Yu et al. | GhostFusion_Net | Anti-UAV | Average accuracy: 60% |
| RGBT | Han et al. | MAFSnet | Anti-UAV | AP@0.5: 83% |
| Visual-Audio | Liu et al. | SVM | Custom | Accuracy: 95.74% |
| Visual-Audio | Bavu et al. | YOLO | Custom | Detection precision >90% |
| Visual-Audio | Yang et al. | CRNN, ResNet50 | MMAUD | Accuracy: 99.6% |
| Visual-Radar | Zhang et al. | Faster R-CNN | Custom | Accuracy: >90% |
| Visual-Radar | Huang et al. | YOLOv5 | Custom | Accuracy: 97.7%, mAP: 94.8% |
| Visual-Radar | Deng et al. | LSTM | MMAUD | Recall: 0.9184 |
| Visual-Radar | Wang et al. | ResNet50 | nuScenes | Accuracy improvement: 10.51% |
This comparison illustrates the effectiveness of multi-modal fusion in enhancing drone detection across different scenarios.
Performance Evaluation Metrics and Datasets
Evaluating drone detection systems requires standardized metrics and diverse datasets to ensure reliable performance assessment. Common metrics include accuracy, precision, recall, and F1-score, which are defined as follows:
Accuracy measures the proportion of correct predictions among all samples:
$$ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} $$
where \( TP \) is true positives, \( TN \) is true negatives, \( FP \) is false positives, and \( FN \) is false negatives.
Precision quantifies the ratio of correctly identified drones to all positive predictions:
$$ \text{Precision} = \frac{TP}{TP + FP} $$
Recall, or sensitivity, indicates the ratio of correct positive predictions to all actual positives:
$$ \text{Recall} = \frac{TP}{TP + FN} $$
F1-score provides a balanced measure by combining precision and recall:
$$ \text{F1-score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} $$
These metrics are essential for comparing models, especially in scenarios involving small Unmanned Aerial Vehicles where false alarms and misses are critical.
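These definitions translate directly into code; the helper below computes all four metrics from raw confusion-matrix counts, with the example values chosen purely for illustration.

```python
# Computing the evaluation metrics above directly from confusion-matrix counts.
def detection_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Example: 90 drones found, 5 false alarms, 10 drones missed, 895 true rejections.
print(detection_metrics(tp=90, tn=895, fp=5, fn=10))
```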
Public datasets play a vital role in benchmarking drone detection algorithms. The table below summarizes key datasets, their modalities, and tasks:
| Dataset | Modalities | Labels | Tasks |
|---|---|---|---|
| MAV-VID | RGB | Manual | Tracking, Detection |
| Drone-vs-Bird | RGB | Manual | Detection, Tracking, Classification |
| Anti-UAV | RGB, TIR | Manual | Detection, Tracking |
| Svanström | RGB, TIR, Audio | Manual | Detection |
| UAV Audio Dataset | Audio | Automatic | Detection |
| DUT Anti-UAV | RGB | Manual | Detection, Tracking |
| Anti-UAV410 | TIR | Manual | Detection, Tracking |
| Zhang et al. | RGB, Audio, RF | Manual | Detection |
| MMAUD | RGB, Audio, Radar, Lidar | Automatic | Detection, Tracking, Classification |
These datasets facilitate the development and validation of multi-modal systems, though challenges such as data scarcity and diversity persist.
Problems and Challenges in Multi-Modal Methods
Despite the advantages, multi-modal drone detection faces several challenges. First, the significant scale variation of Unmanned Aerial Vehicles—from large pixels at close range to mere dots at a distance—demands adaptive resolution techniques or dynamic network architectures to maintain detection accuracy. Second, complex scenes with occlusions or cluttered backgrounds (e.g., drones near birds or other aircraft) require advanced context understanding algorithms, such as graph neural networks, to improve segmentation and identification. Third, real-time processing is crucial for applications like anti-drone surveillance, but multi-modal data fusion often increases computational load, leading to latency issues; optimizing model efficiency through lightweight designs and hardware acceleration is essential. Fourth, the scarcity and lack of diversity in public datasets limit model generalization; expanding datasets with varied scenarios and employing data augmentation, semi-supervised learning, or transfer learning can mitigate this. Finally, modality differences—such as disparate data characteristics, sampling rates, and noise levels—complicate fusion; developing robust synchronization methods and leveraging complementary information while minimizing redundancy are key research areas. Addressing these challenges will enhance the practicality and reliability of multi-modal systems in real-world drone technology applications.
Conclusion
The rapid evolution of drone technology has made Unmanned Aerial Vehicle detection a critical area of research, with deep learning-based methods significantly advancing small object recognition. However, single-modal approaches often fall short in complex environments due to limitations in feature representation and environmental adaptability. Multi-modal fusion, by integrating data from visual, infrared, audio, radar, and other sensors, offers a promising solution to improve accuracy and robustness. This review has outlined the progression from traditional to modern detection techniques, highlighted the benefits of multi-modal integration, and discussed evaluation metrics and datasets. Challenges such as scale variability, computational efficiency, and data heterogeneity remain, but future directions—including lightweight model design, real-time processing, and adaptive learning algorithms—hold great potential. By addressing these issues, multi-modal drone detection systems can achieve higher performance, enabling safer and more effective applications in security, transportation, and beyond, ultimately supporting the responsible advancement of drone technology.
