With the rapid advancement of drone technology, Unmanned Aerial Vehicles (UAVs) have become increasingly prevalent in various applications, including aerial photography, logistics, agriculture, and rescue operations. However, the widespread use of drone technology also introduces significant security risks and regulatory challenges, such as unauthorized incursions into restricted areas, privacy violations, and potential cyber-attacks. As a critical component of counter-UAV systems, drone detection has garnered substantial attention. Traditional detection methods rely on single-modal data, such as visual, audio, radar, and radio frequency signals, but these approaches often provide limited information in complex scenarios. Recent progress in deep learning has significantly improved small object detection, and multi-modal fusion techniques have further enhanced the accuracy and robustness of target recognition. This article reviews the state-of-the-art in drone detection, with a focus on multi-modal fusion research. Additionally, it summarizes relevant evaluation metrics and public datasets, analyzes existing limitations, and suggests future research directions to improve detection precision and resilience.

Drone technology has evolved rapidly, enabling Unmanned Aerial Vehicles to perform tasks that were previously challenging or impossible. However, the proliferation of drone technology necessitates effective detection systems to mitigate security threats. Early detection methods primarily utilized single-modal approaches, which, while effective in controlled environments, often falter in real-world conditions due to factors like occlusion, noise, and environmental variability. The integration of multiple modalities, such as combining visual data with infrared, audio, or radar signals, offers a promising solution to these challenges. This review delves into the evolution of drone detection, from traditional techniques to advanced deep learning-based methods, and emphasizes the transformative potential of multi-modal fusion in enhancing the performance of Unmanned Aerial Vehicle recognition systems.
Single-Modal Drone Detection Methods
Single-modal detection methods form the foundation of early drone recognition systems. These approaches leverage data from one type of sensor, such as cameras, microphones, or radar, to identify and track Unmanned Aerial Vehicles. Despite their simplicity, they face limitations in complex scenarios, prompting the development of more sophisticated techniques.
Early Traditional Drone Detection Methods
Traditional drone detection methods typically follow a three-step process: region selection, feature extraction, and classification or regression. For instance, in visual detection, algorithms like the Histogram of Oriented Gradients (HOG) extract gradient-based features to outline targets, reducing dimensionality and mitigating environmental influences like illumination. However, HOG is sensitive to noise and may not perform well with small objects like drones. In audio-based detection, features such as Mel-Frequency Cepstral Coefficients (MFCCs) or power spectral density are used to analyze the unique acoustic signatures of drones, but these methods struggle in noisy environments. Radar-based detection relies on characteristics like Radar Cross-Section (RCS) and Doppler shift to estimate distance, speed, and trajectory, offering robustness in adverse weather but being susceptible to electromagnetic interference. Similarly, radio frequency (RF) signals provide an alternative by analyzing spectrum and modulation features, enabling detection beyond visual range, though they can be affected by signal clutter and interference.
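As a concrete illustration of this hand-crafted feature extraction stage, the sketch below computes HOG descriptors for a video frame and MFCCs for an audio clip using scikit-image and librosa; the file names and parameter choices are illustrative assumptions rather than settings from any cited system.

```python
# Hand-crafted feature extraction for traditional detection pipelines:
# HOG descriptors for one visual frame and MFCCs for one audio clip,
# which a downstream SVM or boosted classifier would consume.
# File paths and parameter values are illustrative only.
import numpy as np
import librosa
from skimage import io, color
from skimage.feature import hog

def extract_hog_features(image_path: str) -> np.ndarray:
    """Gradient-orientation histogram descriptor for one grayscale frame."""
    img = io.imread(image_path)
    gray = color.rgb2gray(img) if img.ndim == 3 else img
    return hog(
        gray,
        orientations=9,            # number of gradient orientation bins
        pixels_per_cell=(8, 8),    # local cell size
        cells_per_block=(2, 2),    # block normalization mitigates illumination
        feature_vector=True,
    )

def extract_mfcc_features(audio_path: str, n_mfcc: int = 13) -> np.ndarray:
    """Mean MFCC vector summarizing the acoustic signature of a clip."""
    y, sr = librosa.load(audio_path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return mfcc.mean(axis=1)       # average over time frames

if __name__ == "__main__":
    visual_feat = extract_hog_features("frame_0001.png")   # hypothetical files
    audio_feat = extract_mfcc_features("clip_0001.wav")
    print(visual_feat.shape, audio_feat.shape)
```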
The general workflow for traditional radar detection can be summarized as follows: signal preprocessing involves de-interleaving and sorting pulse streams, while main processing focuses on drone detection and identification using extracted features. Despite some success, these methods often lack the adaptability required for dynamic environments, leading to the adoption of deep learning techniques.
Deep Learning-Based Drone Detection Methods
Deep learning has revolutionized drone detection by automating feature extraction and improving accuracy, particularly for small objects like Unmanned Aerial Vehicles. In the visual domain, these methods are commonly categorized into two-stage and one-stage detectors, each with distinct advantages and drawbacks.
Visual Detection of Drones
In visual detection, two-stage algorithms, such as the R-CNN series, first generate region proposals and then classify and regress these regions. For example, Faster R-CNN uses a Region Proposal Network (RPN) to efficiently propose candidate areas, while Mask R-CNN extends this by adding a branch for pixel-level segmentation, improving precision in dynamic scenarios. However, these methods can be computationally intensive, making them less suitable for real-time applications. In contrast, one-stage detectors like YOLO (You Only Look Once) and SSD (Single Shot MultiBox Detector) perform detection in a single pass, offering higher speed at the cost of potential accuracy loss. Recent variants, such as YOLOX and YOLOv8, have enhanced performance through architectural innovations, making them more effective for drone detection in resource-constrained settings.
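To make this concrete, the following sketch applies torchvision's off-the-shelf two-stage Faster R-CNN detector to a single frame; note that the COCO weights used here contain no dedicated drone class, so in practice the detection head would be fine-tuned on a drone dataset, and the image path and score threshold are illustrative.

```python
# Applying a pretrained two-stage detector (Faster R-CNN) to one frame.
# A sketch only: COCO weights lack a dedicated drone class, so in practice
# the head would be fine-tuned on a drone dataset such as DUT Anti-UAV.
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

def detect(image_path: str, score_thresh: float = 0.5):
    img = to_tensor(Image.open(image_path).convert("RGB"))
    with torch.no_grad():
        out = model([img])[0]            # dict with boxes, labels, scores
    keep = out["scores"] >= score_thresh
    return out["boxes"][keep], out["labels"][keep], out["scores"][keep]

boxes, labels, scores = detect("frame_0001.jpg")   # hypothetical frame
print(f"{len(boxes)} detections above threshold")
```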
To illustrate the performance of different models, consider the following table summarizing results on the DUT Anti-UAV dataset with various backbone networks:
| Model | Backbone Network | Accuracy |
|---|---|---|
| Faster R-CNN | ResNet50 | 0.653 |
| Faster R-CNN | ResNet18 | 0.605 |
| Faster R-CNN | VGG16 | 0.633 |
| Cascade R-CNN | ResNet50 | 0.683 |
| Cascade R-CNN | ResNet18 | 0.652 |
| Cascade R-CNN | VGG16 | 0.667 |
| YOLOX | ResNet50 | 0.427 |
| YOLOX | ResNet18 | 0.400 |
| YOLOX | VGG16 | 0.551 |
These results highlight the trade-offs between accuracy and efficiency, with two-stage methods generally achieving higher precision but requiring more computational resources.
Audio-Based Drone Detection
Audio signals provide complementary information for drone recognition, as Unmanned Aerial Vehicles emit distinct sounds during flight. Deep learning models, such as Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs), have been employed to extract high-level features from audio data. For instance, CNNs can process spectrograms to identify drone-specific patterns, while RNNs capture temporal dynamics for improved detection in real-time scenarios. However, audio-based methods are highly sensitive to environmental noise, limiting their effectiveness in urban or noisy settings.
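The sketch below illustrates the spectrogram-plus-CNN approach described above, assuming PyTorch and librosa; the network architecture, input shapes, and file name are illustrative and do not reproduce any of the cited models.

```python
# Spectrogram-based audio classification sketch (drone vs. background).
# Illustrative architecture only; not a reproduction of any cited model.
import librosa
import torch
import torch.nn as nn

def log_mel_spectrogram(path: str, sr: int = 16000, n_mels: int = 64) -> torch.Tensor:
    y, _ = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    logmel = librosa.power_to_db(mel)                      # (n_mels, frames)
    return torch.from_numpy(logmel).float().unsqueeze(0)   # add channel dim

class AudioDroneCNN(nn.Module):
    """Small 2D CNN over log-mel spectrograms with two output classes."""
    def __init__(self, n_classes: int = 2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1),        # handles variable clip lengths
        )
        self.classifier = nn.Linear(32, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x).flatten(1))

model = AudioDroneCNN()
spec = log_mel_spectrogram("clip_0001.wav").unsqueeze(0)   # batch of one
logits = model(spec)
print(logits.shape)  # (1, 2): drone vs. background scores
```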
The general pipeline for audio event detection involves signal acquisition, feature extraction (e.g., using MFCCs or Fourier transforms), and classification via deep learning models. The following table summarizes key studies in audio-based drone detection:
| Reference | Network | Dataset | Results |
|---|---|---|---|
| Kim et al. | CNN | Custom | Accuracy: 61% |
| Jeon et al. | CNN, RNN | DARES-G1, Litis Rouen | F-scores: 0.5232, 0.6415, 0.8009 |
| Qayyum et al. | CNN | DREGON | Superior AUC compared to benchmarks |
| Wang et al. | DNN | AS, AVQ, DREGON | Improved source localization with noise suppression |
| Luo et al. | CNN | Custom | Average recognition accuracy: 97.88% |
These approaches demonstrate the potential of audio modalities, especially when integrated with other sensors.
Radar-Based Drone Detection
Radar systems emit electromagnetic waves and analyze reflected signals to detect moving targets, providing reliable performance in various weather conditions. Deep learning techniques, such as CNNs applied to range-Doppler maps or spectrograms, have enhanced the classification and detection of drones. For example, CNNs can distinguish drones from other objects based on unique radar signatures, even in low signal-to-noise ratio (SNR) environments. However, radar data may suffer from multipath effects and limited resolution, necessitating fusion with other modalities for comprehensive coverage.
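The sketch below shows how a range-Doppler map of the kind fed to such CNNs can be formed from a block of coherent radar pulses with a two-dimensional FFT; the array shapes, windowing, and synthetic target are illustrative assumptions.

```python
# Forming a range-Doppler map from raw radar returns with a 2-D FFT.
# Illustrative shapes: `pulses` holds one coherent processing interval,
# rows = pulses (slow time), columns = range samples (fast time).
import numpy as np

def range_doppler_map(pulses: np.ndarray) -> np.ndarray:
    """pulses: complex array of shape (n_pulses, n_range_bins)."""
    n_pulses, n_bins = pulses.shape
    # Window both dimensions to suppress sidelobes before the FFTs.
    win = np.hanning(n_bins)[None, :] * np.hanning(n_pulses)[:, None]
    # Range FFT over fast time, then Doppler FFT over slow time.
    rd = np.fft.fft(pulses * win, axis=1)
    rd = np.fft.fftshift(np.fft.fft(rd, axis=0), axes=0)
    return 20 * np.log10(np.abs(rd) + 1e-12)   # magnitude in dB

# Synthetic example: a single target with a constant Doppler shift.
n_pulses, n_bins = 128, 256
t = np.arange(n_bins)
pulses = np.exp(2j * np.pi * 0.1 * t)[None, :] * \
         np.exp(2j * np.pi * 0.05 * np.arange(n_pulses))[:, None]
rd_map = range_doppler_map(pulses.astype(np.complex64))
print(rd_map.shape)  # (128, 256) range-Doppler image, ready for a CNN
```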
The table below outlines notable radar-based detection studies:
| Reference | Network | Dataset | Results |
|---|---|---|---|
| El Housseini et al. | CNN | MSTAR | Accuracy: 93% |
| Kim et al. | CNN | Custom | Classification accuracy: 98.89% |
| Wang et al. | DNN | Custom | Better performance at high SNR |
| Raval et al. | ResNet | Custom | F-score: 0.816 at 10 dB SNR |
| Dale et al. | CNN | Custom | Overall accuracy: 98.89% |
| Wang et al. | CNN | Custom | Accuracy: 98.5% at -20 dB SNR |
These advancements underscore the role of radar in all-weather drone detection, though challenges remain in handling small, fast-moving targets.
Challenges and Future Directions in Single-Modal Detection
Single-modal methods face several limitations: visual detection is impaired by poor lighting or weather; audio detection requires quiet environments; and radar may produce false alarms due to interference. The following table compares the characteristics of different single-modal approaches:
| Method | Advantages | Challenges | Applicable Scenarios |
|---|---|---|---|
| Visual | Rich information content | Poor performance in darkness or fog | Daylight or well-lit conditions |
| Audio | Low cost, easy deployment | Susceptible to noise, limited range | Quiet environments with budget constraints |
| Radar | Robust to environmental changes | Lower resolution | Adverse weather or low-light situations |
Future research should focus on lightweight model designs, real-time processing, and the integration of reinforcement learning or self-supervised techniques to enhance adaptability. Multi-modal fusion emerges as a key direction to address these challenges by combining complementary data sources.
Multi-Modal Drone Detection Methods
Multi-modal fusion leverages data from multiple sensors to improve detection accuracy and robustness. Common combinations include RGB and thermal infrared (RGBT), visual and audio, and visual and radar fusion, each offering unique benefits for drone technology applications.
RGBT Fusion
RGBT fusion combines visible light (RGB) and thermal infrared (TIR) images to overcome limitations in varying lighting conditions. While RGB provides detailed visual information, TIR excels in darkness or fog, enabling all-weather detection. Deep learning approaches, such as generative adversarial networks (GANs) for image fusion or attention mechanisms in YOLO-based models, have been developed to enhance feature extraction. For example, modified YOLOv5 architectures with multi-scale differential attention modules have markedly improved precision and mean Average Precision (mAP) on custom RGBT drone data, while cross-modal fusion networks such as LRAF-Net have reached 97.9% mAP at an IoU threshold of 0.5 on the LLVIP dataset. Similarly, adaptive fusion networks assign weights to different modalities, optimizing performance in dynamic environments. These methods demonstrate the synergy between modalities, but they require careful handling of data synchronization and computational complexity.
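The following sketch illustrates the adaptive, modality-weighted fusion idea in its simplest form, assuming spatially aligned RGB and TIR feature maps produced by separate backbones; the gating design is a generic example, not a reconstruction of any specific published architecture.

```python
# Adaptive weighting of RGB and thermal (TIR) feature maps before a shared
# detection head. Illustrative gating design, not a specific published model.
import torch
import torch.nn as nn

class AdaptiveRGBTFusion(nn.Module):
    """Predicts a scalar weight per modality from globally pooled features."""
    def __init__(self, channels: int):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(2 * channels, channels),
            nn.ReLU(),
            nn.Linear(channels, 2),   # one weight per modality
            nn.Softmax(dim=-1),
        )

    def forward(self, rgb_feat: torch.Tensor, tir_feat: torch.Tensor) -> torch.Tensor:
        # rgb_feat, tir_feat: (B, C, H, W) feature maps, spatially aligned.
        pooled = torch.cat(
            [rgb_feat.mean(dim=(2, 3)), tir_feat.mean(dim=(2, 3))], dim=1
        )
        w = self.gate(pooled)                       # (B, 2), sums to 1
        w_rgb = w[:, 0].view(-1, 1, 1, 1)
        w_tir = w[:, 1].view(-1, 1, 1, 1)
        return w_rgb * rgb_feat + w_tir * tir_feat  # fused map for the head

fusion = AdaptiveRGBTFusion(channels=256)
rgb = torch.randn(2, 256, 40, 40)
tir = torch.randn(2, 256, 40, 40)
print(fusion(rgb, tir).shape)  # (2, 256, 40, 40)
```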
Visual and Audio Fusion
Integrating visual and audio data capitalizes on the spatial information from images and the temporal dynamics from sound, leading to more reliable drone identification. Systems like Deeplomatics use microphone arrays and cameras with deep learning models, such as BeamLearning networks, to achieve real-time localization and tracking with high accuracy (e.g., a 3D angular localization error below 7 degrees and detection precision over 90%). Other approaches employ CRNNs for audio feature extraction and ResNet50 for visual analysis, fused via cross-attention mechanisms to improve classification. In ideal conditions, such fusion can achieve accuracy rates up to 99.6%, as demonstrated on the MMAUD dataset. However, audio-visual fusion is vulnerable to ambient noise and requires robust algorithms to handle data disparities.
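A minimal sketch of cross-attention fusion between audio and visual embeddings is given below, assuming sequence-shaped features such as per-frame CRNN outputs and flattened ResNet50 spatial features; all dimensions and the classifier head are illustrative.

```python
# Cross-attention fusion of audio and visual embeddings. Illustrative
# dimensions; visual tokens attend to audio time steps (query = visual).
import torch
import torch.nn as nn

class AudioVisualCrossAttention(nn.Module):
    def __init__(self, vis_dim: int = 2048, aud_dim: int = 256,
                 d_model: int = 256, n_heads: int = 4, n_classes: int = 2):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, d_model)
        self.aud_proj = nn.Linear(aud_dim, d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.classifier = nn.Linear(d_model, n_classes)

    def forward(self, vis_tokens: torch.Tensor, aud_frames: torch.Tensor) -> torch.Tensor:
        # vis_tokens: (B, Nv, vis_dim), e.g. flattened ResNet50 spatial features
        # aud_frames: (B, Na, aud_dim), e.g. CRNN outputs per audio frame
        q = self.vis_proj(vis_tokens)
        kv = self.aud_proj(aud_frames)
        fused, _ = self.attn(q, kv, kv)            # visual queries attend to audio
        return self.classifier(fused.mean(dim=1))  # pooled joint representation

model = AudioVisualCrossAttention()
logits = model(torch.randn(2, 49, 2048), torch.randn(2, 100, 256))
print(logits.shape)  # (2, 2)
```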
Visual and Radar Fusion
Combining visual and radar data merges high-resolution imagery with precise motion tracking, enhancing detection in complex scenarios. Deep semantic association networks, for instance, correlate image detections with radar points to improve tracking and prediction for small Unmanned Aerial Vehicles. In the UG2+ challenge, top-performing methods used LSTM networks and multi-target tracking with radar-camera fusion, achieving a validation accuracy of 0.9998 and recall of 0.9184 on the MMAUD dataset. Lightweight CNN approaches have also been proposed, reducing computational load while maintaining high detection rates. These systems are particularly valuable for long-range or all-weather operations, though they involve higher integration costs and complexity.
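The sketch below illustrates a basic radar-camera association step of the kind such systems build on: radar points are projected through the camera model and matched to the nearest detection box; the projection matrix and gating distance are illustrative placeholders for calibrated values.

```python
# Associating radar point detections with image bounding boxes.
# The projection matrix (camera intrinsics/extrinsics) and the gating
# distance are illustrative; real systems calibrate these per sensor rig.
import numpy as np

def project_radar_points(points_xyz: np.ndarray, P: np.ndarray) -> np.ndarray:
    """Project Nx3 radar points into pixel coordinates with a 3x4 matrix P."""
    homo = np.hstack([points_xyz, np.ones((len(points_xyz), 1))])
    uvw = homo @ P.T
    return uvw[:, :2] / uvw[:, 2:3]          # (N, 2) pixel coordinates

def associate(radar_uv: np.ndarray, boxes: np.ndarray, max_dist: float = 50.0):
    """Greedy nearest-center matching between radar points and boxes.

    boxes: (M, 4) as [x1, y1, x2, y2]. Returns a list of (radar_idx, box_idx).
    """
    centers = np.stack([(boxes[:, 0] + boxes[:, 2]) / 2,
                        (boxes[:, 1] + boxes[:, 3]) / 2], axis=1)
    matches = []
    for i, uv in enumerate(radar_uv):
        d = np.linalg.norm(centers - uv, axis=1)
        j = int(np.argmin(d))
        if d[j] <= max_dist:                  # gate out implausible pairs
            matches.append((i, j))
    return matches

# Toy example: radar frame assumed coincident with the camera frame.
K = np.array([[800.0, 0.0, 640.0],
              [0.0, 800.0, 360.0],
              [0.0, 0.0, 1.0]])
P = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
radar_uv = project_radar_points(np.array([[1.0, 0.5, 20.0], [-2.0, 0.0, 35.0]]), P)
boxes = np.array([[650, 370, 690, 400], [580, 340, 620, 370]], dtype=float)
print(associate(radar_uv, boxes))   # [(0, 0), (1, 1)]
```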
Other Modal Fusion Methods
Beyond the common combinations, other modalities like RF signals and 3D coordinates offer additional perspectives. RF-based detection analyzes communication signals for drone identification, often integrated with acoustic and visual data in ensemble deep learning frameworks. For 3D localization, systems incorporate laser rangefinders or depth sensors to map 2D images to 3D space, achieving errors below 2.5 meters. Radar and audio fusion has been explored in low-cost setups, using multilayer perceptron (MLP) classifiers for detection within limited ranges (e.g., 50 meters). Comprehensive multi-sensor networks, including infrared, acoustic arrays, and lidar, provide modular solutions for urban environments, with average localization errors around 6 meters. These diverse approaches highlight the flexibility of multi-modal systems in addressing specific operational needs.
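As a simple illustration of the 2D-to-3D mapping mentioned above, the sketch below back-projects an image detection to a 3D position using camera intrinsics and a line-of-sight range measurement; the intrinsic parameters and measured values are illustrative.

```python
# Lifting a 2-D image detection to a 3-D position using camera intrinsics
# and a range measurement (e.g. from a laser rangefinder). Intrinsic values
# below are illustrative; a real system would use calibrated parameters.
import numpy as np

def pixel_to_3d(u: float, v: float, range_m: float, K: np.ndarray) -> np.ndarray:
    """Back-project pixel (u, v) to a 3-D point at the measured range.

    K is the 3x3 intrinsic matrix; the result is in the camera frame.
    """
    ray = np.linalg.inv(K) @ np.array([u, v, 1.0])   # ray direction at z = 1
    ray /= np.linalg.norm(ray)                       # unit-length bearing
    return ray * range_m                             # scale by measured range

K = np.array([[1000.0, 0.0, 960.0],
              [0.0, 1000.0, 540.0],
              [0.0, 0.0, 1.0]])
target = pixel_to_3d(u=1010.0, v=500.0, range_m=42.0, K=K)
print(target)   # approximate (x, y, z) of the drone in the camera frame
```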
The table below summarizes the performance of various multi-modal algorithms:
| Modality | Reference | Neural Network | Dataset | Results |
|---|---|---|---|---|
| RGBT | Li et al. | Improved YOLOv5 | Custom | Precision: 96.2%, mAP: 96.3% |
| RGBT | Guo et al. | LRAF-Net | LLVIP | mAP: 97.9% |
| RGBT | Wang et al. | MDConvNNet | Anti-UAV | AP@[0.5:0.95]: 66.1% |
| RGBT | Yu et al. | GhostFusion_Net | Anti-UAV | Average accuracy: 60% |
| RGBT | Han et al. | MAFSnet | Anti-UAV | AP@0.5: 83% |
| Visual-Audio | Liu et al. | SVM | Custom | Accuracy: 95.74% |
| Visual-Audio | Bavu et al. | YOLO | Custom | Detection precision >90% |
| Visual-Audio | Yang et al. | CRNN, ResNet50 | MMAUD | Accuracy: 99.6% |
| Visual-Radar | Zhang et al. | Faster R-CNN | Custom | Accuracy: >90% |
| Visual-Radar | Huang et al. | YOLOv5 | Custom | Accuracy: 97.7%, mAP: 94.8% |
| Visual-Radar | Deng et al. | LSTM | MMAUD | Recall: 0.9184 |
| Visual-Radar | Wang et al. | ResNet50 | nuScenes | Accuracy improvement: 10.51% |
This comparison illustrates the effectiveness of multi-modal fusion in enhancing drone detection across different scenarios.
Performance Evaluation Metrics and Datasets
Evaluating drone detection systems requires standardized metrics and diverse datasets to ensure reliable performance assessment. Common metrics include accuracy, precision, recall, and F1-score, which are defined as follows:
Accuracy measures the proportion of correct predictions among all samples:
$$ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} $$
where \( TP \) is true positives, \( TN \) is true negatives, \( FP \) is false positives, and \( FN \) is false negatives.
Precision quantifies the ratio of correctly identified drones to all positive predictions:
$$ \text{Precision} = \frac{TP}{TP + FP} $$
Recall, or sensitivity, indicates the ratio of correct positive predictions to all actual positives:
$$ \text{Recall} = \frac{TP}{TP + FN} $$
F1-score provides a balanced measure by combining precision and recall:
$$ \text{F1-score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} $$
These metrics are essential for comparing models, especially in scenarios involving small Unmanned Aerial Vehicles where false alarms and misses are critical.
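These definitions translate directly into code; the helper below computes all four metrics from raw confusion-matrix counts, with the example values chosen purely for illustration.

```python
# Computing the evaluation metrics above directly from confusion-matrix counts.
def detection_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Example: 90 drones found, 5 false alarms, 10 drones missed, 895 true rejections.
print(detection_metrics(tp=90, tn=895, fp=5, fn=10))
```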
Public datasets play a vital role in benchmarking drone detection algorithms. The table below summarizes key datasets, their modalities, and tasks:
| Dataset | Modalities | Labels | Tasks |
|---|---|---|---|
| MAV-VID | RGB | Manual | Tracking, Detection |
| Drone-vs-Bird | RGB | Manual | Detection, Tracking, Classification |
| Anti-UAV | RGB, TIR | Manual | Detection, Tracking |
| Svanström | RGB, TIR, Audio | Manual | Detection |
| UAV Audio Dataset | Audio | Automatic | Detection |
| DUT Anti-UAV | RGB | Manual | Detection, Tracking |
| Anti-UAV410 | TIR | Manual | Detection, Tracking |
| Zhang et al. | RGB, Audio, RF | Manual | Detection |
| MMAUD | RGB, Audio, Radar, Lidar | Automatic | Detection, Tracking, Classification |
These datasets facilitate the development and validation of multi-modal systems, though challenges such as data scarcity and diversity persist.
Problems and Challenges in Multi-Modal Methods
Despite the advantages, multi-modal drone detection faces several challenges. First, the significant scale variation of Unmanned Aerial Vehicles—from large pixels at close range to mere dots at a distance—demands adaptive resolution techniques or dynamic network architectures to maintain detection accuracy. Second, complex scenes with occlusions or cluttered backgrounds (e.g., drones near birds or other aircraft) require advanced context understanding algorithms, such as graph neural networks, to improve segmentation and identification. Third, real-time processing is crucial for applications like anti-drone surveillance, but multi-modal data fusion often increases computational load, leading to latency issues; optimizing model efficiency through lightweight designs and hardware acceleration is essential. Fourth, the scarcity and lack of diversity in public datasets limit model generalization; expanding datasets with varied scenarios and employing data augmentation, semi-supervised learning, or transfer learning can mitigate this. Finally, modality differences—such as disparate data characteristics, sampling rates, and noise levels—complicate fusion; developing robust synchronization methods and leveraging complementary information while minimizing redundancy are key research areas. Addressing these challenges will enhance the practicality and reliability of multi-modal systems in real-world drone technology applications.
Conclusion
The rapid evolution of drone technology has made Unmanned Aerial Vehicle detection a critical area of research, with deep learning-based methods significantly advancing small object recognition. However, single-modal approaches often fall short in complex environments due to limitations in feature representation and environmental adaptability. Multi-modal fusion, by integrating data from visual, infrared, audio, radar, and other sensors, offers a promising solution to improve accuracy and robustness. This review has outlined the progression from traditional to modern detection techniques, highlighted the benefits of multi-modal integration, and discussed evaluation metrics and datasets. Challenges such as scale variability, computational efficiency, and data heterogeneity remain, but future directions—including lightweight model design, real-time processing, and adaptive learning algorithms—hold great potential. By addressing these issues, multi-modal drone detection systems can achieve higher performance, enabling safer and more effective applications in security, transportation, and beyond, ultimately supporting the responsible advancement of drone technology.
