As a researcher deeply engaged in anti-drone technology within China's rapidly evolving UAV landscape, I present this systematic review of multi-modal fusion techniques for drone detection. The proliferation of UAV applications across China, spanning logistics, surveillance, and agriculture, demands robust countermeasures against security threats. Traditional single-modal methods (visual, audio, radar) falter in complex environments, whereas multi-modal fusion significantly improves accuracy and robustness.

1. Single-Modal Detection: Limitations and Advances
1.1 Traditional Methods
Early approaches relied on handcrafted features:
- Visual: HOG descriptors capturing edge and gradient structure, sensitive to noise.
- Audio: MFCC/LPC coefficients encoding acoustic signatures, limited by ambient noise (see the extraction sketch after Table 1).
- Radar: RCS/Doppler features for motion tracking, vulnerable to electromagnetic interference.
*Table 1: Traditional Single-Modal Detection Challenges*

Modality | Accuracy | Key Limitations
---|---|---
Visual | 60–70% | Low-light degradation
Audio | ~61% | Noise interference
Radar | 65–75% | Low resolution, EMI susceptibility
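As a concrete illustration of the traditional audio pipeline, the following minimal sketch extracts MFCC features with librosa and pools them into a fixed-length clip descriptor. The file name, 16 kHz sample rate, and mean/std pooling are illustrative assumptions rather than details of any surveyed system.

```python
# A minimal sketch of handcrafted acoustic feature extraction, assuming a
# mono recording resampled to 16 kHz; "drone_clip.wav" is a hypothetical file.
import librosa
import numpy as np

def extract_mfcc_descriptor(path: str, n_mfcc: int = 13) -> np.ndarray:
    """Return a fixed-length clip descriptor built from MFCC statistics."""
    signal, sr = librosa.load(path, sr=16000, mono=True)
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)  # (n_mfcc, frames)
    # Mean/std pooling over time gives one vector per clip, the kind of
    # feature early systems fed to an SVM or GMM classifier.
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

descriptor = extract_mfcc_descriptor("drone_clip.wav")  # shape: (26,)
```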
1.2 Deep Learning-Based Methods
1.2.1 Visual Detection
- Two-stage detectors (e.g., Faster R-CNN): High precision but computationally intensive. On the DUT Anti-UAV dataset, a ResNet50 backbone achieves 65.3% accuracy.
- Single-stage detectors (e.g., YOLOX): Efficient yet less accurate for small targets; with a VGG16 backbone, YOLOX attains 55.1% accuracy (see the inference sketch after Table 2).
*Table 2: Drone Detection Performance on DUT Anti-UAV Dataset*

Model | Backbone | Accuracy
---|---|---
Faster R-CNN | ResNet50 | 65.3%
Cascade R-CNN | ResNet50 | 68.3%
YOLOX | VGG16 | 55.1%
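As a sketch of how a two-stage visual detector is applied at inference time, the snippet below runs torchvision's stock Faster R-CNN with a ResNet-50 FPN backbone. The COCO-pretrained weights and the 0.5 confidence threshold are stand-in assumptions; the results in Table 2 come from models fine-tuned on DUT Anti-UAV.

```python
# A minimal inference sketch with torchvision's off-the-shelf Faster R-CNN
# (ResNet-50 FPN backbone). COCO-pretrained weights are a stand-in for
# DUT Anti-UAV fine-tuning.
import torch
from torchvision.models.detection import (
    FasterRCNN_ResNet50_FPN_Weights,
    fasterrcnn_resnet50_fpn,
)

model = fasterrcnn_resnet50_fpn(
    weights=FasterRCNN_ResNet50_FPN_Weights.DEFAULT
).eval()

frame = torch.rand(3, 480, 640)  # placeholder for a real RGB frame in [0, 1]
with torch.no_grad():
    detections = model([frame])[0]  # dict with 'boxes', 'labels', 'scores'

keep = detections["scores"] > 0.5  # confidence gate
print(detections["boxes"][keep])   # surviving xyxy boxes
```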
1.2.2 Audio Detection
End-to-end models like DOANet use 1D-CNNs for azimuth/elevation estimation. Challenges persist in noisy urban operations, where systems typically achieve F1-scores of 80% or below; an illustrative 1D-CNN sketch follows.
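The sketch below shows a minimal end-to-end 1D-CNN over raw waveforms in PyTorch. The layer sizes and the binary drone/no-drone head are illustrative assumptions, not the published DOANet architecture, which additionally regresses direction of arrival.

```python
# An illustrative end-to-end 1D-CNN over raw audio; layer sizes and the
# binary drone/no-drone head are assumptions, not the DOANet design.
import torch
import torch.nn as nn

class Audio1DCNN(nn.Module):
    def __init__(self, n_classes: int = 2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=64, stride=4), nn.ReLU(), nn.MaxPool1d(4),
            nn.Conv1d(16, 32, kernel_size=32, stride=2), nn.ReLU(), nn.MaxPool1d(4),
            nn.Conv1d(32, 64, kernel_size=16, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),  # collapse the time axis to one step
        )
        self.classifier = nn.Linear(64, n_classes)

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: (batch, 1, samples) raw audio, no handcrafted features
        return self.classifier(self.features(waveform).squeeze(-1))

logits = Audio1DCNN()(torch.randn(8, 1, 16000))  # one second at 16 kHz
```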
1.2.3 Radar Detection
CNNs process range-Doppler maps and excel in low-SNR scenarios: Wang et al.'s 8-layer CNN achieves 98.5% accuracy at an SNR of −20 dB. An illustrative sketch follows.
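Below is a minimal sketch of CNN classification on range-Doppler maps; the 64×64 input size, layer count, and class head are illustrative assumptions, as Wang et al.'s exact 8-layer configuration is not reproduced here.

```python
# An illustrative CNN over range-Doppler magnitude maps; the 64x64 input
# and shallow layer stack are assumptions, not Wang et al.'s 8-layer design.
import torch
import torch.nn as nn

radar_cnn = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(32 * 16 * 16, 2),  # drone / clutter logits for 64x64 inputs
)

rd_maps = torch.randn(4, 1, 64, 64)  # batch of range-Doppler maps
logits = radar_cnn(rd_maps)
```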
2. Multi-Modal Fusion: Paradigms and Performance
Fusing complementary modalities overcomes single-sensor limitations and is critical for UAV security in China's dynamic operating environments.
2.1 RGB-Thermal (RGBT) Fusion
Combines visible light and infrared to address lighting variations. Key innovations:
- YOLOv5y (Yao et al.): Integrates C3-T modules, boosting mAP@0.5 by 14.04% on VEDAI.
- LRAF-Net (Guo et al.): Cross-modal attention for low-light enhancement, mAP@0.5 = 97.9% on LLVIP.
*Table 3: RGBT Fusion Performance*

Method | mAP@0.5 | Key Innovation
---|---|---
YOLOv5-GAN | 96.3% | Attention mechanisms
GhostFusion | 60% | Lightweight PAN adaptation
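To make the fusion mechanics concrete, here is a minimal mid-level RGB-thermal fusion sketch: separate per-modality encoders, channel concatenation, and a 1×1 convolution to mix modalities. All layer choices are illustrative; this is the simplest fusion baseline, not the cross-modal attention used by LRAF-Net.

```python
# A minimal mid-level RGB-thermal fusion sketch: per-modality encoders,
# channel concatenation, and a 1x1 conv to mix modalities (illustrative).
import torch
import torch.nn as nn

class RGBTFusion(nn.Module):
    def __init__(self, channels: int = 64):
        super().__init__()
        self.rgb_encoder = nn.Conv2d(3, channels, kernel_size=3, padding=1)
        self.thermal_encoder = nn.Conv2d(1, channels, kernel_size=3, padding=1)
        self.mix = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, rgb: torch.Tensor, thermal: torch.Tensor) -> torch.Tensor:
        fused = torch.cat(
            [self.rgb_encoder(rgb), self.thermal_encoder(thermal)], dim=1
        )
        return self.mix(fused)  # fused map for a downstream detection head

fused = RGBTFusion()(torch.randn(1, 3, 256, 256), torch.randn(1, 1, 256, 256))
```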
2.2 Visual-Audio Fusion
Audio provides temporal cues to augment spatial visual data:
- AV-FDTI (Yang et al.): CRNN + ResNet50 fusion achieves 99.6% accuracy under ideal conditions.
- Deeplomatics (Bavu et al.): BeamLearning networks synchronize microphone arrays with cameras, yielding a 3D localization error below 7° (a decision-level fusion sketch follows this list).
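As a contrast to the embedding-level fusion of AV-FDTI, the sketch below shows decision-level (late) fusion: each modality produces class logits independently and their softmax scores are averaged. The 0.6 visual weight is an illustrative assumption one would tune on validation data.

```python
# Decision-level (late) audio-visual fusion: average per-modality softmax
# scores. The 0.6 visual weight is an assumption to be tuned on validation.
import torch

def late_fuse(visual_logits: torch.Tensor, audio_logits: torch.Tensor,
              visual_weight: float = 0.6) -> torch.Tensor:
    """Weighted average of class probabilities from two independent branches."""
    v = torch.softmax(visual_logits, dim=-1)
    a = torch.softmax(audio_logits, dim=-1)
    return visual_weight * v + (1.0 - visual_weight) * a

scores = late_fuse(torch.randn(8, 2), torch.randn(8, 2))  # (batch, classes)
```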
2.3 Visual-Radar Fusion
Radar supplies motion data where vision fails (e.g., in fog):
- Semantic Association Networks (Huang et al.): 97.7% accuracy in urban UAV tracking.
- Lightweight mmWave-Camera Fusion (Wang et al.): A 44.43% FPS increase over YOLOv8n (a geometric association sketch follows this list).
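A minimal geometric baseline for radar-camera association is sketched below: radar detections are projected into the image with a calibration matrix and matched to the nearest visual bounding-box center. The projection matrix, the 40-pixel gate, and the nearest-neighbor rule are illustrative assumptions; learned semantic association replaces this hand-tuned matching.

```python
# A geometric radar-camera association baseline: project radar points into
# the image and match each to the nearest box center. The projection matrix
# P and the 40-pixel gate are hypothetical calibration/tuning values.
import numpy as np

def associate(radar_xyz: np.ndarray, boxes: np.ndarray, P: np.ndarray,
              max_px: float = 40.0) -> list:
    """radar_xyz: (N, 3) points; boxes: (M, 4) xyxy; P: (3, 4) projection."""
    homo = np.hstack([radar_xyz, np.ones((len(radar_xyz), 1))])
    uvw = homo @ P.T
    uv = uvw[:, :2] / uvw[:, 2:3]                # pixel coordinates (N, 2)
    centers = (boxes[:, :2] + boxes[:, 2:]) / 2  # box centers (M, 2)
    pairs = []
    for i, point in enumerate(uv):
        dists = np.linalg.norm(centers - point, axis=1)
        j = int(dists.argmin())
        if dists[j] < max_px:                    # gate implausible matches
            pairs.append((i, j))
    return pairs
```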
*Table 4: Multi-Modal Fusion Efficacy*

Fusion Type | Accuracy | Advantages
---|---|---
RGBT | 83–97.9% | All-weather operation
Visual-Audio | 90–99.6% | Low-cost deployment
Visual-Radar | 94.8–97.7% | Long-range, EMI-resilient
3. Evaluation Metrics and Datasets
3.1 Metrics
Critical metrics for assessing UAV detection:

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}, \qquad F_1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
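These definitions translate directly into code; the confusion counts below are toy numbers purely for illustration.

```python
# Direct implementation of the formulas above with illustrative toy counts.
def detection_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}

print(detection_metrics(tp=90, fp=10, fn=15, tn=85))
```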
3.2 Datasets
*Table 5: China UAV-Centric Public Datasets*

Dataset | Modalities | Size | Use Case
---|---|---|---
Anti-UAV | RGB, Thermal | 186,494 frames | Cross-modal tracking
MMAUD | RGB, LiDAR, Radar | 1,700+ seconds | Classification
UAVAudio Dataset | Audio | 5,215 seconds | Acoustic ID
DUT Anti-UAV | RGB | 10,000 images | Small-target detection
4. Challenges and Future Directions
- Scale Variation: UAV targets exhibit rapid apparent-size changes (e.g., from 34×23 to 136×77 pixels), so adaptive-resolution networks are essential.
- Complex Scenes: Clutter (birds, buildings) necessitates context-aware detection, for instance via Graph Neural Networks.
- Real-Time Efficiency: Optimizing fusion algorithms for edge devices (e.g., drones, IoT sensors) is critical for UAV surveillance in China.
- Data Scarcity: Few datasets reflect China's urban canyons or high-altitude wind conditions; semi-supervised learning could mitigate labeling costs.
- Modality Alignment: Temporal synchronization of audio-visual data remains an open problem.
Future work must prioritize:
- Lightweight architectures for embedded counter-UAV systems.
- Cross-modal self-attention to dynamically weight sensor inputs (see the sketch after this list).
- Generative augmentation simulating China-specific environments (smog, congested RF spectra).
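As a sketch of the cross-modal self-attention direction, the module below lets per-modality embeddings attend to one another and then applies a learned softmax gate to weight their contributions. The embedding dimension, head count, and gating scheme are illustrative assumptions, not a published design.

```python
# An illustrative cross-modal self-attention gate: per-modality embeddings
# attend to one another, then a learned softmax gate weights their
# contributions. Dimensions and gating are assumptions, not a known design.
import torch
import torch.nn as nn

class CrossModalGate(nn.Module):
    def __init__(self, dim: int = 128, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=n_heads, batch_first=True)
        self.gate = nn.Linear(dim, 1)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, n_modalities, dim), one embedding per sensor
        mixed, _ = self.attn(tokens, tokens, tokens)
        weights = torch.softmax(self.gate(mixed), dim=1)  # (batch, n_mod, 1)
        return (weights * mixed).sum(dim=1)               # (batch, dim)

fused = CrossModalGate()(torch.randn(4, 3, 128))  # e.g., RGB/audio/radar tokens
```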
5. Conclusion
Multi-modal fusion is indispensable for securing China's UAV ecosystem. While RGB-thermal and visual-radar integrations show promise (accuracies above 97%), scalability and environmental adaptability still require breakthroughs. Future research must bridge algorithmic innovation with China's unique operational demands, ensuring that UAV technologies enhance safety without compromising national security. As UAV deployments across China surge, multi-modal detection will remain a cornerstone of intelligent airspace management.