With the rapid advancement of drone technology, vehicle detection in drone-captured aerial imagery has become a cornerstone of intelligent transportation monitoring, autonomous driving, and public safety. In this review, we systematically examine the evolution, core methodologies, datasets, evaluation metrics, and open challenges of deep learning-based vehicle detection in drone aerial images. We focus on three critical aspects: small object detection, adaptation to complex backgrounds, and model lightweighting for real-time deployment. Our analysis highlights the pivotal role of datasets built by Chinese research institutions, such as VisDrone and UAVDT, and emphasizes the growing need for efficient algorithms that can run on resource-constrained drone platforms. Throughout the paper, we use tables and mathematical formulations to summarize state-of-the-art approaches and their performance.
1. Introduction
The proliferation of Chinese drone technology has transformed many fields, including traffic surveillance, autonomous navigation, and emergency response. Vehicle detection from drone aerial images enables real-time traffic flow monitoring, incident detection, and environmental perception for autonomous vehicles. However, the task faces unique challenges: small target size, complex backgrounds, varying illumination, perspective distortion, and stringent real-time requirements on limited onboard computing resources. Deep learning, particularly convolutional neural networks (CNNs) and vision transformers, has significantly advanced the field. In this work, we provide a comprehensive review from the perspective of a researcher actively working in the area, highlighting the contributions of Chinese drone platforms and datasets to global progress.
2. Historical Development
The evolution of vehicle detection in drone aerial images can be divided into three stages: traditional methods, early deep learning applications, and the current optimization phase. Early methods relied on handcrafted features such as HOG and SIFT combined with classifiers such as SVM, and they generalized poorly to complex drone scenes. The introduction of two-stage detectors such as Faster R-CNN and one-stage detectors such as YOLO marked a paradigm shift. More recently, transformer-based architectures (e.g., DETR) and lightweight networks (e.g., MobileNet, ShuffleNet) have been adapted for onboard deployment on drone platforms. The following table summarizes the chronological evolution.
| Stage | Representative Methods | Key Features | Limitations |
|---|---|---|---|
| Traditional (before 2015) | HOG + SVM, SIFT + Bag-of-Words | Handcrafted features, sliding window | Poor robustness to scale, occlusion, lighting |
| Early Deep Learning (2015–2018) | Faster R-CNN, YOLOv1/v2, SSD | End-to-end feature learning, anchor boxes | Large model size, slow inference on edge |
| Optimization and Lightweight (2019–present) | YOLOv5/v8/v11, MobileNet, Transformer fusion | Attention mechanisms, knowledge distillation, pruning | Trade-off between accuracy and speed remains |
3. Methods for Small Vehicle Detection
Small vehicles, often occupying only tens of pixels, are a major challenge. We categorize existing solutions into three sub-directions: feature extraction enhancement, occlusion and overlap handling, and multi-scale adaptation.
3.1 Feature Extraction under Low Resolution
Several works integrate super-resolution or context-aware modules to recover fine details. For example, Joint-SRVDNet combines super-resolution and detection in a joint framework. Table 2 summarizes key contributions.
| Reference | Method | Contribution | Metric | Dataset |
|---|---|---|---|---|
| [21] | Improved YOLOv5 + Deformable Attention C3 + Focal-EIoU | mAP increased by 8.4% over baseline | mAP@0.5 | VisDrone2019 (China drone dataset) |
| [22] | Multi-scale GAN + object detector | mAP +3.54%, F1 +2% | mAP, F1 | VEDAI, DOTA |
| [23-24] | Sparse representation + superpixels | Precision >86% at recall 0.7 | Precision, Recall | Toronto, OIRDS |
| [25] | SIFT + YOLOv5/DeepSORT | Speed detection accuracy >95% | Accuracy | Fixed-point drone images |
| [26] | CNN for infrared vehicles | AP 94.61%, Recall 97.11% | AP, Recall | NPU_CS_UAV_IR_DATA |
3.2 Occlusion and Overlap in Dense Scenes
Dense vehicle clusters often cause severe occlusion. Techniques such as multi-task learning, coupled region-based CNNs, and attention mechanisms have been employed. Table 3 provides a comparison.
| Reference | Method | Key Result | Limitation | Dataset |
|---|---|---|---|---|
| [28] | Multi-task residual FCN for instance segmentation | +7.31% over single-task | Complex background challenge | ISPRS, IEEE GRSS DFC2015 |
| [29] | Multimodal collaboration network (MuDet) | AP@0.5 95.07% | Dependence on height map modality | Self-built, K-SAI-LCS, ISPRS Potsdam |
| [30] | Coupled region-based CNN | Recall 77.02%, F1 0.82 | Needs heavy data augmentation | Munich vehicle dataset |
| [31] | Region CNN + hard negative mining | Recall 78.30%, F1 0.83 | Limited localization for small vehicles | Munich vehicle dataset |
| [32] | Spatial distribution feature + adaptive slicing | mAP +4.4%, real-time | Only for vehicle class | UAV aerial images (China drone) |
3.3 Multi-Scale Adaptability
To handle vehicle size variation due to altitude changes, multi-scale feature pyramids and improved anchor designs are widely adopted. Representative works are listed in Table 4.
| Reference | Method | Key Improvement | Metric | Dataset |
|---|---|---|---|---|
| [40] | Improved YOLOX + sliding window | AP 84.58% | AP | CQSkyEyeX (China drone) |
| [42] | Improved YOLOv2 | mAP 94.78% | mAP | BIT-Vehicle |
| [43] | Faster R-CNN + FPN + Focal Loss | AP 93.8% | AP | Self-built (China drone) |
| [45] | Dynamic feature refinement module | mAP 60.30% (HIT-UAV), 51.4% (DroneVehicle) | mAP | HIT-UAV, DroneVehicle |
| [46] | Cooperative fusion YOLO series | mAP@0.5 +5.5% on VisDrone | mAP | VisDrone, DroneVehicle (China drone) |
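
The sliding-window and slicing strategies that appear in Tables 3 and 4 can be illustrated with a minimal Python sketch. It is not the implementation from [32] or [40]: the function names, the 640-pixel tile size, and the 128-pixel overlap are assumptions chosen for illustration. The idea is to cut a large aerial frame into overlapping tiles so that small vehicles occupy more pixels at the detector input, then shift each tile-level box back into full-image coordinates.

```python
def tile_origins(length, tile, overlap):
    """Start offsets of tiles covering [0, length) with at least `overlap` px of overlap."""
    if tile >= length:
        return [0]
    stride = tile - overlap
    if stride <= 0:
        raise ValueError("overlap must be smaller than the tile size")
    starts = list(range(0, length - tile, stride))
    starts.append(length - tile)  # last tile sits flush with the image border
    return starts


def make_tiles(width, height, tile=640, overlap=128):
    """Return (x, y, w, h) windows covering the whole image, row by row."""
    return [
        (x, y, min(tile, width), min(tile, height))
        for y in tile_origins(height, tile, overlap)
        for x in tile_origins(width, tile, overlap)
    ]


def to_image_coords(box, tile_xy):
    """Shift an (x1, y1, x2, y2) box predicted inside a tile into full-image coordinates."""
    x1, y1, x2, y2 = box
    ox, oy = tile_xy
    return (x1 + ox, y1 + oy, x2 + ox, y2 + oy)


if __name__ == "__main__":
    tiles = make_tiles(2000, 1500)
    print(len(tiles), "tiles; first:", tiles[0], "last:", tiles[-1])
```

Because neighbouring tiles overlap, the same vehicle can be detected twice; a practical pipeline would merge the shifted boxes with non-maximum suppression before evaluation.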
4. Methods for Complex Background and Illumination
Dynamic backgrounds, adverse weather, and perspective distortions severely degrade detection accuracy. We review three sub-problems: dynamic background interference, illumination/weather variation, and geometric deformation.
4.1 Dynamic Background Interference
Attention mechanisms and multimodal fusion are key strategies. For instance, SCAF-Net incorporates scene context attention. Table 5 summarizes related methods.
| Reference | Method | Performance | Dataset |
|---|---|---|---|
| [62] | SCAF-Net (scene context attention) | AP 91.2% at IoU 0.5 | DLR 3K |
| [63] | Unified framework (loss + attention) | AP 89.8%, MAE 5.42 | Four challenging datasets |
| [64] | Dual-pooling attention module | mAP up to 98.83% | VeRi-UAV |
| [65] | Improved YOLOv8 + Swin Transformer + CBAM | mAP +4.8% | VisDrone2019 (China drone) |
| [66] | Pyramid dual pooling attention PAN (PDPA-PAN) | mAP@0.5 85.43% | LH-UAV-Vehicle (China drone self-built) |
4.2 Illumination and Weather Changes
Robustness to low-light, fog, and rain is achieved via enhancement modules or cross-modal fusion. Table 6 provides examples.
| Reference | Method | Contribution | Dataset |
|---|---|---|---|
| [72] | Retinex-guided differential Transformer (ReDT-Det) | Small target AP +3.6%; mAP +2.9% over YOLOv11 | NightDrone-Mix, DroneVehicle (night) |
| [73] | Improved YOLOv5 | Inference speed +207% | VisDrone, CARPK, VAID |
| [74] | Drone-TOOD (task-aligned) | mAP +7.9% | VisDrone, UAVDT |
| [75] | Improved YOLOv4 with transfer learning | AP 91.92% | VIVID IR |
| [77] | Mask-guided Mamba fusion (MGMF) | mAP 80.24% | DroneVehicle |
4.3 Perspective Distortion and Geometric Deformation
To handle oblique views, researchers have proposed orientation-aware modules and anchor-free detectors, while projective adversarial patches have been used to probe detector robustness under perspective changes. Table 7 summarizes key methods.
| Reference | Method | Result | Dataset |
|---|---|---|---|
| [79] | Projective-patch attack | Attack success rate +37.62% on YOLOv3 | UAV, VisDrone |
| [81] | OASA network (orientation adaptive + salience attentive) | mAP +7.1% over baseline | VRAI (China drone self-built) |
| [82] | Multi-scale adversarial network | AP[50,95] +1.3% on UAVDT, +2.6% on VisDrone | UAVDT, VisDrone |
| [80] | Anchor-free detection | mAP 84.8% on VEDAI | VEDAI, DLR Munich |
| [83] | Improved Viola-Jones | Average detection quality 82.17% | Five low-altitude UAV videos |
5. Lightweight and Real-Time Optimization
Deploying models on drone platforms with limited onboard compute demands lightweight architectures. Two main approaches exist: model compression and pruning, and lightweight network design.
5.1 Model Compression and Pruning
Techniques include weight pruning, quantization, and knowledge distillation. Table 8 highlights representative works.
| Reference | Method | Performance | Dataset |
|---|---|---|---|
| [87] | Improved YOLOv3 + SPP + multi-scale FPN | mAP +4.5% | Self-built (China drone) |
| [88] | Quantized Tiny-YOLOv3 | mAP 0.8581, F1 0.77 | Self-built (military vehicles) |
| [89] | Improved YOLOv5 + DenseBlock + GhostConv | mAP 73.1% (+4.6%) | Infrared vehicle dataset |
| [90] | DenseLightNet (lightweight) | AP 0.88, 67 FPS | Cityscapes, Pascal VOC |
| [92] | Improved YOLOv11 + BiFormer + MobileNetV3 | mAP@0.5 +13.29% | UA-DETRAC, self-built |
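
To make two of these techniques concrete, the following Python sketch shows magnitude-based weight pruning and symmetric per-tensor 8-bit quantization on a flat list of weights. It is a simplified illustration rather than the procedure used in any of the cited works; the function names and the example weights are assumptions, and real deployments typically rely on framework tooling with calibration data.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the fraction `sparsity` of weights with the smallest magnitude."""
    k = int(len(weights) * sparsity)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    dropped = set(order[:k])
    return [0.0 if i in dropped else w for i, w in enumerate(weights)]


def quantize_symmetric(weights, num_bits=8):
    """Map floats to signed integers in [-qmax, qmax] using a single scale factor."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for 8 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float weights from the integer codes."""
    return [v * scale for v in q]


if __name__ == "__main__":
    w = [-0.8, -0.05, 0.0, 0.3, 1.27]
    q, scale = quantize_symmetric(w)
    print("pruned:", magnitude_prune(w, 0.4))
    print("codes:", q, "scale:", scale)
    print("restored:", dequantize(q, scale))
```

The rounding error of each restored weight is bounded by half the scale factor, which is why accuracy tends to drop when a few large outlier weights inflate the scale.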
5.2 Lightweight Network Design
Designing efficient backbones (e.g., GhostNet, ShuffleNet) and adding attention modules with minimal overhead are common strategies. Table 9 provides a summary.
| Reference | Method | Key Advantage | Dataset |
|---|---|---|---|
| [94] | RFAConv, CSP-BLRAN, MS-FPN, GWDLoss | Params –22.9%, mAP@50-95 +4.5% vs YOLO11n | VisDrone, Vehicle (China drone) |
| [95] | GhostNetV2 + CA module | AP improved, 99.7 FPS | Roboflow car dataset |
| [96] | OSD-YOLOv10 (OCRConv, lightweight backbone) | Params –40.7%, mAP +1.3% | VisDrone-DET2019, UAVDT |
| [99] | PrFu-YOLO (improved YOLOv8) | mAP@0.5 +10.05%, lighter model | VisDrone2019, CARPK |
| [102] | VDXNet (RxDF, LiteFPP, CRDown) | Only 1.608M params, mAP 96.3% | UCAS-AOD, VEDAI, UAV-ROD, UAVDT |
| [107] | Global attention + multi-path fusion | AP 83.99%, 29.4 FPS, 24.4 MB | UAV-ROD, UCAS-AOD |
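
The parameter savings behind such designs can be estimated with simple arithmetic. The sketch below compares a standard convolution with a depthwise separable convolution (as in MobileNet) and a Ghost module (as in GhostNet). It counts convolution weights only, ignoring biases and normalization layers; the function names, the default ratio of 2, the 3×3 cheap operation, and the requirement that the output channel count be divisible by the ratio are simplifying assumptions.

```python
def standard_conv_params(c_in, c_out, k):
    """Weights of a k x k convolution (bias and normalization ignored)."""
    return c_in * c_out * k * k


def depthwise_separable_params(c_in, c_out, k):
    """k x k depthwise convolution followed by a 1 x 1 pointwise convolution."""
    return c_in * k * k + c_in * c_out


def ghost_module_params(c_in, c_out, k, ratio=2, cheap_k=3):
    """Primary k x k convolution producing c_out / ratio intrinsic maps, plus
    cheap depthwise cheap_k x cheap_k operations generating the remaining maps."""
    if c_out % ratio != 0:
        raise ValueError("this sketch assumes c_out is divisible by ratio")
    intrinsic = c_out // ratio
    primary = c_in * intrinsic * k * k
    cheap = intrinsic * (ratio - 1) * cheap_k * cheap_k
    return primary + cheap


if __name__ == "__main__":
    c_in, c_out, k = 128, 256, 3
    print("standard:           ", standard_conv_params(c_in, c_out, k))
    print("depthwise separable:", depthwise_separable_params(c_in, c_out, k))
    print("ghost module:       ", ghost_module_params(c_in, c_out, k))
```

For a layer with 128 input channels, 256 output channels, and a 3×3 kernel, this gives 294,912 weights for the standard convolution, 33,920 for the depthwise separable variant, and 148,608 for the Ghost module, which shows where the parameter reductions reported in Table 9 can come from.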
6. Datasets
Datasets are the foundation of deep learning research. Datasets built by Chinese institutions, such as VisDrone, UAVDT, and DroneVehicle, have been instrumental. Table 10 summarizes the most common public datasets.
| Dataset | Platform | # Images | # Instances | Resolution | Key Features |
|---|---|---|---|---|---|
| VisDrone (China drone) | Drone | 10,209 images + 263 videos | 2.5 million | 2000×1500 | Multiple Chinese cities, various weather |
| UAVDT (China drone) | Drone | 80,000 frames | 840,000 | 1080×540 | Various altitudes, occlusion, illumination |
| VEDAI | Satellite / Aerial | ~1,200 | ~12,500 | 512×512 / 1024×1024 | Multi-spectral, small targets |
| DOTA | Various sensors | 2,806 | 188,282 | 800×800 ~ 4000×4000 | Oriented bounding boxes |
| DroneVehicle (China drone) | Drone (RGB + IR) | Large-scale | – | – | RGB-IR cross-modality, day/night |
| CARPK | Drone | 1,448 | 89,777 | – | Parking lot views |
| UAV123 | Drone | 123 video sequences | – | 1280×720 ~ 3840×2160 | Wide variety of scenes |
Self-built datasets are also common. Examples from Chinese research groups include the LH-UAV-Vehicle dataset, collected by drone at altitudes of 250–400 m, and the VRAI dataset for vehicle re-identification. Such datasets improve the applicability of models to real-world scenarios.

7. Evaluation Metrics and Performance Comparison
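Key metrics include Precision, Recall, F1-score, mean Average Precision (mAP), Intersection over Union (IoU), and frames per second (FPS); the false positive rate (FPR) and false negative rate (FNR) are also used. Here $TP$, $FP$, $TN$, and $FN$ denote the numbers of true positives, false positives, true negatives, and false negatives, $N$ is the number of classes, $AP_i$ is the average precision of class $i$, and $B_{gt}$ and $B_{pred}$ are the ground-truth and predicted bounding boxes. The definitions are given below.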
$$ Precision = \frac{TP}{TP+FP} $$
$$ Recall = \frac{TP}{TP+FN} $$
$$ F1 = 2 \times \frac{Precision \times Recall}{Precision + Recall} $$
$$ mAP = \frac{1}{N} \sum_{i=1}^{N} AP_i $$
$$ IoU = \frac{\mathrm{area}(B_{gt} \cap B_{pred})}{\mathrm{area}(B_{gt} \cup B_{pred})} $$
$$ FPR = \frac{FP}{FP+TN} $$
$$ FNR = \frac{FN}{TP+FN} $$
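
The definitions above translate directly into code. The following Python sketch is illustrative only: the function names are our own, and boxes are assumed to be axis-aligned tuples `(x1, y1, x2, y2)` with `x2 > x1` and `y2 > y1`.

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 from counts; undefined ratios are returned as 0."""
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom > 0 else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    print(iou((0, 0, 10, 10), (5, 5, 15, 15)))   # two 10x10 boxes offset by 5 px
    print(precision_recall_f1(tp=8, fp=2, fn=2))
```

In detection benchmarks, a prediction usually counts as a true positive only if its IoU with an unmatched ground-truth box reaches a chosen threshold such as 0.5.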
Table 11 summarizes the meaning of each metric.
| Metric | Meaning |
|---|---|
| Precision | Proportion of true vehicle detections among all predicted vehicles. |
| Recall | Proportion of actual vehicles correctly detected. |
| F1 | Harmonic mean of precision and recall. |
| mAP | Mean of AP across all classes; often reported at IoU threshold 0.5. |
| IoU | Overlap between predicted and ground truth boxes. |
| FPS | Frames per second; measures real-time capability. |
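
As a hedged illustration of how AP and mAP are obtained in practice, the sketch below computes AP as the area under the precision-recall curve after taking the monotone precision envelope (all-point interpolation, one common convention; benchmarks differ in the details). It assumes detections have already been matched to ground truth at a fixed IoU threshold, so each detection arrives as a `(confidence, is_true_positive)` pair; the function names are our own.

```python
def average_precision(scored_hits, num_gt):
    """AP for one class from (confidence, is_true_positive) pairs and the ground-truth count."""
    if num_gt == 0:
        return 0.0
    hits = sorted(scored_hits, key=lambda d: d[0], reverse=True)
    tp = fp = 0
    recalls, precisions = [], []
    for _, is_tp in hits:
        if is_tp:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / num_gt)
        precisions.append(tp / (tp + fp))
    # Precision envelope: make precision non-increasing as recall grows.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Sum the rectangles under the envelope between successive recall values.
    ap, prev_recall = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


def mean_average_precision(ap_per_class):
    """mAP as the arithmetic mean of the per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)


if __name__ == "__main__":
    car = average_precision([(0.9, True), (0.8, False), (0.7, True)], num_gt=2)
    truck = average_precision([(0.95, True), (0.6, True)], num_gt=2)
    print(car, truck, mean_average_precision([car, truck]))
```

In this toy example, the single false positive ranked between two true positives lowers the AP of the first class to 5/6, while the second class reaches 1.0.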
We have also compiled performance comparisons across different method categories. Table 12 shows small object detection models.
| Category | Example | Advantage | Limitation |
|---|---|---|---|
| YOLO series improvements | Improved YOLOX [40], Improved YOLOv5 [21] | Fast, good balance for drones | Small features may be lost in deep layers |
| Super-resolution fusion models | Joint-SRVDNet [22], EnsembleNet [33] | High accuracy for low-resolution targets | High computational cost |
| Data augmentation approaches | Synthetic data generation [53] | Alleviates data scarcity | Dependence on synthetic quality |
Table 13 summarizes complex background adaptation models.
| Category | Example | Advantage | Limitation |
|---|---|---|---|
| Attention mechanisms | SCAF-Net [62], Dual-pooling [64], PVswin-YOLOv8s [65] | Effective background suppression | May increase complexity |
| Multi-modal fusion | UA-CMDet [67], AFFCM [78] | Robust to lighting and weather | Dependence on paired modalities |
| Lightweight anti-interference | Drone-TOOD [74], Improved YOLOv8-OBB [70] | Balance between accuracy and speed | Weaker on severe occlusion |
Table 14 shows lightweight and real-time optimization models.
| Category | Example | Advantage | Limitation |
|---|---|---|---|
| Simplified YOLO versions | Fine-tuned YOLOv5 [98], Improved YOLOv11 [92] | High speed, low parameters | Reduced small-target accuracy |
| Model compression/quantization | Quantized Tiny-YOLOv3 [88], GhostConv YOLOv5 [89] | Very low resource usage | Accuracy loss if over-compressed |
| Multi-task lightweight models | PrFu-YOLO [99], MultEYE [108] | Efficient resource use; one model serves multiple tasks | Task interference possible |
8. Conclusion and Future Directions
In this review, we have systematically presented the state of the art in deep learning-based vehicle detection for drone aerial images, with special emphasis on contributions from Chinese drone research. Despite significant progress, several challenges remain: detecting extremely small vehicles under severe occlusion, maintaining robust performance in adverse weather, and deploying models efficiently on edge devices. Future research will likely focus on:
(1) Adaptive lightweight architecture search: Automatically designing networks tailored to specific drone hardware using NAS, while maintaining accuracy.
(2) Enhanced small object detection and multi-scale handling: Incorporating dynamic feature refinement and transformer-based global context.
(3) Robustness to complex environments: Leveraging multi-modal fusion (RGB, IR, LiDAR) and domain adaptation techniques.
(4) Cross-domain generalization and data efficiency: Using self-supervised learning, synthetic data, and federated learning to reduce annotation burden.
(5) Multi-task and multi-modal integration: Unified frameworks for detection, tracking, and traffic parameter estimation, running in real time on drone platforms.
We believe that continued innovation in both algorithms and hardware will enable Chinese drone technology to play an even more significant role in intelligent transportation and public safety worldwide.
