Advances in Vehicle Detection in Drone Aerial Images Based on Deep Learning

With the rapid advancement of drone technology, vehicle detection in aerial imagery captured by drones has become a cornerstone of intelligent transportation monitoring, autonomous driving, and public safety. In this review, we systematically survey the evolution, core methodologies, datasets, evaluation metrics, and open challenges of deep learning-based vehicle detection in drone aerial images. We focus on three critical aspects: small object detection, adaptation to complex backgrounds, and model lightweighting for real-time deployment. Our analysis highlights the pivotal role of drone datasets contributed by Chinese research institutions, such as VisDrone and UAVDT, and emphasizes the growing need for efficient algorithms that can run on resource-constrained drone platforms. Throughout the paper, we use tables and mathematical formulations to summarize state-of-the-art approaches and their reported performance.

1. Introduction

The proliferation of drone technology, including platforms developed in China, has transformed fields such as traffic surveillance, autonomous navigation, and emergency response. Vehicle detection from drone aerial images enables real-time traffic flow monitoring, incident detection, and environmental perception for autonomous vehicles. However, the task faces distinctive challenges: small target size, complex backgrounds, varying illumination, perspective distortion, and stringent real-time requirements on limited onboard computing resources. Deep learning, particularly convolutional neural networks (CNNs) and vision transformers, has significantly advanced the field. In this work, we provide a comprehensive review from the perspective of researchers active in the area, highlighting the contributions of Chinese drone platforms and datasets to global progress.

2. Historical Development

The evolution of vehicle detection in drone aerial images can be divided into three stages: traditional methods, early deep learning applications, and the current optimization phase. Early methods relied on handcrafted features such as HOG and SIFT combined with classifiers like SVM, and generalized poorly to complex drone scenes. The introduction of two-stage detectors like Faster R-CNN and one-stage detectors like YOLO marked a paradigm shift. More recently, transformer-based architectures (e.g., DETR) and lightweight networks (e.g., MobileNet, ShuffleNet) have been tailored for deployment on drone platforms. Table 1 summarizes this chronological evolution.

Stage	Representative Methods	Key Features	Limitations
Traditional (before 2015)	HOG + SVM, SIFT + Bag-of-Words	Handcrafted features, sliding window	Poor robustness to scale, occlusion, lighting
Early Deep Learning (2015–2018)	Faster R-CNN, YOLOv1/v2, SSD	End-to-end feature learning, anchor boxes	Large model size, slow inference on edge
Optimization and Lightweight (2019–present)	YOLOv5/v8/v11, MobileNet, Transformer fusion	Attention mechanisms, knowledge distillation, pruning	Trade-off between accuracy and speed remains

3. Methods for Small Vehicle Detection

Small vehicles, often occupying only tens of pixels, are a major challenge. We categorize existing solutions into three sub-directions: feature extraction enhancement, occlusion and overlap handling, and multi-scale adaptation.

3.1 Feature Extraction under Low Resolution

Several works integrate super-resolution or context-aware modules to recover fine details. For example, Joint-SRVDNet combines super-resolution and detection in a joint framework. Table 2 summarizes key contributions.

Reference	Method	Contribution	Metric	Dataset
[21]	Improved YOLOv5 + Deformable Attention C3 + Focal-EIoU	mAP increased by 8.4% over baseline	mAP@0.5	VisDrone2019 (China drone dataset)
[22]	Multi-scale GAN + object detector	mAP +3.54%, F1 +2%	mAP, F1	VEDAI, DOTA
[23-24]	Sparse representation + superpixels	Precision >86% at recall 0.7	Precision, Recall	Toronto, OIRDS
[25]	SIFT + YOLOv5/DeepSORT	Speed detection accuracy >95%	Accuracy	Fixed-point drone images
[26]	CNN for infrared vehicles	AP 94.61%, Recall 97.11%	AP, Recall	NPU_CS_UAV_IR_DATA

3.2 Occlusion and Overlap in Dense Scenes

Dense vehicle clusters often cause severe occlusion. Techniques such as multi-task learning, coupled region-based CNNs, and attention mechanisms have been employed. Table 3 provides a comparison.

Reference	Method	Key Result	Limitation	Dataset
[28]	Multi-task residual FCN for instance segmentation	+7.31% over single-task	Complex background challenge	ISPRS, IEEE GRSS DFC2015
[29]	Multimodal collaboration network (MuDet)	AP@0.5 95.07%	Dependence on height map modality	Self-built, K-SAI-LCS, ISPRS Potsdam
[30]	Coupled region-based CNN	Recall 77.02%, F1 0.82	Needs heavy data augmentation	Munich vehicle dataset
[31]	Region CNN + hard negative mining	Recall 78.30%, F1 0.83	Limited localization for small vehicles	Munich vehicle dataset
[32]	Spatial distribution feature + adaptive slicing	mAP +4.4%, real-time	Only for vehicle class	UAV aerial images (China drone)

3.3 Multi-Scale Adaptability

To handle vehicle size variation due to altitude changes, multi-scale feature pyramids and improved anchor designs are widely adopted. Representative works are listed in Table 4.

Reference	Method	Key Improvement	Metric	Dataset
[40]	Improved YOLOX + sliding window	AP 84.58%	AP	CQSkyEyeX (China drone)
[42]	Improved YOLOv2	mAP 94.78%	mAP	BIT-Vehicle
[43]	Faster R-CNN + FPN + Focal Loss	AP 93.8%	AP	Self-built (China drone)
[45]	Dynamic feature refinement module	mAP 60.30% and 51.4%, respectively	mAP	HIT-UAV, DroneVehicle
[46]	Cooperative fusion YOLO series	mAP@0.5 +5.5% on VisDrone	mAP	VisDrone, DroneVehicle (China drone)
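Sliding-window inference, as used alongside the improved YOLOX in [40], can be illustrated with a generic tiling sketch. The code below is not the procedure of any cited work: the tile size, the overlap, and the helper names (`axis_starts`, `make_tiles`, `to_image_coords`) are assumptions chosen for illustration. It splits a large frame into overlapping fixed-size tiles so that small vehicles occupy a larger fraction of each detector input, and maps tile-local boxes back to full-image coordinates.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels


def axis_starts(length: int, tile: int, stride: int) -> List[int]:
    """Tile start offsets along one axis so the tiles cover [0, length)."""
    if length <= tile:
        return [0]
    starts = list(range(0, length - tile + 1, stride))
    # Add one last tile flush with the border if the stride left a gap.
    if starts[-1] + tile < length:
        starts.append(length - tile)
    return starts


def make_tiles(width: int, height: int,
               tile: int = 640, overlap_px: int = 128) -> List[Box]:
    """Overlapping tiles in row-major order (illustrative defaults)."""
    if tile <= 0 or not 0 <= overlap_px < tile:
        raise ValueError("need tile > 0 and 0 <= overlap_px < tile")
    stride = tile - overlap_px
    return [
        (x, y, min(x + tile, width), min(y + tile, height))
        for y in axis_starts(height, tile, stride)
        for x in axis_starts(width, tile, stride)
    ]


def to_image_coords(box: Box, tile_box: Box) -> Box:
    """Shift a box predicted inside a tile back to full-image coordinates."""
    ox, oy = tile_box[0], tile_box[1]
    return (box[0] + ox, box[1] + oy, box[2] + ox, box[3] + oy)


if __name__ == "__main__":
    # 2000x1500 is the VisDrone resolution listed in Table 10.
    tiles = make_tiles(2000, 1500)
    print(len(tiles), tiles[0], tiles[-1])
```

Detections that fall in the overlap between neighbouring tiles still have to be de-duplicated after mapping back to image coordinates; that step is omitted here.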

4. Methods for Complex Background and Illumination

Dynamic backgrounds, adverse weather, and perspective distortions severely degrade detection accuracy. We review three sub-problems: dynamic background interference, illumination/weather variation, and geometric deformation.

4.1 Dynamic Background Interference

Attention mechanisms and multimodal fusion are key strategies. For instance, SCAF-Net incorporates scene context attention. Table 5 summarizes related methods.

Reference	Method	Performance	Dataset
[62]	SCAF-Net (scene context attention)	AP 91.2% at IoU 0.5	DLR 3K
[63]	Unified framework (loss + attention)	AP 89.8%, MAE 5.42	Four challenging datasets
[64]	Dual-pooling attention module	mAP up to 98.83%	VeRi-UAV
[65]	Improved YOLOv8 + Swin Transformer + CBAM	mAP +4.8%	VisDrone2019 (China drone)
[66]	Pyramid dual pooling attention PAN (PDPA-PAN)	mAP@0.5 85.43%	LH-UAV-Vehicle (China drone self-built)

4.2 Illumination and Weather Changes

Robustness to low-light, fog, and rain is achieved via enhancement modules or cross-modal fusion. Table 6 provides examples.

Reference	Method	Contribution	Dataset
[72]	Retinex-guided differential Transformer (ReDT-Det)	Small target AP +3.6%; mAP +2.9% over YOLOv11	NightDrone-Mix, DroneVehicle (night)
[73]	Improved YOLOv5	Inference speed +207%	VisDrone, CARPK, VAID
[74]	Drone-TOOD (task-aligned)	mAP +7.9%	VisDrone, UAVDT
[75]	Improved YOLOv4 with transfer learning	AP 91.92%	VIVID IR
[77]	Mask-guided Mamba fusion (MGMF)	mAP 80.24%	DroneVehicle

4.3 Perspective Distortion and Geometric Deformation

To handle oblique views, researchers have proposed orientation-aware modules and anchor-free detectors. Adversarial projective-patch attacks have also been studied in this setting. Table 7 summarizes key methods.

Reference	Method	Result	Dataset
[79]	Projective-patch attack	Attack success rate +37.62% on YOLOv3	UAV, VisDrone
[81]	OASA network (orientation adaptive + salience attentive)	mAP +7.1% over baseline	VRAI (China drone self-built)
[82]	Multi-scale adversarial network	AP@[0.5:0.95] +1.3% on UAVDT, +2.6% on VisDrone	UAVDT, VisDrone
[80]	Anchor-free detection	mAP 84.8% on VEDAI	VEDAI, DLR Munich
[83]	Improved Viola-Jones	Average detection quality 82.17%	Five low-altitude UAV videos

5. Lightweight and Real-Time Optimization

Deploying models on onboard drone hardware demands lightweight architectures. Two main streams exist: model compression/pruning and lightweight network design.

5.1 Model Compression and Pruning

Techniques include weight pruning, quantization, and knowledge distillation. Table 8 highlights representative works.

Reference	Method	Performance	Dataset
[87]	Improved YOLOv3 + SPP + multi-scale FPN	mAP +4.5%	Self-built (China drone)
[88]	Quantized Tiny-YOLOv3	mAP 0.8581, F1 0.77	Self-built (military vehicles)
[89]	Improved YOLOv5 + DenseBlock + GhostConv	mAP 73.1% (+4.6%)	Infrared vehicle dataset
[90]	DenseLightNet (lightweight)	AP 0.88, 67 FPS	Cityscapes, Pascal VOC
[92]	Improved YOLOv11 + BiFormer + MobileNetV3	mAP@0.5 +13.29%	UA-DETRAC, self-built
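Two of the compression techniques named above, weight pruning and quantization, can be shown on a flat list of weights. The following is a minimal sketch under stated assumptions: it uses unstructured magnitude pruning and symmetric per-tensor int8 quantization, which are common textbook variants and not necessarily what the works in Table 8 use. The function names are hypothetical.

```python
from typing import List, Tuple


def prune_by_magnitude(weights: List[float], sparsity: float) -> List[float]:
    """Zero out the fraction `sparsity` of weights with the smallest magnitude."""
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be in [0, 1]")
    n_prune = int(len(weights) * sparsity)
    # Indices in ascending order of |w|; the first n_prune are dropped.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(order[:n_prune])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor quantization to integers in [-127, 127]."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q: List[int], scale: float) -> List[float]:
    """Map integer codes back to approximate floating-point weights."""
    return [v * scale for v in q]


if __name__ == "__main__":
    w = [0.5, -0.02, 0.3, 0.01, -1.0, 0.0]
    print(prune_by_magnitude(w, 0.5))
    codes, s = quantize_int8(w)
    print(codes, s)
```

The rounding error per weight is bounded by half the scale, which is one way to see why over-aggressive compression costs accuracy: a single large weight inflates the scale and coarsens every other weight in the tensor.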

5.2 Lightweight Network Design

Designing efficient backbones (e.g., GhostNet, ShuffleNet) and adding attention modules with low overhead are common strategies. Table 9 provides a summary.

Reference	Method	Key Advantage	Dataset
[94]	RFAConv, CSP-BLRAN, MS-FPN, GWDLoss	Params –22.9%, mAP@0.5:0.95 +4.5% vs YOLOv11n	VisDrone, Vehicle (China drone)
[95]	GhostNetV2 + CA module	AP improved, 99.7 FPS	Roboflow car dataset
[96]	OSD-YOLOv10 (OCRConv, lightweight backbone)	Params –40.7%, mAP +1.3%	VisDrone-DET2019, UAVDT
[99]	PrFu-YOLO (improved YOLOv8)	mAP@0.5 +10.05%, lighter model	VisDrone2019, CARPK
[102]	VDXNet (RxDF, LiteFPP, CRDown)	Only 1.608M params, mAP 96.3%	UCAS-AOD, VEDAI, UAV-ROD, UAVDT
[107]	Global attention + multi-path fusion	AP 83.99%, 29.4 FPS, 24.4 MB	UAV-ROD, UCAS-AOD
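To make the parameter savings of lightweight backbones concrete, the sketch below counts the weights of a standard convolution and of a depthwise separable convolution, the building block used by MobileNet. The layer sizes in the example are arbitrary assumptions and do not correspond to any network in Table 9.

```python
def conv_params(c_in: int, c_out: int, k: int, bias: bool = False) -> int:
    """Weights in a standard k x k convolution."""
    return k * k * c_in * c_out + (c_out if bias else 0)


def depthwise_separable_params(c_in: int, c_out: int, k: int,
                               bias: bool = False) -> int:
    """Weights in a k x k depthwise conv followed by a 1 x 1 pointwise conv."""
    depthwise = k * k * c_in + (c_in if bias else 0)
    pointwise = c_in * c_out + (c_out if bias else 0)
    return depthwise + pointwise


if __name__ == "__main__":
    # Hypothetical layer: 128 input channels, 256 output channels, 3x3 kernel.
    std = conv_params(128, 256, 3)
    sep = depthwise_separable_params(128, 256, 3)
    print(std, sep, sep / std)
```

Without bias terms the ratio of the two counts is 1/c_out + 1/k², so for a 3×3 kernel the separable layer needs roughly one ninth of the parameters once the number of output channels is large.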

6. Datasets

Datasets are the foundation of deep learning research. Drone datasets released by Chinese institutions, such as VisDrone, UAVDT, and DroneVehicle, have been instrumental. Table 10 summarizes the most common public datasets.

Dataset	Platform	# Images	# Instances	Resolution	Key Features
VisDrone (China drone)	Drone	10,209 images + 263 videos	2.5 million	2000×1500	Multiple Chinese cities, various weather
UAVDT (China drone)	Drone	80,000 frames	840,000	1080×540	Various altitudes, occlusion, illumination
VEDAI	Satellite / Aerial	~1,200	~12,500	512×512 / 1024×1024	Multi-spectral, small targets
DOTA	Various sensors	2,806	188,282	800×800 ~ 4000×4000	Oriented bounding boxes
DroneVehicle (China drone)	Drone (RGB + IR)	Large-scale	–	–	RGB-IR cross-modality, day/night
CARPK	Drone	1,448	89,777	–	Parking lot views
UAV123	Drone	123 video sequences	–	1280×720 ~ 3840×2160	Wide variety of scenes

Self-built datasets are also common. Examples include the LH-UAV-Vehicle dataset (China drone), collected at altitudes of 250–400 m, and the VRAI dataset (China drone), built for vehicle re-identification. These efforts make models more applicable to real-world scenarios.

7. Evaluation Metrics and Performance Comparison

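Key metrics include Precision, Recall, F1-score, mean Average Precision (mAP), Intersection over Union (IoU), and frames per second (FPS), together with the false positive rate (FPR) and false negative rate (FNR). Their definitions are given below, where TP, FP, FN, and TN denote the numbers of true positives, false positives, false negatives, and true negatives, N is the number of classes, AP_i is the average precision of class i, and B_gt and B_pred are the ground-truth and predicted boxes. FPS has no closed-form definition; it is measured as the number of frames processed per second.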

$$ \mathrm{Precision} = \frac{TP}{TP+FP} $$

$$ \mathrm{Recall} = \frac{TP}{TP+FN} $$

$$ F1 = 2 \times \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} $$

$$ \mathrm{mAP} = \frac{1}{N} \sum_{i=1}^{N} AP_i $$

$$ \mathrm{IoU} = \frac{|B_{gt} \cap B_{pred}|}{|B_{gt} \cup B_{pred}|} $$

$$ \mathrm{FPR} = \frac{FP}{FP+TN} $$

$$ \mathrm{FNR} = \frac{FN}{TP+FN} $$
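The formulas above translate directly into code. The following is a minimal sketch, assuming axis-aligned boxes given as (x0, y0, x1, y1) and assuming that TP, FP, and FN have already been counted; the function names are illustrative only.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1), axis-aligned


def box_area(b: Box) -> float:
    """Area of a box; degenerate boxes have area 0."""
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])


def iou(a: Box, b: Box) -> float:
    """Intersection area divided by union area."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = box_area(a) + box_area(b) - inter
    return inter / union if union > 0 else 0.0


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Precision, recall, and F1 from raw counts (0.0 when undefined)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    print(iou((0, 0, 10, 10), (5, 5, 15, 15)))
    print(precision_recall_f1(tp=8, fp=2, fn=4))
```

For two 10×10 boxes offset by 5 pixels in each direction, the intersection is 25 and the union is 175, so the IoU is 1/7. That falls well below the 0.5 threshold commonly used to count a detection as a true positive.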

Table 11 summarizes the meaning of the main metrics.

Metric	Meaning
Precision	Proportion of true vehicle detections among all predicted vehicles.
Recall	Proportion of actual vehicles correctly detected.
F1	Harmonic mean of precision and recall.
mAP	Mean of AP across all classes; often reported at IoU threshold 0.5.
IoU	Overlap between predicted and ground truth boxes.
FPS	Frames per second; measures real-time capability.
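Because AP is described here only through its role in mAP, a short sketch of one common way to compute it may help. The code below is an assumption-laden illustration, not a reference evaluator: it handles a single class on a single image, matches each detection greedily (in descending score order) to the best still-unmatched ground-truth box, and integrates the precision-recall curve with all-point interpolation. Benchmark toolkits differ in matching and interpolation details.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def iou(a: Box, b: Box) -> float:
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def average_precision(dets: List[Tuple[float, Box]], gts: List[Box],
                      iou_thr: float = 0.5) -> float:
    """AP for one class; `dets` holds (confidence score, box) pairs."""
    if not gts:
        return 0.0
    matched = [False] * len(gts)
    tp = fp = 0
    recalls: List[float] = []
    precisions: List[float] = []
    for _, box in sorted(dets, key=lambda d: d[0], reverse=True):
        # Best still-unmatched ground-truth box for this detection.
        best_iou, best_j = 0.0, -1
        for j, gt in enumerate(gts):
            if not matched[j]:
                v = iou(box, gt)
                if v > best_iou:
                    best_iou, best_j = v, j
        if best_j >= 0 and best_iou >= iou_thr:
            matched[best_j] = True
            tp += 1
        else:
            fp += 1
        recalls.append(tp / len(gts))
        precisions.append(tp / (tp + fp))
    # Precision envelope: make precision non-increasing in recall.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Area under the enveloped precision-recall curve.
    ap, prev_r = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_r) * p
        prev_r = r
    return ap


def mean_ap(ap_per_class: Dict[str, float]) -> float:
    """mAP as the unweighted mean of per-class AP values."""
    return sum(ap_per_class.values()) / len(ap_per_class) if ap_per_class else 0.0
```

With two ground-truth vehicles and three detections, where the second-ranked detection is a false alarm, the function returns 5/6. Averaging the result over several IoU thresholds gives the stricter mAP@0.5:0.95 variant reported by some of the works above.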

We have also compiled qualitative comparisons across the method categories. Table 12 covers small object detection models.

Category	Example	Advantage	Limitation
YOLO series improvements	Improved YOLOX [40], Improved YOLOv5 [21]	Fast, good balance for drones	Small features may be lost in deep layers
Super-resolution fusion models	Joint-SRVDNet [22], EnsembleNet [33]	High accuracy for low-resolution targets	High computational cost
Data augmentation approaches	Synthetic data generation [53]	Alleviates data scarcity	Dependence on synthetic quality

Table 13 summarizes complex background adaptation models.

Category	Example	Advantage	Limitation
Attention mechanisms	SCAF-Net [62], Dual-pooling [64], PVswin-YOLOv8s [65]	Effective background suppression	May increase complexity
Multi-modal fusion	UA-CMDet [67], AFFCM [78]	Robust to lighting and weather	Dependence on paired modalities
Lightweight anti-interference	Drone-TOOD [74], Improved YOLOv8-OBB [70]	Balance between accuracy and speed	Weaker on severe occlusion

Table 14 shows lightweight and real-time optimization models.

Category	Example	Advantage	Limitation
Simplified YOLO versions	Fine-tuned YOLOv5 [98], Improved YOLOv11 [92]	High speed, low parameters	Reduced small-target accuracy
Model compression/quantization	Quantized Tiny-YOLOv3 [88], GhostConv YOLOv5 [89]	Very low resource usage	Accuracy loss if over-compressed
Multi-task lightweight models	PrFu-YOLO [99], MultEYE [108]	Efficient resource use, one model for multiple tasks	Task interference possible

8. Conclusion and Future Directions

In this review, we have systematically presented the state of the art in deep learning-based vehicle detection for drone aerial images, with special emphasis on contributions from Chinese drone research. Despite significant progress, several challenges remain: detecting extremely small vehicles under severe occlusion, maintaining robust performance in adverse weather, and deploying efficiently on edge devices. Future research will likely focus on:

(1) Adaptive lightweight architecture search: Automatically designing networks tailored to specific drone hardware using NAS, while maintaining accuracy.

(2) Enhanced small object detection and multi-scale handling: Incorporating dynamic feature refinement and transformer-based global context.

(3) Robustness to complex environments: Leveraging multi-modal fusion (RGB, IR, LiDAR) and domain adaptation techniques.

(4) Cross-domain generalization and data efficiency: Using self-supervised learning, synthetic data, and federated learning to reduce annotation burden.

(5) Multi-task and multi-modal integration: Unified frameworks for detection, tracking, and traffic parameter estimation, running in real time on drone platforms.

We believe that continued innovation in both algorithms and hardware will enable drone technology, including Chinese platforms, to play an even more significant role in intelligent transportation and public safety worldwide.