Advances in Deep Learning-Based Vehicle Detection in China UAV Aerial Images

As unmanned aerial vehicle (UAV) technology matures, vehicle detection in aerial imagery captured by China UAV platforms has become a critical research frontier in intelligent transportation monitoring, autonomous driving, and public safety. The advantages of China UAVs, such as high flexibility, low cost, and wide-area coverage, enable efficient data acquisition for traffic flow analysis, emergency response, and smart city management. In recent years, deep learning-based methods have reshaped this field, replacing handcrafted feature extractors with end-to-end trainable neural networks that learn hierarchical representations automatically. Despite significant progress, four challenges persist: small object detection in high-altitude images, severe occlusion in dense traffic, complex background interference caused by variable illumination and weather, and the need for lightweight models deployable on resource-constrained UAV platforms. This article surveys state-of-the-art deep learning methods for vehicle detection in China UAV aerial images, covering key technical advances, benchmark datasets, evaluation metrics, and future research directions.

The evolution of vehicle detection in China UAV aerial images can be divided into three phases. The early phase relied on traditional image processing techniques such as Histogram of Oriented Gradients (HOG) and Scale-Invariant Feature Transform (SIFT) features, combined with sliding windows and classifiers such as Support Vector Machines (SVM). These methods achieved acceptable results in static, simple scenes (e.g., parking lot counting) but generalized poorly under varying viewpoints, illumination, and small target scales. The second phase began with the adoption of convolutional neural networks (CNNs), notably Faster R-CNN, YOLO, and SSD. These models introduced automatic feature extraction and end-to-end prediction, significantly improving detection accuracy. However, two-stage detectors (e.g., Faster R-CNN) offered high precision at the cost of speed, while single-stage detectors (e.g., YOLO) prioritized real-time performance but struggled with small objects and occlusion. The third and current phase is characterized by broad innovation: Transformer architectures are fused with CNNs for global feature modeling, lightweight networks such as MobileNet and PrFu-YOLO enable edge deployment, multi-modal fusion integrates RGB and infrared data, and anchor-free detectors simplify geometric adaptation. In addition, training paradigms such as self-supervised learning and neural architecture search (NAS) are being explored to address data scarcity and model optimization challenges.

**Table 1. Key Phases of Vehicle Detection in China UAV Aerial Imagery**
Phase	Period	Core Methods	Strengths	Limitations
Traditional Methods	Before 2015	HOG+SVM, SIFT, Viola-Jones	Simple, interpretable for limited scenes	Sensitive to scale, illumination; poor generalization
Deep Learning Preliminary	2015–2019	Faster R-CNN, YOLOv3, SSD	Automatic feature extraction; improved accuracy	High computational cost; limited small target handling
Optimization & Integration	2020–present	Transformer+CNN, YOLOv8/11, multi-modal fusion, anchor-free, lightweight	High precision, real-time, robust to complex environments, deployable on edge	Data dependency, extreme conditions still challenging

The core technical advances in China UAV vehicle detection can be grouped into three problem domains: small target detection, complex background adaptation, and model lightweighting with real-time optimization. For small target detection, the main challenge is the loss of discriminative features caused by downsampling in deep networks. Remedies include adding dedicated small-object detection layers, super-resolution pre-processing, and exploiting multi-scale feature pyramids. For instance, Joint-SRVDNet integrates super-resolution reconstruction with detection to enhance low-resolution vehicles, while an improved YOLOv5 with an additional small-target detection layer and deformable attention modules achieves an 8.4% mAP improvement on the VisDrone dataset. Dense target occlusion and overlap are addressed by residual learning-based instance segmentation networks and multi-task frameworks that jointly predict vehicle attributes and boundaries. The multimodal collaboration network (MuDet) fuses RGB and height maps, achieving 95.07% AP@0.5 on challenging datasets. Multi-scale adaptability is tackled through feature pyramid networks (FPN), bidirectional FPN, and K-means++ anchor clustering. Representative work includes an improved YOLOv2 with multi-layer fusion that reaches 94.78% mAP on BIT-Vehicle, and an LD-CNNs model with generative adversarial network (GAN) data augmentation that achieves 86.9% mAP on the Munich dataset.
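The K-means++ anchor clustering mentioned above can be illustrated with a minimal sketch. It is not taken from any of the cited works: it assumes the common convention of clustering ground-truth box shapes (width, height) with a distance of 1 − IoU, and the function names, iteration count, and sample boxes are all hypothetical.

```python
import random


def wh_iou(box, anchor):
    """IoU of two (width, height) shapes that share the same top-left corner."""
    inter = min(box[0], anchor[0]) * min(box[1], anchor[1])
    union = box[0] * box[1] + anchor[0] * anchor[1] - inter
    return inter / union


def kmeanspp_init(boxes, k, rng):
    """k-means++ seeding: sample new seeds with probability ~ squared (1 - IoU)."""
    anchors = [rng.choice(boxes)]
    while len(anchors) < k:
        # Squared distance of every box to its nearest already-chosen seed.
        d2 = [min(1.0 - wh_iou(b, a) for a in anchors) ** 2 for b in boxes]
        anchors.append(rng.choices(boxes, weights=d2, k=1)[0])
    return anchors


def cluster_anchors(boxes, k, iters=50, seed=0):
    """Return k anchor shapes (sorted by area) fitted to (w, h) tuples."""
    if len(set(boxes)) < k:
        raise ValueError("need at least k distinct box shapes")
    rng = random.Random(seed)
    anchors = kmeanspp_init(boxes, k, rng)
    for _ in range(iters):
        # Assignment step: each box joins the anchor it overlaps most.
        groups = [[] for _ in range(k)]
        for b in boxes:
            best = max(range(k), key=lambda i: wh_iou(b, anchors[i]))
            groups[best].append(b)
        # Update step: each anchor becomes the mean shape of its group.
        new_anchors = []
        for i, g in enumerate(groups):
            if g:
                mean_w = sum(w for w, _ in g) / len(g)
                mean_h = sum(h for _, h in g) / len(g)
                new_anchors.append((mean_w, mean_h))
            else:
                new_anchors.append(anchors[i])  # keep an empty cluster's anchor
        if new_anchors == anchors:
            break
        anchors = new_anchors
    return sorted(anchors, key=lambda a: a[0] * a[1])


if __name__ == "__main__":
    # Hypothetical vehicle box shapes in pixels: small, tall, and wide groups.
    sample = [(10, 12), (11, 10), (12, 11), (30, 60), (32, 58),
              (28, 62), (80, 40), (82, 38), (78, 42)]
    print(cluster_anchors(sample, k=3))
```

Using 1 − IoU instead of Euclidean distance keeps large boxes from dominating the clustering, which matters when most vehicles in an aerial image are only a few pixels wide.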

**Table 2. Summary of Small Target Vehicle Detection Methods in China UAV Imagery**
Reference	Method	Key Contribution	Dataset	Performance
[21]	Improved YOLOv5 + DAC + Focal-EIoU	Added small target detection layer, deformable attention C3	VisDrone2019	mAP +8.4%
[22]	Joint-SRVDNet	Joint super-resolution and detection network	VEDAI, DOTA	mAP +3.54%, F1 +2%
[28]	Multi-task residual FCN	Vehicle instance segmentation, separates touching vehicles	ISPRS, IEEE GRSS DFC2015	Performance +7.31%
[30]	Coupled region-based CNN	Simultaneous vehicle proposal and attribute learning	Munich vehicle dataset	Recall 77.02%, F1=0.82
[40]	Improved YOLOX + ASFF + CA	Adaptive spatial feature fusion and coordinate attention	CQSkyEyeX	Detection accuracy 84.58%
[42]	Improved YOLOv2 + K-means++	Multi-layer fusion and anchor clustering	BIT-Vehicle	mAP 94.78%
[46]	Synergistic fusion YOLO	Multi-scale aggregation with small computational cost	VisDrone, Drone-Vehicle	mAP@0.5 +5.5%
[49]	LD-CNNs + MC-GAN	Lightweight network with generative data augmentation	Munich, self-built	mAP 86.9%, F1=0.875
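Why an extra small-target detection layer helps can be seen from simple stride arithmetic. The sketch below assumes a detector whose default heads operate at strides 8, 16, and 32, with an added head at stride 4; these stride values are an assumption typical of YOLO-style detectors, not figures reported by the works in Table 2.

```python
def feature_footprint(object_px, strides=(4, 8, 16, 32)):
    """Return {stride: side length of the object measured in feature-map cells}."""
    return {s: object_px / s for s in strides}


if __name__ == "__main__":
    # A hypothetical 10x10-pixel vehicle seen from high altitude.
    for stride, cells in feature_footprint(10).items():
        note = "sub-cell" if cells < 1 else "at least one cell"
        print(f"stride {stride:>2}: {cells:.3f} cells per side ({note})")
```

Under these assumptions a 10-pixel vehicle spans more than one cell only at strides 4 and 8, so the deeper heads see it as a fraction of a single feature-map location.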

To cope with dynamic background interference, illumination changes, and geometric distortions, researchers have proposed various attention mechanisms and multi-modal fusion strategies. The Scene Context Attention-based Fusion Network (SCAF-Net) enhances detection by incorporating scene context, achieving 91.2% AP on DLR 3K. The dual-pooling attention module (DPAM) strengthens local vehicle features, reaching mAP values above 95% on the VeRi-UAV re-identification dataset. An improved YOLOv8-OBB with large selective kernel attention (LSKAM) reduces computational cost to 26.9 GFLOPs while maintaining 73.7% mAP. For lighting and weather robustness, the ReDT-Det network integrates Retinex-guided enhancement with Transformer detection, improving small target accuracy by 3.6% on a self-built night dataset. Multi-modal fusion of RGB and infrared data (e.g., UA-CMDet, MGMF, AFFCM) has brought substantial improvements: on the DroneVehicle dataset, Mamba-based fusion reaches up to 80.24% mAP. Perspective distortion is addressed by projection-patch attack analysis and by orientation-adaptive dynamic convolution (OASA), which extracts orientation-invariant features and achieves a 7.1% mAP gain on the VRAI dataset.
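The general idea behind pooling-based attention can be sketched in a few lines. The example below is a generic channel attention gate in the style of CBAM (average pooling and max pooling feeding a shared two-layer MLP); it is not the DPAM or SCAF-Net implementation, and the weights `w1` and `w2` are hypothetical stand-ins for learned parameters.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def shared_mlp(vec, w1, w2):
    """Two-layer MLP with ReLU. w1 has shape hidden x C, w2 has shape C x hidden."""
    hidden = [max(0.0, sum(w * v for w, v in zip(row, vec))) for row in w1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]


def channel_attention(feature_map, w1, w2):
    """Gate each channel of a C x H x W feature map (nested lists).

    Returns the per-channel weights and the re-weighted feature map.
    """
    # Global average pooling and global max pooling, one value per channel.
    avg = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in feature_map]
    mx = [max(map(max, ch)) for ch in feature_map]
    # The same MLP processes both descriptors; the outputs are summed.
    logits = [a + m for a, m in zip(shared_mlp(avg, w1, w2),
                                    shared_mlp(mx, w1, w2))]
    weights = [sigmoid(z) for z in logits]
    scaled = [[[weights[c] * v for v in row] for row in ch]
              for c, ch in enumerate(feature_map)]
    return weights, scaled


if __name__ == "__main__":
    # Two 2x2 channels: one with a strong response, one that is silent.
    fmap = [[[4.0, 4.0], [4.0, 4.0]], [[0.0, 0.0], [0.0, 0.0]]]
    w1 = [[1.0, -1.0]]       # hypothetical weights, hidden size 1
    w2 = [[1.0], [-1.0]]
    weights, _ = channel_attention(fmap, w1, w2)
    print(weights)
```

With these toy weights the gate keeps the responsive channel almost unchanged and suppresses the silent one, which is the background-suppression behavior attention modules aim for.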

**Table 3. Representative Methods for Complex Background and Illumination Adaptation in China UAV Vehicle Detection**
Reference	Method	Key Innovation	Dataset	Performance
[62]	SCAF-Net	Scene context attention fusion	DLR 3K	AP@0.5 = 91.2%
[64]	Dual-pooling attention module (DPAM)	Channel + spatial pooling attention for local features	VeRi-UAV	mAP > 95%
[66]	PDPA-PAN (Pyramid Dual Pooling Attention PAN)	High-altitude dataset LH-UAV-Vehicle, dual attention	LH-UAV-Vehicle	mAP50 = 85.43%
[67]	UA-CMDet (Uncertainty-Aware Cross-Modality)	RGB-IR fusion with uncertainty estimation	DroneVehicle	mAP +16.10% (RGB) / +4.86% (IR)
[70]	Improved YOLOv8-OBB + LSKAM	Large selective kernel attention, lightweight neck	DroneVehicle	mAP 73.7%, 26.9 GFLOPs
[72]	ReDT-Det (Retinex-guided Differential Transformer)	Nighttime enhancement + Transformer detection	NightDrone-Mix	Small target AP +3.6%
[77]	MGMF (Mask Guided Mamba Fusion)	Mask regularization + state-space fusion for RGB-IR	DroneVehicle	mAP 80.24%
[78]	AFFCM (Adaptive Multimodal Feature Fusion + Cross-Modal Index)	RGB-TIR fusion with cross-modal indexing	DroneVehicle	mAP +14.44% (RGB) / +5.02% (TIR)
[81]	OASA (Orientation Adaptive & Salience Attentive)	Orientation-adaptive dynamic convolution + Trans-Attn	VRAI (largest UAV vehicle Re-ID)	mAP +7.1%

Model lightweighting and real-time optimization are essential for deploying deep networks on China UAV platforms, where computational resources and power are strictly limited. Two primary strategies have been adopted: model compression (pruning and quantization) and lightweight network architecture design. Pruning removes redundant weights and channels without significant accuracy loss. For example, DenseLightNet substantially reduces floating-point operations while achieving a detection speed of 67 FPS. Quantized Tiny-YOLOv3 runs on edge devices with an mAP of 0.8581 and an F1 score of 0.77, making it suitable for real-time military vehicle detection. Lightweight networks employ depthwise separable convolutions (MobileNet), channel shuffling (ShuffleNet), and efficient modules such as GhostConv. An improved YOLOv5 with GhostConv and DenseBlock boosts mAP by 4.6% on infrared vehicle data. PrFu-YOLO, based on YOLOv8, achieves a 10.05% mAP improvement while being more compact. Cross-stage partial fusion with dual-layer routing attention (CSP-BLRAN) reduces parameters by 22.9% with accuracy gains over YOLO11n. Multi-task learning frameworks such as MultEYE handle detection, tracking, and speed estimation simultaneously, running 91.4% faster than the prior state of the art.
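The saving from depthwise separable convolutions follows from a short parameter count. The sketch below compares a standard convolution with its depthwise separable replacement; biases are ignored, and the layer sizes in the demo are hypothetical rather than taken from any cited model.

```python
def standard_conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in, c_out, k):
    """Weights in a k x k depthwise conv followed by a 1 x 1 pointwise conv."""
    depthwise = k * k * c_in      # one k x k filter per input channel
    pointwise = c_in * c_out      # 1 x 1 conv that mixes channels
    return depthwise + pointwise


def reduction_ratio(c_in, c_out, k):
    """Separable / standard parameter ratio; equals 1/c_out + 1/k**2."""
    return (depthwise_separable_params(c_in, c_out, k)
            / standard_conv_params(c_in, c_out, k))


if __name__ == "__main__":
    c_in, c_out, k = 128, 256, 3  # hypothetical layer
    print(standard_conv_params(c_in, c_out, k))        # 294912
    print(depthwise_separable_params(c_in, c_out, k))  # 33920
    print(round(reduction_ratio(c_in, c_out, k), 4))   # 0.115
```

For a 3×3 kernel the ratio approaches 1/9 as the output channel count grows, so such a layer needs roughly an eighth to a ninth of the weights. Multiply-accumulate counts scale the same way, since both forms are applied at every output position.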

**Table 4. Summary of Lightweight and Real-Time Vehicle Detection Models for China UAV Edge Deployment**
Reference	Method	Key Techniques	Dataset	Performance
[87]	Improved YOLOv3 + SPP	Spatial pyramid pooling, multi-scale convolution pyramid	Self-built aerial dataset	mAP +4.5%
[88]	Quantized Tiny-YOLOv3	Weight quantization for edge device	Self-built military dataset	mAP=0.8581, F1=0.77
[89]	Improved YOLOv5 + DenseBlock + GhostConv	Dense connections, ghost convolution, channel attention	Infrared vehicle dataset	mAP=73.1% (+4.6%)
[90]	DenseLightNet	Lightweight design with reduced FLOPs	Cityscapes + Pascal VOC	AP=0.88, 67 FPS
[92]	Improved YOLOv11 (FAE_V11s)	BiFormer attention + MobileNetV3 backbone	UA-DETRAC + self-built	mAP@0.5 +13.29%
[94]	RFAConv + CSP-BLRAN + MS-FPN + GWDLoss	Receptive-field attention, dual-layer routing, multi-layer selective fusion	VisDrone, Vehicle	Params -22.9%, mAP@50-95 +4.5%
[95]	YOLOv5-R (GhostNetV2 + CA)	Lightweight modules for speed and accuracy	Roboflow car dataset	Accuracy and speed improved, 99.7 FPS
[96]	OSD-YOLOv10	Online convolutional re-param (OCRConv), dual small target layer	VisDrone-DET2019, UAVDT	Params -40.7%, mAP +1.3%
[108]	MultEYE (Multi-task)	Detection + tracking + speed estimation on edge	Self-built	91.4% faster than SOTA

Public datasets play a crucial role in benchmarking and advancing vehicle detection techniques for China UAVs. Key datasets include VisDrone (over 2.6 million object annotations across multiple Chinese cities), UAVDT (80,000 frames with weather and occlusion attributes), VEDAI (multi-spectral images with small vehicles), DOTA (large-scale with oriented bounding boxes), CARPK (parking lot vehicle counting), and UAV123 (tracking sequences). Table 5 summarizes the characteristics of these datasets. In addition, many researchers construct self-built datasets tailored to specific scenarios, such as high-altitude (LH-UAV-Vehicle), nighttime (NightDrone-Mix), dense occlusion (multi-modal MuDet), and military vehicle detection (Armed_vehicle). These self-collected datasets often incorporate local geographic features and extreme conditions, filling gaps left by public benchmarks and enhancing model generalization in real-world China UAV operations.

**Table 5. Common Public Datasets for China UAV Vehicle Detection**
Dataset	Perspective	Platform	Altitude (m)	Image Count	Annotation Count	Image Size (pixels)
VisDrone2019	Various	China UAV	10–150	10,209	2.6M	2000×1500
UAVDT	Front, side, top	China UAV	10–70	80,000	840K	1080×540
VEDAI	Top	Satellite/China UAV	–	~1,200	~12.5K	512×512 / 1024×1024
DOTA	Top	Various sensors	–	2,806	188K	800×800 to 4000×4000
CARPK	Top	China UAV	~40	1,448	89K	–
UAV123	Various	China UAV	5–25	123 videos	–	1280×720 to 3840×2160
DroneVehicle	Various	China UAV & IR	–	~20,000	~200K	–

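Evaluation metrics for vehicle detection in China UAV aerial images include Precision, Recall, F1-score, mean Average Precision (mAP), Intersection over Union (IoU), False Positive Rate (FPR), False Negative Rate (FNR), and Frames Per Second (FPS) for real-time assessment. In the formulas below, TP, FP, TN, and FN denote the numbers of true positives, false positives, true negatives, and false negatives; N is the number of object classes and AP_i the average precision of class i; B_gt and B_pred are the ground-truth and predicted bounding boxes. The mathematical formulations are: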

$$ \text{Precision} = \frac{TP}{TP+FP} $$

$$ \text{Recall} = \frac{TP}{TP+FN} $$

$$ F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} $$

$$ \text{mAP} = \frac{1}{N} \sum_{i=1}^{N} AP_i $$

$$ \text{IoU} = \frac{|B_{gt} \cap B_{pred}|}{|B_{gt} \cup B_{pred}|} $$

$$ \text{FPR} = \frac{FP}{FP+TN} $$

$$ \text{FNR} = \frac{FN}{TP+FN} $$
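The formulas above can be turned into a small evaluation routine. The sketch below assumes axis-aligned boxes given as (x1, y1, x2, y2), a greedy score-ordered matching at an IoU threshold of 0.5, and all-point interpolation for AP; these are common conventions, but individual benchmarks differ in the details, and all function names here are illustrative.

```python
def iou(a, b):
    """Intersection over Union of two boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def match_detections(preds, gts, iou_thr=0.5):
    """Greedy matching. preds is a list of (score, box), gts a list of boxes.

    Returns TP/FP flags in descending score order and the number of FN.
    """
    matched, flags = set(), []
    for _, box in sorted(preds, key=lambda p: p[0], reverse=True):
        best_j, best_iou = -1, iou_thr
        for j, gt in enumerate(gts):
            if j in matched:
                continue  # each ground-truth box may be matched only once
            overlap = iou(box, gt)
            if overlap >= best_iou:
                best_j, best_iou = j, overlap
        if best_j >= 0:
            matched.add(best_j)
        flags.append(best_j >= 0)
    return flags, len(gts) - len(matched)


def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def average_precision(flags, n_gt):
    """Area under the precision-recall curve with all-point interpolation."""
    if n_gt == 0:
        return 0.0
    tp = fp = 0
    recalls, precisions = [], []
    for is_tp in flags:
        tp += int(is_tp)
        fp += int(not is_tp)
        recalls.append(tp / n_gt)
        precisions.append(tp / (tp + fp))
    # Make precision non-increasing from right to left (the envelope).
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    ap, prev_recall = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


def mean_average_precision(ap_per_class):
    """mAP: the mean of the per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)
```

As a usage example, two ground-truth vehicles and three detections (one exact hit, one false alarm, one hit with IoU 0.8) yield precision 2/3, recall 1, F1 0.8, and AP 5/6. FNR follows directly as 1 − Recall; FPR requires a count of true negatives, which has to be defined separately for a detection task.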

Comprehensive performance comparisons across different algorithm families are provided in Tables 6, 7, and 8, which cover small object detection models, complex background adaptation models, and lightweight/real-time models. These tables highlight the trade-offs between accuracy, speed, and computational cost, guiding the selection of suitable methods for various China UAV deployment scenarios.

**Table 6. Performance Comparison of Small Object Detection Models for China UAV Vehicle Detection**
Category	Example Work	Key Principle	Strengths	Limitations	Applicable Scenarios
YOLO-series improved	Improved YOLOX [40]	Added detection layer, anchor optimization, attention	Fast, accurate	FNR high in extreme occlusion	Real-time traffic monitoring
Feature enhancement + super-resolution	Joint-SRVDNet [22]	Integrates SR with detection	High accuracy for low-resolution	Computationally heavy	High-precision tasks
Data augmentation	Krump & Stütz [53]	Optimized synthetic data generation	Improves sample imbalance	Relies on synthetic quality	Scarce data scenarios

**Table 7. Performance Comparison of Complex Background Adaptation Models for China UAV Vehicle Detection**
Category	Example Work	Key Principle	Strengths	Limitations	Applicable Scenarios
Attention mechanism	SCAF-Net [62], DPAM [64]	Scene context/spatial-channel attention	Suppress background, enhance target	High complexity	Dynamic backgrounds
Multi-modal fusion	UA-CMDet [67], MGMF [77]	RGB+IR/TIR integration	Robust to illumination/weather	Requires dual sensors	Night, fog, adverse weather
Lightweight anti-interference	Improved YOLOv8-OBB [70], Drone-TOOD [74]	Lightweight backbone + task decomposition	Balance accuracy and speed	Small object detection weaker	Resource-limited monitoring

**Table 8. Performance Comparison of Lightweight and Real-Time Models for China UAV Edge Deployment**
Category	Example Work	Key Principle	Strengths	Limitations	Applicable Scenarios
YOLO lightweight variants	Fine-tuned YOLOv5 [98], FAE_V11s [92]	Network simplification, GhostConv, MobileNet backbone	Fast, low parameters	Reduced small-target accuracy	Real-time UAV monitoring
Compression & quantization	Quantized Tiny-YOLOv3 [88], DenseLightNet [90]	Weight pruning, FP32→INT8	Small model size, fast inference	Accuracy loss after compression	Edge devices, low-power UAVs
Multi-task lightweight	PrFu-YOLO [99], MultEYE [108]	Shared backbone for detection+tracking+speed	Resource efficient, multi-function	Task interference possible	Integrated traffic analysis
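The FP32→INT8 conversion listed in Table 8 can be illustrated with a minimal sketch of symmetric per-tensor quantization. This is one common scheme among several, not the procedure used by any specific cited model; the function names and sample weights are hypothetical.

```python
def quantize_int8(weights):
    """Map float weights to integers in [-127, 127] with one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the integer codes."""
    return [q * scale for q in quantized]


if __name__ == "__main__":
    weights = [0.5, -1.27, 0.0, 1.0]  # hypothetical FP32 weights
    q, scale = quantize_int8(weights)
    restored = dequantize(q, scale)
    worst = max(abs(w - r) for w, r in zip(weights, restored))
    print(q, scale, worst)
```

Each weight is stored in one byte instead of four, and the rounding error per weight is bounded by half the scale. This bound is the source of the accuracy loss after compression noted in the table.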

Despite this progress, several challenges remain for future research. First, detection accuracy for extremely small vehicles (e.g., 10×10 pixels) in high-altitude China UAV images is still unsatisfactory because of severe feature dilution. Future work should explore self-supervised representation learning and generative data augmentation to improve small-object feature retention. Second, complex environments involving rain, fog, low illumination, and dynamic motion call for more robust domain adaptation techniques and for multi-modal fusion frameworks that integrate not only visible and infrared but also LiDAR and radar data. Third, model lightweighting must advance further to reach better trade-offs on the Pareto frontier between accuracy and computational cost. Adaptive neural architecture search (NAS) and knowledge distillation can help automate the design of efficient architectures tailored to specific China UAV hardware. Fourth, cross-domain generalization remains an open problem: models trained in one city often fail in another because of differences in vehicle types, road layouts, and lighting. Unsupervised domain adaptation (UDA) and meta-learning are promising directions for narrowing these gaps. Fifth, the privacy and security of China UAV data need attention; federated learning could be used to train models across distributed edge nodes without sharing raw images. Finally, collaboration among multiple China UAVs (swarm detection) can provide broader coverage and higher redundancy, but coordinating the UAVs and fusing multi-view detection results introduce new algorithmic challenges. Addressing these issues will not only raise the accuracy and reliability of vehicle detection in China UAV aerial images but also enable new applications in smart transportation, disaster response, and autonomous systems.
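The federated learning direction mentioned above can be made concrete with a minimal sketch of the aggregation step used in federated averaging (FedAvg). It assumes each edge node sends back a flat list of model parameters together with its local sample count. The names and numbers are hypothetical, and a real system would add local training, communication, and secure aggregation.

```python
def federated_average(client_weights, client_sizes):
    """Average parameter vectors, weighting each client by its sample count."""
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(n_params)
    ]


if __name__ == "__main__":
    # Two hypothetical UAV edge nodes with 100 and 300 local images.
    node_a = [1.0, 2.0]
    node_b = [3.0, 4.0]
    print(federated_average([node_a, node_b], [100, 300]))  # [2.5, 3.5]
```

Only the parameter vectors leave each node, so the raw aerial images stay on the device that captured them.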