A High-Precision YOLO-Based Model for Small UAV Object Detection

With the extensive adoption of Unmanned Aerial Vehicles (UAVs) across sectors such as agriculture, logistics, surveillance, and disaster response in China and globally, the demand for efficient and accurate target detection technology has escalated. UAVs, renowned for their high mobility and flexibility, can access areas challenging for traditional methods, thereby enabling novel applications. Target detection, a cornerstone of computer vision, underpins autonomous UAV navigation and environmental perception, directly influencing operational efficacy in real-world scenarios.

Despite significant advancements propelled by deep learning, target detection from the UAV perspective remains fraught with challenges: minuscule object sizes, low resolution, dynamic and cluttered backgrounds, and abrupt viewpoint variations. Furthermore, the computational resources on UAV platforms are often constrained, making the deployment of complex detection models impractical. Consequently, developing small-object detection algorithms that balance precision and efficiency is paramount for enhancing the utility of UAV applications.

Related Work

Contemporary object detection algorithms are broadly categorized into two-stage and one-stage detectors. Two-stage methods, exemplified by the R-CNN series, first generate region proposals and then perform classification and regression. While achieving high accuracy, they suffer from high computational cost and slow inference speed, making them less suitable for real-time UAV operations. Representative models include Faster R-CNN, Mask R-CNN, and Cascade R-CNN.

One-stage detectors, such as YOLO and SSD, frame detection as a single regression problem, offering superior speed and lower resource consumption—a critical advantage for UAV deployment. The evolution from YOLOv1 to YOLOv8 has consistently focused on improving the speed-accuracy trade-off through architectural innovations like multi-scale prediction, residual connections, CSPDarknet, and advanced feature pyramids like PANet. However, as model complexity increases, performance bottlenecks in detecting small, dense objects persist.

Specialized deep learning models for UAV small object detection have emerged. For instance, YOLODrone added extra detection layers for multi-scale objects but increased complexity. SyNet improved accuracy through multi-stage feature fusion at the cost of higher training overhead. RSOD leveraged shallow features and attention mechanisms to enhance small object perception but increased model size. More recent approaches like MS-YOLOv7 and DC-YOLOv8 integrate Transformer architectures and refined feature fusion networks, yielding better accuracy but often at the price of parameter explosion and deployment difficulties. A common thread among existing methods is the trade-off between improved small-target performance and increased computational burden, which remains a significant hurdle for resource-limited UAV platforms.

The Proposed Model: YOLO-LiteMax

Overview

The proposed YOLO-LiteMax model is an enhanced version based on the YOLOv8 architecture. Its overall framework comprises a backbone for multi-scale feature extraction, an improved neck for deep feature fusion and scale enhancement, and a novel detection head for final classification and regression predictions. The input image is processed by the backbone, and the extracted multi-scale features are fed into the refined neck structure. Finally, the shared convolutional detection head generates the output predictions.

Selective Convolution Block (SCB)

UAV aerial imagery, especially from deployments in complex urban or natural environments, often contains vast areas of intricate background, while critical small targets occupy only tiny regions. Standard convolutional networks processing such images generate a plethora of redundant features describing the background. This not only increases computational overhead but can also drown out the weak signals from small targets. To address this, we introduce the Selective Convolution Block (SCB), designed to process feature information selectively for enhanced efficiency and precision.

The core idea of SCB is inspired by partial convolution, as proposed in FasterNet. The design philosophy is to perform spatial feature extraction on only a subset of input channels while leaving the remaining channels unchanged. For a partial convolution with output feature map dimensions \( h \times w \), kernel size \( k \times k \), and \( p \cdot c \) processed channels, where \( p \) is the fraction of channels convolved and the input and output channel counts are taken as equal, the FLOPs can be approximated as:

$$ \text{FLOPs} \approx h \cdot w \cdot k^2 \cdot (p \cdot c)^2 $$

For instance, with \( p = 1/4 \), the theoretical computation is only \( (1/4)^2 = 1/16 \) of that of a full convolution, while the information from the non-convolved channels is preserved to ensure feature integrity.
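The arithmetic above can be checked with a short script; the feature-map size and channel count below are illustrative values, not taken from the model:

```python
# FLOPs of a standard conv vs. a partial conv that processes only a fraction p
# of the channels (input and output channel counts taken as equal, as in the
# approximation above).
def conv_flops(h, w, k, c_in, c_out):
    return h * w * k * k * c_in * c_out

def pconv_flops(h, w, k, c, p):
    cp = int(c * p)  # number of channels actually convolved
    return h * w * k * k * cp * cp

full = conv_flops(40, 40, 3, 256, 256)
partial = pconv_flops(40, 40, 3, 256, 0.25)
ratio = partial / full  # (1/4)^2 = 1/16
```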

In the proposed SCB, the input feature map first passes through a CBS module (Conv-BatchNorm-SiLU) for channel expansion. It is then split evenly into two parts. One part is directly retained as an identity path. The other part is fed into a stack of FasterNet blocks for selective feature extraction. Finally, features from both paths are concatenated along the channel dimension and integrated via another CBS module. This mechanism efficiently filters redundant background information while strengthening the identification of features crucial to small-target detection in UAV imagery.
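A minimal sketch of this split-process-merge layout, with NumPy stand-ins for the CBS and FasterNet blocks (a random 1×1 channel mixing plus SiLU substitutes for the real learned convolutions; the sketch demonstrates the data flow and shapes, not learned behaviour):

```python
import numpy as np

rng = np.random.default_rng(0)

def cbs(x, out_ch):
    """Stand-in for Conv-BN-SiLU: 1x1 channel mixing followed by SiLU."""
    w = rng.standard_normal((out_ch, x.shape[0])) * 0.1
    y = np.einsum('oc,chw->ohw', w, x)
    return y / (1.0 + np.exp(-y))  # SiLU(y) = y * sigmoid(y)

def scb(x, channels):
    x = cbs(x, channels)                    # CBS channel expansion
    ident, proc = np.split(x, 2, axis=0)    # even split into two halves
    proc = cbs(proc, proc.shape[0])         # stand-in for stacked FasterNet blocks
    fused = np.concatenate([ident, proc], axis=0)  # identity path + processed path
    return cbs(fused, channels)             # integrating CBS

out = scb(rng.standard_normal((64, 32, 32)), 128)
```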

Small Target Scale Sequence Fusion (STSSF)

Objects in UAV imagery exhibit drastic scale variations. Traditional feature fusion structures suffer from two main flaws: 1) they rely on simple concatenation or addition between adjacent layers, failing to establish deep cross-scale correlations, which can lead to missed detections; 2) the upsampling process tends to lose the detailed information carried by high-resolution shallow features, which is particularly detrimental for small object localization. To overcome these issues, we design a novel STSSF structure. It constructs an efficient cross-scale information flow through two core modules, 3D Convolution Scale Fusion (C3DSF) and Scale Aware Concatenation (SAC), and introduces a higher-resolution P2 feature layer.

To tackle the lack of deep cross-scale correlation, we design the C3DSF module. It aligns the channel numbers and spatial dimensions of feature maps from different scales (e.g., \( F_l, F_m, F_s \)), stacks them along a new depth dimension, and applies 3D convolution to extract scale-sequence features. The process can be formalized as follows. First, features are aligned and processed:

$$ F'_i = \text{Process}(F_i) $$

They are then stacked along the depth dimension:

$$ \text{Stacked\_Input} = \text{Concat3D}(F'_l, F'_m, F'_s) $$

Finally, a unified 3D convolutional operation is applied:

$$ \text{Output} = \text{ReLU}(\text{BatchNorm3d}(\text{Conv3d}(\text{Stacked\_Input}))) $$

By extending and stacking features along the depth dimension, this module enables more effective information fusion across different scales, analogous to how 3D convolutions capture temporal context in video sequences.
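The stack-then-convolve idea can be sketched as follows. Nearest-neighbour resizing stands in for the alignment step, and the 3D convolution is reduced to a depth-only kernel (1×1 spatial) with BatchNorm omitted for brevity; all sizes and weights are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

def resize_nn(x, size):
    """Nearest-neighbour resize of a (C, H, W) map to (C, size, size)."""
    c, h, w = x.shape
    ri = np.arange(size) * h // size
    ci = np.arange(size) * w // size
    return x[:, ri][:, :, ci]

def c3dsf(f_l, f_m, f_s, size=20):
    # Align spatial dims (channel counts assumed already matched) and stack
    # along a new depth dimension: (C, D=3, H, W)
    stacked = np.stack([resize_nn(f, size) for f in (f_l, f_m, f_s)], axis=1)
    # Minimal "3D conv": a kernel spanning the full depth axis, 1x1 spatially,
    # i.e. a learned weighting across the three scales per channel
    w = rng.standard_normal((stacked.shape[0], 3)) * 0.1
    out = np.einsum('cdhw,cd->chw', stacked, w)
    return np.maximum(out, 0.0)  # ReLU

f_l = rng.standard_normal((64, 40, 40))   # large-resolution feature map
f_m = rng.standard_normal((64, 20, 20))   # medium
f_s = rng.standard_normal((64, 10, 10))   # small
fused = c3dsf(f_l, f_m, f_s)
```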

To address the loss of shallow detail information during upsampling, especially in areas dense with small targets, a more refined fusion mechanism is needed. We introduce the Scale Aware Concatenation (SAC) module. Its computational process for fusing large (\(F_l\)), medium (\(F_m\)), and small (\(F_s\)) scale features is:

$$ F'_l = \text{MaxPool}(F_l) + \text{AvgPool}(F_l) $$

$$ F'_s = \text{Upsample}(F_s) $$

$$ F_{\text{fuse}} = \text{Concat}(F'_l, F_m, F'_s) $$

Here, \(\text{MaxPool}\) and \(\text{AvgPool}\) denote max pooling and average pooling, respectively, and \(\text{Upsample}\) is an upsampling operation. This method effectively balances detail and global information.
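A NumPy sketch of the three equations above, assuming 2×2 pooling with stride 2 and 2× nearest-neighbour upsampling so that all three inputs meet at the middle scale (sizes are illustrative):

```python
import numpy as np

def pool2x(x, mode):
    """2x2 stride-2 max or average pooling over a (C, H, W) map."""
    c, h, w = x.shape
    blocks = x.reshape(c, h // 2, 2, w // 2, 2)
    return blocks.max(axis=(2, 4)) if mode == 'max' else blocks.mean(axis=(2, 4))

def upsample2x(x):
    """2x nearest-neighbour upsampling."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def sac(f_l, f_m, f_s):
    f_l2 = pool2x(f_l, 'max') + pool2x(f_l, 'avg')  # F'_l: pooled large scale
    f_s2 = upsample2x(f_s)                          # F'_s: upsampled small scale
    return np.concatenate([f_l2, f_m, f_s2], axis=0)  # F_fuse

rng = np.random.default_rng(2)
f_fuse = sac(rng.standard_normal((64, 40, 40)),
             rng.standard_normal((64, 20, 20)),
             rng.standard_normal((64, 10, 10)))
```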

Furthermore, to enhance shallow information at the source, the STSSF structure reuses the C3DSF module to efficiently fuse features from the introduced P2 layer with P3 and P4. This strategy significantly boosts the model’s perception of tiny objects and its robustness in complex backgrounds, crucial for UAV applications, without a substantial increase in computational burden.

Shared Convolution Precision Detection Head (SCPD)

In UAV detection tasks, the model must precisely locate and classify objects across multiple scales simultaneously. Mainstream detectors like YOLOv8 employ independent detection heads, where separate convolution layers are dedicated to each feature scale. This design poses two problems for UAV detection. First, it leads to parameter redundancy and inconsistent learning because each head learns independently; the imbalanced distribution of small object samples across scales can cause conflicting feature representations for the same class of objects. Second, the Batch Normalization these heads rely on is sensitive to batch size, and its performance degrades under the memory-constrained, small-batch training scenarios common on UAV platforms.

To solve these problems and improve the consistency and robustness of multi-scale feature representation, we design a novel Shared Convolution Precision Detection Head (SCPD). Its core innovations are task-specific parameter sharing across scales, the introduction of Group Normalization, and learnable scale adaptation.

First, to address the instability of BatchNorm with small batch sizes, we replace it with Group Normalization (GN) in the detection head. GN’s computation is independent of batch size, providing more stable performance. The key design is cross-scale parameter sharing. Specifically, feature maps from all scales share the same set of convolution weights for the final classification branch and another shared set for the regression branch before generating their respective outputs. This “within-task sharing” strategy forces the model to learn scale-invariant general features, enhancing multi-scale consistency while significantly reducing the number of parameters in the detection head.

However, forcing strict sharing may neglect the unique characteristics of each scale. Therefore, we introduce a learnable scale factor after the shared convolutions to adaptively adjust the contribution weight of feature maps from each scale. This maintains feature consistency while preserving flexibility to scale variations, which is vital for handling the diverse object sizes encountered by UAVs.
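The head's three ingredients (Group Normalization, one weight set shared across all scales, and a per-scale learnable factor) can be sketched as follows for a single regression branch; the weight shape, group count, and feature sizes are illustrative, and GN's learned affine parameters are omitted:

```python
import numpy as np

rng = np.random.default_rng(3)

def group_norm(x, groups=8, eps=1e-5):
    """GroupNorm over a single (C, H, W) map; independent of batch size."""
    c, h, w = x.shape
    g = x.reshape(groups, c // groups, h, w)
    g = (g - g.mean(axis=(1, 2, 3), keepdims=True)) / \
        np.sqrt(g.var(axis=(1, 2, 3), keepdims=True) + eps)
    return g.reshape(c, h, w)

# One weight set shared by every scale (here: a 4-output box-regression branch)
w_shared = rng.standard_normal((4, 64)) * 0.1
# Learnable per-scale factors (initialized to 1)
scale_factors = np.ones(3)

def scpd_branch(feats):
    outs = []
    for i, f in enumerate(feats):
        f = group_norm(f)                           # GN instead of BN
        y = np.einsum('oc,chw->ohw', w_shared, f)   # same weights at every scale
        outs.append(scale_factors[i] * y)           # per-scale adaptation
    return outs

feats = [rng.standard_normal((64, s, s)) for s in (40, 20, 10)]
outs = scpd_branch(feats)
```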

Experimental Results and Analysis

Dataset

This study employs the VisDrone2019-DET dataset for validation. Collected by the AISKYEYE team at Tianjin University in China, it encompasses UAV images from multiple scenarios and viewpoints, containing ten object categories such as pedestrians and vehicles. Following the COCO definition and considering practical UAV application needs, objects occupying less than 1% of the image area or with dimensions smaller than 32×32 pixels are defined as small targets. Statistics show that over 70% of targets in this dataset have width and height below 50 pixels, a typical characteristic of densely distributed small objects, making the dataset well suited to evaluating small-target detection models.
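Under the stated criteria, a box counts as a small target if it covers under 1% of the image area or fits within 32×32 pixels; a minimal check (the helper name and the 640×640 default are illustrative):

```python
def is_small_target(w, h, img_w=640, img_h=640, area_frac=0.01, abs_size=32):
    """Small if it covers < 1% of the image area or fits within 32x32 px."""
    return (w * h) / (img_w * img_h) < area_frac or (w < abs_size and h < abs_size)

small = is_small_target(30, 30)       # tiny box: small by both criteria
large = is_small_target(100, 100)     # ~2.4% of a 640x640 image and > 32 px
```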

Experimental Setup and Evaluation Metrics

Experiments were conducted on Ubuntu 20.04 using Python 3.9, PyTorch 2.2.2, and CUDA 12.1. Hardware included an Intel Core i9-13900KF processor and an NVIDIA RTX 4090 GPU with 24GB VRAM. Key training parameters were: 400 epochs, batch size of 8, input image size 640×640, SGD optimizer with an initial learning rate of 0.01 and weight decay of 0.0005.

Evaluation metrics include Precision, Recall, mean Average Precision (mAP), parameters, and Frames Per Second (FPS). Precision measures the accuracy of positive predictions. Recall evaluates the ability to identify all positive samples. The primary metric, mAP@0.5, computes the average precision over all classes at an Intersection over Union (IoU) threshold of 0.5. APsmall specifically measures the average precision for small objects. Parameters count the total learnable parameters, and FPS indicates inference speed.
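These metrics all build on the IoU between predicted and ground-truth boxes; a minimal sketch with hand-picked boxes and counts (purely illustrative, not results from the paper):

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned (x1, y1, x2, y2) boxes."""
    ix = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    iy = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = ix * iy
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# For mAP@0.5, a prediction is a true positive when IoU with a ground truth >= 0.5
match = iou((0, 0, 10, 10), (1, 1, 11, 11)) >= 0.5

# Precision and recall from illustrative TP/FP/FN counts
tp, fp, fn = 8, 2, 4
precision, recall = tp / (tp + fp), tp / (tp + fn)
```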

Comparative Experimental Results

To validate the effectiveness of the proposed YOLO-LiteMax model, we compared it comprehensively with YOLO series models, other lightweight models, and mainstream detectors on the VisDrone2019-DET dataset.

The results are summarized in Table 1. YOLO-LiteMax achieves a mAP@0.5 of 45.2% and an APsmall of 15.3%, using only 6.1M parameters and 30.0G FLOPs, demonstrating excellent detection accuracy and deployment efficiency. Notably, its APsmall is significantly higher than that of many models optimized for small targets, indicating superior perception and feature extraction capabilities for UAV imagery.

Table 1. Comparison with YOLO-series and lightweight models on VisDrone2019-DET (“—” marks values not reported).

| Model | mAP@0.5 (%) | APsmall (%) | FLOPs (G) | Params (M) | FPS |
| --- | --- | --- | --- | --- | --- |
| YOLOv5s | 33.2 | 11.2 | 15.8 | 7.0 | 125 |
| YOLOv5m | 36.3 | 12.6 | 48.0 | 20.9 | 52 |
| YOLOv8s | 39.3 | 12.5 | 28.5 | 11.1 | 131 |
| YOLOv8m | 44.0 | 13.3 | 78.7 | 25.9 | 87 |
| YOLOv10s | 41.1 | 12.2 | 21.4 | 7.2 | 238 |
| YOLOv10m | 44.4 | 12.2 | 58.9 | 15.3 | 122 |
| YOLOv11s | 40.3 | 12.4 | 21.3 | 9.4 | 212 |
| PC-YOLO11s | 43.8 | — | — | 7.1 | — |
| EdgeYOLO-S | 44.8 | — | 109.1 | 40.5 | 34 |
| Drone-YOLO-N | 38.1 | 12.0 | — | 3.1 | — |
| Drone-YOLO-S | 44.3 | 14.2 | — | 10.9 | — |
| VAMYOLOX-T | 28.3 | — | 14.5 | 5.7 | 267 |
| YOLO-NAS-S | 42.5 | 13.1 | — | 19.0 | 38 |
| PP-YOLOE-S | 39.9 | 13.9 | 17.36 | 7.9 | 46 |
| Modified-YOLOv8 | 42.2 | 9.7 | 19.2 | — | 143 |
| PVswin-YOLOv8s | 43.3 | — | 41.8 | — | — |
| YOLO-LiteMax (Ours) | 45.2 | 15.3 | 30.0 | 6.1 | 118 |

Compared to the baseline YOLOv8s, YOLO-LiteMax improves mAP@0.5 by 5.9 percentage points and APsmall by 2.8 percentage points, while also outperforming YOLOv5 and YOLOv10 models of similar scale. Against advanced models like YOLO-NAS-S, our model surpasses them in accuracy while using only about one-third of the parameters. Compared to speed-oriented models like VAMYOLOX-T, YOLO-LiteMax offers a decisive advantage in precision. When competing with other lightweight models like PC-YOLO11s and EdgeYOLO, our model achieves comparable or superior performance at lower computational cost. Most notably, compared to the extremely lightweight Drone-YOLO-N, YOLO-LiteMax achieves a substantial 7.1-percentage-point gain in mAP@0.5 with a reasonable parameter increase, showcasing the exceptional “performance per parameter” ratio of our proposed modules.

As shown in Table 2, compared to classical detectors and modern Transformer-based models, YOLO-LiteMax maintains a strong competitive edge, particularly in efficiency, which is critical for UAV platforms.

Table 2. Comparison with classical and Transformer-based detectors on VisDrone2019-DET.

| Model | mAP@0.5 (%) | APsmall (%) | FLOPs (G) | Params (M) |
| --- | --- | --- | --- | --- |
| SSD | 10.6 | 3.2 | 31.4 | 34.0 |
| Faster R-CNN | 37.2 | 15.4 | 118.8 | 41.4 |
| Cascade R-CNN | 39.1 | 13.5 | 189.1 | 69.0 |
| RetinaNet | 19.1 | 6.3 | 36.4 | 35.7 |
| CenterNet | 33.7 | 11.5 | 192.2 | 70.8 |
| EfficientDet | 21.2 | 8.5 | 55.0 | 20.7 |
| Swin Transformer | 35.6 | 12.0 | 44.5 | 34.2 |
| RT-DETR-L | 45.0 | 14.8 | 103.5 | 32.0 |
| DMNet | 43.6 | 14.2 | 101.7 | 39.4 |
| YOLO-LiteMax (Ours) | 45.2 | 15.3 | 30.0 | 6.1 |

Ablation Study

A series of ablation experiments were conducted on the VisDrone2019-DET dataset to verify the effectiveness of each proposed module, with results presented in Table 3.

Table 3. Ablation results on VisDrone2019-DET (baseline: YOLOv8s; modules added cumulatively).

| Baseline | SCB | STSSF | SCPD | Precision (%) | Recall (%) | mAP@0.5 (%) | APsmall (%) | Params (M) | FPS |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ✓ | | | | 50.9 | 38.2 | 39.3 | 12.5 | 11.1 | 131 |
| ✓ | ✓ | | | 50.5 | 38.3 | 39.5 | 12.2 | 8.3 | 156 |
| ✓ | ✓ | ✓ | | 55.8 | 41.7 | 44.2 | 14.8 | 6.9 | 124 |
| ✓ | ✓ | ✓ | ✓ | 56.3 | 42.7 | 45.2 | 15.3 | 6.1 | 118 |

The results clearly demonstrate the contribution of each module. Introducing SCB to the baseline leverages its partial convolution mechanism to reduce redundant computation, boosting inference speed from 131 to 156 FPS and reducing parameters by 2.8M. The slight mAP increase of 0.2 percentage points confirms SCB’s dual benefit of lowering cost while improving feature quality. Adding the STSSF structure brings a dramatic performance leap: mAP@0.5 and APsmall increase by 4.7 and 2.6 percentage points, respectively. This surge is attributed to the powerful cross-scale correlation STSSF builds via 3D convolution, greatly enhancing feature representation for small targets. Although this refined fusion reduces FPS to 124, trading roughly 20% of the speed for a qualitative improvement in detection is justified. Finally, incorporating SCPD to form the complete model yields a further 1.0-percentage-point gain in mAP@0.5 with FPS stabilizing at 118, thanks to its cross-scale consistency enhancement at minimal extra computation.

In summary, the final YOLO-LiteMax model achieves a significant 5.9-percentage-point improvement in mAP@0.5 and a 2.8-percentage-point gain in the critical APsmall metric at the cost of approximately 10% of the baseline's inference speed. This demonstrates that our improvements constitute an efficient optimization of model performance while maintaining high real-time capability, showing excellent deployment potential for UAV systems.

Visualization Analysis

To intuitively demonstrate the detection effectiveness of our method, we conducted inference experiments comparing YOLO-LiteMax, YOLOv8s, and YOLOv11s on representative scenes from the VisDrone2019-DET dataset containing numerous small objects, simulating typical UAV monitoring scenarios. The visualization results clearly show that YOLO-LiteMax successfully detects more small and occluded targets (e.g., distant motorcycles, pedestrians in crowds, cyclists on crosswalks) that are missed by the other models, while also reducing misclassifications (e.g., avoiding labeling a traffic sign as a car). These visual comparisons corroborate the quantitative metrics, confirming the superior small-target detection capability and robustness of the proposed model in complex aerial environments.

Conclusion

Addressing the challenge of balancing accuracy and efficiency in detecting small objects from UAV aerial imagery, this paper proposed an improved lightweight detection model named YOLO-LiteMax. By introducing the Selective Convolution Block (SCB) to reduce computational redundancy, designing the Small Target Scale Sequence Fusion (STSSF) structure to enhance cross-scale feature representation, and constructing the Shared Convolution Precision Detection Head (SCPD) to improve detection consistency, the model significantly enhances the overall performance of YOLOv8. Experimental results on the VisDrone2019-DET dataset demonstrate that YOLO-LiteMax achieves excellent detection accuracy, showing clear advantages over the baseline and various mainstream algorithms, especially in small object recognition, validating the effectiveness of our approach for UAV applications.

Although our model achieves a substantial gain in detection precision at the cost of a minor decrease in inference speed, an outstanding “performance-efficiency ratio”, pursuing ultimate real-time performance remains a direction for future work. The next steps will focus on model quantization and pruning techniques to further compress the model and increase inference speed, aiming for higher-performance deployment on the extremely resource-constrained edge computing platforms common in UAV ecosystems.
