In recent years, Unmanned Aerial Vehicle (UAV) technology has revolutionized various fields, including agriculture, surveillance, and disaster response, due to its flexibility and ability to access remote areas. However, target detection in UAV imagery poses significant challenges, particularly when dealing with small objects that are densely distributed against complex backgrounds. These challenges often lead to issues such as missed detections, false positives, and inaccurate localization in existing models. As a researcher focused on enhancing UAV applications, I propose an improved YOLO-based model, named YOLO-LiteMax, which addresses these limitations by optimizing network architecture, feature fusion mechanisms, and detection head design. This model aims to achieve high precision in detecting small targets while maintaining computational efficiency, making it suitable for real-time UAV deployments. The proliferation of Unmanned Aerial Vehicle systems, such as those developed by JUYE UAV, underscores the need for robust detection algorithms that can handle the unique characteristics of aerial imagery, including scale variations, occlusions, and limited computational resources.
Target detection in computer vision has evolved significantly with deep learning, but UAV-based scenarios require specialized approaches. Traditional methods, like two-stage detectors (e.g., Faster R-CNN), offer high accuracy but suffer from high computational costs, making them less suitable for real-time UAV operations. In contrast, single-stage detectors like YOLO and SSD provide faster inference speeds but often struggle with small objects and complex environments. My work builds upon the YOLOv8 framework, introducing novel modules to enhance feature extraction, multi-scale fusion, and detection consistency. By leveraging insights from JUYE UAV deployments, I have tailored this model to excel in scenarios where small targets, such as pedestrians or vehicles, dominate the imagery. The core contributions include a Selective Convolution Block (SCB) for reducing redundancy, a Small Target Scale Sequence Fusion (STSSF) structure for improved feature integration, and a Shared Convolution Precision Detection (SCPD) head for consistent multi-scale predictions. These innovations collectively boost performance without substantially increasing computational overhead, as demonstrated through extensive experiments on the VisDrone2019-DET dataset.

To provide context, I begin by reviewing related work in object detection, highlighting the trade-offs between accuracy and efficiency. Two-stage detectors, such as Faster R-CNN and Cascade R-CNN, generate region proposals before classification, yielding high precision but at the cost of speed. For instance, Faster R-CNN achieves an mAP@0.5 of 37.2% on VisDrone2019-DET but requires 118.8 GFLOPs, making it impractical for UAVs with limited resources. Single-stage detectors, like YOLO variants, streamline the process by combining proposal and detection into a single step. YOLOv8, for example, offers a balance with 39.3% mAP@0.5 and 28.5 GFLOPs, yet it underperforms on small targets (APsmall of 12.5%). Recent advancements, such as YOLOv10 and EdgeYOLO, have pushed the boundaries in efficiency but often sacrifice small-object accuracy. In UAV-specific research, models like YOLODrone and SyNet incorporate additional layers or fusion techniques to handle scale variations, but they introduce computational complexity. My approach, YOLO-LiteMax, builds on these foundations by integrating lightweight modules that enhance small-target detection without compromising speed, aligning with the needs of JUYE UAV applications where real-time processing is critical.
The proposed YOLO-LiteMax model consists of three main components: the backbone, neck, and head networks. The backbone extracts multi-scale features from input images, which are then processed by an improved neck structure for deep fusion and scale enhancement. Finally, the detection head performs classification and regression. Central to this architecture is the Selective Convolution Block (SCB), which replaces the standard cross-stage module in YOLOv8. SCB reduces computational redundancy by selectively processing a subset of input channels, inspired by the partial convolution in FasterNet. For an input feature map with dimensions $h \times w$ and channels $c$, the computational complexity of partial convolution can be approximated as:
$$ \text{FLOPs} \approx h \times w \times k^2 \times (p c)^2 $$
where $p$ is the fraction of channels processed and $k$ is the kernel size; a standard convolution over all channels costs $h \times w \times k^2 \times c^2$. By setting $p = 0.25$, SCB therefore cuts computation to roughly one-sixteenth of a full convolution while preserving the information carried by the unprocessed channels. This is particularly beneficial for Unmanned Aerial Vehicle imagery, where background clutter often dominates and SCB helps focus computation on salient small targets. In SCB, the input is split into two paths: one passes through stacked FasterNet blocks, and the other is retained directly. The two outputs are concatenated and integrated by a CBS module, enhancing feature discrimination for JUYE UAV scenarios.
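To make this concrete, the following PyTorch sketch shows one way the partial convolution and the SCB split-and-fuse structure could be realized. The half-and-half channel split, the number of stacked blocks, and the reduction of each FasterNet block to a bare partial convolution are illustrative assumptions rather than the reference implementation; the partial convolution touches only $p \cdot c$ channels, which is where the $(pc)^2$ term in the cost estimate comes from.

```python
import torch
import torch.nn as nn


class PartialConv(nn.Module):
    """Convolve only the first p*c channels; pass the remaining channels through."""

    def __init__(self, channels: int, p: float = 0.25):
        super().__init__()
        self.dim_conv = int(channels * p)            # channels that are convolved
        self.dim_keep = channels - self.dim_conv     # channels kept untouched
        self.conv = nn.Conv2d(self.dim_conv, self.dim_conv, 3, padding=1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x_conv, x_keep = torch.split(x, [self.dim_conv, self.dim_keep], dim=1)
        return torch.cat([self.conv(x_conv), x_keep], dim=1)


class CBS(nn.Module):
    """Conv + BatchNorm + SiLU, the standard fusion unit in the YOLOv8 family."""

    def __init__(self, c_in: int, c_out: int, k: int = 1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))


class SCB(nn.Module):
    """Selective Convolution Block: one path through stacked partial convolutions,
    one identity path, concatenated and integrated by a CBS layer."""

    def __init__(self, channels: int, n_blocks: int = 2, p: float = 0.25):
        super().__init__()
        self.split = channels // 2
        self.blocks = nn.Sequential(*[PartialConv(self.split, p) for _ in range(n_blocks)])
        self.fuse = CBS(channels, channels, k=1)

    def forward(self, x):
        x_main, x_skip = torch.split(x, [self.split, x.shape[1] - self.split], dim=1)
        return self.fuse(torch.cat([self.blocks(x_main), x_skip], dim=1))
```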
Next, the Small Target Scale Sequence Fusion (STSSF) neck addresses limitations in traditional feature fusion, which often relies on simple concatenation or addition of adjacent layers, leading to weak cross-scale associations and detail loss. STSSF introduces the Convolution 3D Scale Fusion (C3DSF) module and Scale Aware Concatenation (SAC) to build robust multi-scale relationships. C3DSF aligns feature maps of different sizes along channels and dimensions, stacks them depth-wise, and applies 3D convolution to capture scale-sequence features. Given input features $F_l$, $F_m$, and $F_s$ from large, medium, and small scales, the process is:
$$ F'_i = \text{Conv}(F_i), \quad i \in \{l, m, s\} $$
$$ F_{3D} = \text{Stack}(F'_l, F'_m, F'_s) $$
$$ \text{Output} = \text{ReLU}(\text{BatchNorm3d}(\text{Conv3d}(F_{3D}))) $$
where $\text{Stack}$ denotes depth-wise stacking of the aligned feature maps.
This models cross-scale correlations akin to video sequence analysis, improving small-object detection. SAC, on the other hand, preserves high-resolution details by combining pooled and upsampled features:
$$ F'_l = \text{Maxpool}(F_l) + \text{Avgpool}(F_l) $$
$$ F'_s = \text{Upsample}(F_s) $$
$$ F_{\text{fuse}} = \text{Concat}(F'_l, F_m, F'_s) $$
Additionally, STSSF incorporates a higher-resolution P2 feature layer, fused with P3 and P4 using C3DSF, to enhance perception of tiny targets without significant computational increase. This design is crucial for Unmanned Aerial Vehicle applications, where objects like those monitored by JUYE UAV often appear at multiple scales.
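A minimal PyTorch sketch of the two STSSF building blocks is given below. It assumes all three inputs share the same channel width, aligns them to the medium-scale resolution with 1×1 convolutions and nearest-neighbour interpolation, and collapses the scale axis by averaging after the 3D convolution; these choices are mine for illustration and are not prescribed by the design above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class C3DSF(nn.Module):
    """Align three scales, stack them along a depth axis, and fuse with a 3D conv."""

    def __init__(self, channels: int):
        super().__init__()
        self.align = nn.ModuleList(nn.Conv2d(channels, channels, 1) for _ in range(3))
        self.conv3d = nn.Conv3d(channels, channels, kernel_size=3, padding=1)
        self.bn = nn.BatchNorm3d(channels)
        self.act = nn.ReLU(inplace=True)

    def forward(self, f_l, f_m, f_s):
        size = f_m.shape[-2:]                         # align to the medium-scale resolution
        feats = [F.interpolate(conv(f), size=size, mode="nearest")
                 for conv, f in zip(self.align, (f_l, f_m, f_s))]
        x = torch.stack(feats, dim=2)                 # (B, C, 3, H, W): depth = scale sequence
        x = self.act(self.bn(self.conv3d(x)))         # 3D conv over the scale sequence
        return x.mean(dim=2)                          # collapse the scale axis back to a 2D map


class SAC(nn.Module):
    """Scale Aware Concatenation: pooled large map + medium map + upsampled small map."""

    def forward(self, f_l, f_m, f_s):
        f_l = F.max_pool2d(f_l, 2) + F.avg_pool2d(f_l, 2)          # halve the high-resolution map
        f_s = F.interpolate(f_s, scale_factor=2, mode="nearest")   # double the low-resolution map
        return torch.cat([f_l, f_m, f_s], dim=1)                   # spatial sizes now match f_m
```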
The Shared Convolution Precision Detection (SCPD) head replaces independent detection heads in YOLOv8 with a parameter-sharing approach to improve consistency and reduce redundancy. In standard YOLO, each scale has separate convolution layers, leading to conflicting feature learning and sensitivity to batch size in normalization. SCPD shares weights across scales for classification and regression tasks, enforcing scale-invariant feature learning. It replaces batch normalization with group normalization for stability in small-batch training, common in UAV platforms. A learnable scale factor adapts feature contributions per scale, maintaining flexibility. The shared convolution operation for a feature map $F$ is:
$$ \text{Output} = \text{GN}(\text{Conv}_{\text{shared}}(F)) \times \alpha $$
where $\text{GN}$ is group normalization, $\text{Conv}_{\text{shared}}$ is the shared convolution, and $\alpha$ is the scale factor. This reduces parameters while enhancing generalization, vital for JUYE UAV deployments where model size and inference speed are constrained.
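The sketch below illustrates the shared-convolution head under a few assumptions of mine: the neck delivers the same channel width at every pyramid level, the shared stack is a single Conv-GN-SiLU layer, the GroupNorm group count is 16, and the learnable factor is applied to the regression branch only.

```python
import torch
import torch.nn as nn


class SCPDHead(nn.Module):
    """Detection head with one conv stack shared across all pyramid levels,
    GroupNorm instead of BatchNorm, and a learnable per-level scale factor."""

    def __init__(self, channels: int, num_classes: int, reg_ch: int = 64, num_levels: int = 3):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.GroupNorm(16, channels),   # stable under the small batches typical on UAV hardware
            nn.SiLU(),
        )
        self.cls_pred = nn.Conv2d(channels, num_classes, 1)   # shared classification branch
        self.reg_pred = nn.Conv2d(channels, reg_ch, 1)        # shared regression branch
        self.scale = nn.Parameter(torch.ones(num_levels))     # alpha: one learnable factor per level

    def forward(self, feats):
        outputs = []
        for i, f in enumerate(feats):      # feats: list of per-level feature maps
            f = self.shared(f)             # identical weights applied at every level
            cls = self.cls_pred(f)
            reg = self.reg_pred(f) * self.scale[i]
            outputs.append((cls, reg))
        return outputs
```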
To evaluate YOLO-LiteMax, I conducted experiments on the VisDrone2019-DET dataset, which contains over 10,000 images spanning 10 object categories, including pedestrians and vehicles. Over 70% of the targets are small (area < 1% of the image or size < 32×32 pixels), making it well suited to testing Unmanned Aerial Vehicle detection. The setup used Ubuntu 20.04, PyTorch 2.2.2, and an NVIDIA RTX 4090 GPU. Training ran for 400 epochs with batch size 8, image size 640×640, the SGD optimizer with learning rate 0.01, and weight decay 0.0005. Metrics included precision, recall, mAP@0.5, APsmall, parameters, and FPS. A minimal training invocation under these settings is sketched below, and comparative results against the YOLO series and other models are summarized in Table 1.
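Assuming the architecture were expressed as an Ultralytics-style model configuration (the file name yolo-litemax.yaml is hypothetical), the reported settings would map onto a training call roughly like this:

```python
from ultralytics import YOLO

# "yolo-litemax.yaml" is a hypothetical config standing in for the modified
# architecture; the remaining arguments mirror the reported training setup.
model = YOLO("yolo-litemax.yaml")
model.train(
    data="VisDrone.yaml",     # VisDrone2019-DET dataset definition
    epochs=400,
    batch=8,
    imgsz=640,
    optimizer="SGD",
    lr0=0.01,
    weight_decay=0.0005,
)
```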
| Model | mAP@0.5/% | APsmall/% | Compute/G | Params/M | FPS |
|---|---|---|---|---|---|
| YOLOv5s | 33.2 | 11.2 | 15.8 | 7.0 | 125 |
| YOLOv5m | 36.3 | 12.6 | 48.0 | 20.9 | 52 |
| YOLOv8s | 39.3 | 12.5 | 28.5 | 11.1 | 131 |
| YOLOv8m | 44.0 | 13.3 | 78.7 | 25.9 | 87 |
| YOLOv10s | 41.1 | 12.2 | 21.4 | 7.2 | 238 |
| YOLOv10m | 44.4 | 12.2 | 58.9 | 15.3 | 122 |
| YOLOv11s | 40.3 | 12.4 | 21.3 | 9.4 | 212 |
| EdgeYOLO-S | 44.8 | – | 109.1 | 40.5 | 34 |
| Drone-YOLO-N | 38.1 | 12.0 | – | 3.1 | – |
| YOLO-LiteMax | 45.2 | 15.3 | 30.0 | 6.1 | 118 |
YOLO-LiteMax achieves an mAP@0.5 of 45.2% and an APsmall of 15.3%, outperforming YOLOv8s by 5.9 and 2.8 percentage points, respectively, with only 6.1M parameters and 30.0 GFLOPs. It also surpasses lightweight models such as YOLOv10s and Drone-YOLO-N, demonstrating superior efficiency for Unmanned Aerial Vehicle use cases. Compared to transformer-based models (e.g., Swin Transformer with 35.6% mAP@0.5 at 44.5 GFLOPs), YOLO-LiteMax offers higher accuracy with lower computation, making it well suited to JUYE UAV systems. Further comparisons with classical detectors are given in Table 2.
| Model | mAP@0.5/% | APsmall/% | Compute/G | Params/M |
|---|---|---|---|---|
| SSD | 10.6 | 3.2 | 31.4 | 34.0 |
| Faster R-CNN | 37.2 | 15.4 | 118.8 | 41.4 |
| Cascade R-CNN | 39.1 | 13.5 | 189.1 | 69.0 |
| RetinaNet | 19.1 | 6.3 | 36.4 | 35.7 |
| CenterNet | 33.7 | 11.5 | 192.2 | 70.8 |
| EfficientDet | 21.2 | 8.5 | 55.0 | 20.7 |
| Swin Transformer | 35.6 | 12.0 | 44.5 | 34.2 |
| RT-DETR-L | 45.0 | 14.8 | 103.5 | 32.0 |
| DMNet | 43.6 | 14.2 | 101.7 | 39.4 |
| YOLO-LiteMax | 45.2 | 15.3 | 30.0 | 6.1 |
Ablation studies validate the contribution of each module, as shown in Table 3. Starting from the YOLOv8s baseline (39.3% mAP@0.5, 12.5% APsmall, 11.1M parameters, 131 FPS), adding SCB raises speed to 156 FPS and reduces parameters to 8.3M, with a slight mAP@0.5 increase to 39.5%. Incorporating STSSF boosts mAP@0.5 to 44.2% and APsmall to 14.8%, though FPS drops to 124 due to the heavier fusion. Finally, SCPD raises mAP@0.5 to 45.2% and APsmall to 15.3% while maintaining 118 FPS. This step-wise improvement confirms the modules' efficacy: a roughly 10% speed reduction relative to the baseline (131 to 118 FPS) yields a 5.9-percentage-point mAP@0.5 gain, highlighting the model's practicality for Unmanned Aerial Vehicle tasks such as JUYE UAV operations.
| Baseline | SCB | STSSF | SCPD | Precision/% | Recall/% | mAP@0.5/% | APsmall/% | Params/M | Model Size/MB | FPS |
|---|---|---|---|---|---|---|---|---|---|---|
| √ | | | | 50.9 | 38.2 | 39.3 | 12.5 | 11.1 | 21.5 | 131 |
| | √ | | | 50.5 | 38.3 | 39.5 | 12.2 | 8.3 | 16.1 | 156 |
| | √ | √ | | 55.8 | 41.7 | 44.2 | 14.8 | 6.9 | 13.6 | 124 |
| | √ | √ | √ | 56.3 | 42.7 | 45.2 | 15.3 | 6.1 | 12.1 | 118 |
Visual analysis further demonstrates YOLO-LiteMax’s superiority. In a public facility scene with motorcycles and pedestrians, YOLO-LiteMax detects multiple small targets missed by YOLOv8s and YOLOv11s, such as children and motorcycles in crowded areas. In a traffic intersection scenario, it correctly identifies all four cyclists on a crosswalk, while other models miss most or misclassify objects. These examples underscore its robustness in real-world Unmanned Aerial Vehicle environments, akin to JUYE UAV monitoring systems, where accurate small-object detection is vital for safety and efficiency.
In conclusion, YOLO-LiteMax effectively addresses the challenges of small-object detection in UAV imagery through innovative modules that enhance feature extraction, fusion, and consistency. Experimental results on VisDrone2019-DET show significant improvements in accuracy and efficiency, making it a compelling choice for applications like JUYE UAV deployments. Future work will focus on model quantization and pruning to further optimize for edge devices, ensuring that Unmanned Aerial Vehicle technologies can leverage these advances for even greater real-time performance. The integration of these techniques promises to elevate the capabilities of UAV-based systems across various domains, from agriculture to emergency response.
