
With the widespread adoption of UAV technology, aerial image analysis has become a cornerstone of applications such as precision agriculture, intelligent traffic management, emergency response, and infrastructure inspection. However, small objects in these images occupy very few pixels, carry weak semantic features, and are frequently occluded or lost in complex backgrounds, which leads to missed detections and false positives. While YOLO-series detectors offer a good trade-off between real-time performance and accuracy, existing models still struggle with small object detection in UAV scenarios. To address these limitations, we propose an enhanced YOLOv11 model that integrates three modifications: a Convolutional Attention Fusion Module (CAFM) in the neck, a dynamic detection head (DyHead), and the WIoUv3 bounding-box regression loss. Together, these changes improve the model's ability to perceive tiny targets, handle occlusion, and localize objects robustly.
Related Work and Baseline Overview
Object detection algorithms are broadly categorized into two-stage and one-stage frameworks. Two-stage methods such as Faster R-CNN and Mask R-CNN first generate region proposals and then classify and refine them, achieving high accuracy at the cost of inference speed. One-stage detectors such as SSD and the YOLO family predict object classes and bounding boxes directly from the whole image in a single pass, offering better real-time performance. Among them, YOLOv11 has gained attention for its efficient design: it replaces the C2f module with C3k2 to improve feature extraction, introduces a C2PSA module that combines multi-head attention with a feed-forward network, and uses depthwise separable convolution (DWConv) in its detection head to reduce computational cost. Despite these advances, YOLOv11 still struggles with small object detection in UAV imagery, especially under dense clutter and illumination variations.
Proposed Improvements
We introduce three targeted enhancements to YOLOv11, each designed to overcome a specific bottleneck in UAV imagery. The core modifications are detailed below.
1. Dynamic Detection Head (DyHead)
The default YOLOv11 detection head operates on the output of a simple feature pyramid network (FPN) and treats each pyramid level largely independently, so it lacks sufficient cross-level interaction and is less effective on dense small targets. DyHead unifies scale-aware, spatial-aware, and task-aware attention mechanisms in a single dynamic head. The scale-aware attention adaptively adjusts feature weights for objects of different sizes; the spatial-aware attention highlights critical regions while suppressing background clutter; the task-aware attention dynamically reweights features according to the demands of the classification and regression subtasks. These three attention modules are cascaded and applied to the input feature tensor $F$. The operation can be expressed as:
$$
W(F) = \pi_C\big(\pi_S\big(\pi_L(F) \times F\big) \times F\big) \times F
$$
Here, $\pi_L$, $\pi_S$, and $\pi_C$ represent the scale-aware, spatial-aware, and task-aware attention functions, respectively, and $F$ is the feature tensor spanning pyramid levels, spatial positions, and channels. By applying attention along all three dimensions, DyHead strengthens the feature representation of small objects in UAV images.
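The cascade in the equation above can be illustrated with a minimal pure-Python sketch. It only shows the order of operations (level gate, then spatial gate, then channel gate). The three gate functions are simplified placeholders chosen for illustration, not the learned modules of DyHead, which rely on learned convolutions and a dynamic activation. The function name `dyhead_sketch` and the `F[level][position][channel]` layout are assumptions.

```python
def hard_sigmoid(x):
    """Clamp (x + 1) / 2 into the range [0, 1]."""
    return max(0.0, min(1.0, (x + 1.0) / 2.0))


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def dyhead_sketch(F):
    """Apply level, spatial, and channel gates in sequence.

    F is a nested list indexed as F[level][position][channel].
    Each gate is a placeholder computed by average pooling.
    """
    L, S, C = len(F), len(F[0]), len(F[0][0])

    # pi_L placeholder: one gate per pyramid level.
    g_L = [hard_sigmoid(mean(v for pos in F[l] for v in pos)) for l in range(L)]
    F1 = [[[g_L[l] * F[l][s][c] for c in range(C)]
           for s in range(S)] for l in range(L)]

    # pi_S placeholder: one gate per spatial position, pooled over levels and channels.
    g_S = [hard_sigmoid(mean(F1[l][s][c] for l in range(L) for c in range(C)))
           for s in range(S)]
    F2 = [[[g_S[s] * F1[l][s][c] for c in range(C)]
           for s in range(S)] for l in range(L)]

    # pi_C placeholder: one gate per channel, pooled over levels and positions.
    g_C = [hard_sigmoid(mean(F2[l][s][c] for l in range(L) for s in range(S)))
           for c in range(C)]
    return [[[g_C[c] * F2[l][s][c] for c in range(C)]
             for s in range(S)] for l in range(L)]


# Toy tensor: 2 levels, 2 positions, 2 channels.
F_toy = [[[1.0, -2.0], [0.5, 3.0]],
         [[0.0, 1.0], [2.0, -1.0]]]
W_toy = dyhead_sketch(F_toy)
```

Because every placeholder gate lies in [0, 1], the output keeps the shape of the input and never exceeds it in magnitude. A learned head would replace each gate with trainable layers.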
2. Convolutional Attention Fusion Module (CAFM)
Small objects in UAV images often carry weak semantic information and are easily contaminated by background noise. We introduce CAFM into the neck of YOLOv11 to bridge local and global context. CAFM consists of two parallel branches: a local branch that uses grouped depthwise convolution to capture fine-grained spatial details, and a global branch that employs a simplified self-attention mechanism (with softmax-normalized query-key scores) to model long-range dependencies. The outputs are fused by element-wise addition, yielding feature maps that carry both local detail and global context. The local branch computation is:
$$
F_{conv} = W_{3\times3\times3}\big( CS( W_{1\times1}(Y) ) \big)
$$
where $Y$ is the input, $W_{1\times1}$ and $W_{3\times3\times3}$ denote 1×1 and 3×3×3 convolutions, and $CS$ indicates channel shuffle. The global branch output is:
$$
F_{att} = W_{1\times1}\big( \mathrm{Attention}(\hat{Q}, \hat{K}, \hat{V}) \big) + Y
$$
$$
\mathrm{Attention}(\hat{Q}, \hat{K}, \hat{V}) = \hat{V} \cdot \mathrm{Softmax}(\hat{Q}\hat{K} / \alpha)
$$
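Here $\hat{Q}$, $\hat{K}$, and $\hat{V}$ are the query, key, and value tensors derived from $Y$, and $\alpha$ is a scaling factor. The final output of CAFM is $F_{out} = F_{att} + F_{conv}$, which combines local inductive bias with global context awareness and improves the robustness of small object detection in UAV tasks.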
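As a rough illustration of how the global branch and the final fusion fit together, the sketch below evaluates $F_{att}$ and $F_{out}$ on small nested lists. The tensor shapes are assumptions made so that the products in the equations are well defined: $\hat{Q}$ is C×N, $\hat{K}$ is N×C (so the attention map is C×C), and $\hat{V}$, $Y$, and $F_{conv}$ are N×C, with N spatial positions and C channels. The softmax is taken row-wise, the 1×1 convolution is modeled as a C×C matrix applied at every position, and $F_{conv}$ is passed in rather than computed. All function names are hypothetical.

```python
import math


def matmul(A, B):
    """Plain matrix product of two nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def softmax_rows(M):
    """Numerically stable softmax applied to each row."""
    out = []
    for row in M:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def add(A, B):
    """Element-wise sum of two equally shaped matrices."""
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def cafm_fuse(Y, Q, K, V, W_out, F_conv, alpha=1.0):
    """Return F_out = F_att + F_conv under the shape assumptions above."""
    scores = [[v / alpha for v in row] for row in matmul(Q, K)]  # C x C
    attended = matmul(V, softmax_rows(scores))                   # N x C
    F_att = add(matmul(attended, W_out), Y)                      # 1x1 conv + residual
    return add(F_att, F_conv)


# Toy example with N = 2 positions and C = 2 channels.
Y = [[1.0, 2.0], [3.0, 4.0]]
Q = [[0.1, 0.2], [0.3, 0.4]]          # C x N
K = [[0.5, 0.1], [0.2, 0.3]]          # N x C
V = [[1.0, 0.0], [0.0, 1.0]]          # N x C
W_out = [[1.0, 0.0], [0.0, 1.0]]      # identity stands in for learned weights
F_conv = [[0.5, 0.5], [0.5, 0.5]]     # stand-in for the local branch output
F_out = cafm_fuse(Y, Q, K, V, W_out, F_conv)
```

The residual term `Y` means the global branch can fall back to the identity when the attention output is small, which is one reason the two branches can be summed without rescaling.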
3. WIoUv3 Loss Function
The default YOLOv11 uses CIoU loss, which is not well-suited for small objects and occluded scenarios. We replace it with WIoUv3, which introduces a non-monotonic focusing mechanism. WIoUv3 first computes a basic WIoUv1 loss with a distance-aware penalty:
$$
L_{WIoUv1} = R_{WIoU} L_{IoU}
$$
$$
R_{WIoU} = \exp\Big( \frac{(x-x^{gt})^2 + (y-y^{gt})^2}{W_g^2 + H_g^2} \Big)
$$
where $L_{IoU}=1-\mathrm{IoU}$, $(x, y)$ and $(x^{gt}, y^{gt})$ are the centers of the predicted and ground-truth boxes, and $W_g$, $H_g$ are the width and height of the smallest box enclosing both. WIoUv2 further introduces a monotonic focusing coefficient to emphasize hard examples. The key feature of WIoUv3 is its non-monotonic focusing factor $r$, which dynamically adjusts the gradient gain based on an outlier degree $\beta$:
$$
L_{WIoUv3} = r \cdot L_{WIoUv1}, \quad r = \frac{\beta}{\delta \alpha^{\beta-\delta}}, \quad \beta = \frac{L_{IoU}^{*}}{\overline{L_{IoU}}}
$$
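Here, $\beta$ quantifies the outlier degree of an anchor box as the ratio of its IoU loss $L_{IoU}^{*}$ to the running mean $\overline{L_{IoU}}$, and $\alpha$ and $\delta$ are hyperparameters (this $\alpha$ is unrelated to the attention scaling factor used in CAFM). Anchor boxes with very high or very low quality are suppressed, while medium-quality boxes receive higher gradient gains. This mechanism stabilizes training and improves localization precision for small and occluded objects in UAV datasets.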
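The loss defined above can be evaluated for a single predicted box with the following sketch. It assumes boxes in `(x1, y1, x2, y2)` format with positive area, takes the running mean $\overline{L_{IoU}}$ as an argument instead of tracking it during training, and uses $\alpha = 1.9$ and $\delta = 3$ as illustrative defaults, since the text does not state the values used. Function names are hypothetical, and gradient handling is not modeled.

```python
import math


def iou_xyxy(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def focus_factor(beta, alpha=1.9, delta=3.0):
    """Non-monotonic factor r = beta / (delta * alpha ** (beta - delta))."""
    return beta / (delta * alpha ** (beta - delta))


def wiou_v3(pred, gt, mean_l_iou, alpha=1.9, delta=3.0):
    """WIoUv3 loss for one box pair, given the running mean IoU loss."""
    l_iou = 1.0 - iou_xyxy(pred, gt)

    # Box centers.
    cx, cy = (pred[0] + pred[2]) / 2, (pred[1] + pred[3]) / 2
    cx_gt, cy_gt = (gt[0] + gt[2]) / 2, (gt[1] + gt[3]) / 2

    # Width and height of the smallest box enclosing both boxes.
    w_g = max(pred[2], gt[2]) - min(pred[0], gt[0])
    h_g = max(pred[3], gt[3]) - min(pred[1], gt[1])

    r_wiou = math.exp(((cx - cx_gt) ** 2 + (cy - cy_gt) ** 2) / (w_g ** 2 + h_g ** 2))
    l_v1 = r_wiou * l_iou

    beta = l_iou / mean_l_iou  # outlier degree
    return focus_factor(beta, alpha, delta) * l_v1


gt_box = (10.0, 10.0, 20.0, 20.0)
shifted = (12.0, 11.0, 22.0, 21.0)
loss = wiou_v3(shifted, gt_box, mean_l_iou=0.5)
```

With these defaults the factor equals 1 at $\beta = \delta$ and peaks at $\beta = 1/\ln\alpha$, which is how boxes of medium quality end up with the largest gradient gain.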
Experiments and Results
3.1 Dataset and Implementation Details
We evaluate our method on the VisDrone2019 dataset, which contains 8,629 images covering 10 common object categories (pedestrian, car, bicycle, etc.) captured by UAVs under various weather and lighting conditions. All experiments are conducted on an RTX 4060 GPU with PyTorch 2.0 and CUDA 11.8. Training parameters: input resolution 640×640, 300 epochs, batch size 8, SGD optimizer with initial learning rate 0.01. We report Precision (P), Recall (R), mean Average Precision at IoU=0.5 (mAP50) and averaged over IoU=0.5:0.95 (mAP50-95), model parameters (Params), GFLOPs, and FPS.
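For readers who want to reproduce a comparable baseline run, the stated hyperparameters map onto a training call as sketched below. This assumes the Ultralytics package is used, which the text does not state. The keyword names (`data`, `imgsz`, `epochs`, `batch`, `optimizer`, `lr0`) follow the Ultralytics `train()` settings, and `VisDrone.yaml` and `yolo11n.yaml` are the configuration names shipped with that package. The helper `train_baseline` is hypothetical, and any setting not listed in the text is left at the library default.

```python
# Hyperparameters taken from the text; everything else uses library defaults.
TRAIN_ARGS = {
    "data": "VisDrone.yaml",  # assumption: dataset config bundled with Ultralytics
    "imgsz": 640,             # input resolution 640x640
    "epochs": 300,
    "batch": 8,
    "optimizer": "SGD",
    "lr0": 0.01,              # initial learning rate
}


def train_baseline(model_cfg="yolo11n.yaml"):
    """Train the unmodified YOLOv11n baseline with the settings above."""
    # Imported lazily so the settings can be inspected without the package installed.
    from ultralytics import YOLO

    model = YOLO(model_cfg)
    return model.train(**TRAIN_ARGS)
```

Calling `train_baseline()` starts a full training run, so the function is defined here but not invoked.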
3.2 Ablation Study
We conduct systematic ablation experiments with YOLOv11n as the baseline. Table 1 summarizes the results.
| Baseline | DyHead | CAFM | WIoUv3 | P (%) | R (%) | mAP50 (%) | mAP50-95 (%) | Params (M) | GFLOPs | FPS |
|---|---|---|---|---|---|---|---|---|---|---|
| ✓ | | | | 45.8 | 34.6 | 35.0 | 20.5 | 25.8 | 6.3 | 155.3 |
| ✓ | ✓ | | | 48.7 | 36.0 | 37.1 | 21.8 | 31.0 | 7.5 | 94.9 |
| ✓ | | ✓ | | 47.0 | 34.6 | 35.2 | 20.5 | 35.0 | 10.7 | 90.2 |
| ✓ | | | ✓ | 46.3 | 35.3 | 35.7 | 20.6 | 25.8 | 6.3 | 153.8 |
| ✓ | ✓ | ✓ | ✓ | 50.3 | 37.4 | 38.4 | 22.5 | 39.9 | 11.5 | 64.2 |
From Table 1, each component improves mAP50 over the baseline. DyHead alone raises mAP50 by 2.1 percentage points (from 35.0% to 37.1%), CAFM by 0.2 points, and WIoUv3 by 0.7 points; WIoUv3 does so without changing Params or GFLOPs, since it only alters the training loss. When combined, the full model achieves 38.4% mAP50 and 22.5% mAP50-95, improvements of 3.4 and 2.0 percentage points over the baseline, respectively. Inference speed drops from 155.3 to 64.2 FPS due to increased model complexity (39.9M parameters and 11.5 GFLOPs), which is still real-time on the test GPU.
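The gains quoted above follow directly from the mAP columns of Table 1. The short script below recomputes them as absolute differences in percentage points and also gives the relative gain of the full model. The dictionary keys and function names are illustrative labels, not names from the original work.

```python
# (mAP50, mAP50-95) in percent, copied from Table 1.
ABLATION = {
    "baseline": (35.0, 20.5),
    "dyhead": (37.1, 21.8),
    "cafm": (35.2, 20.5),
    "wiouv3": (35.7, 20.6),
    "full": (38.4, 22.5),
}


def gain_points(name):
    """Absolute gain over the baseline, in percentage points."""
    base50, base5095 = ABLATION["baseline"]
    map50, map5095 = ABLATION[name]
    return round(map50 - base50, 1), round(map5095 - base5095, 1)


def relative_gain_map50(name):
    """Relative mAP50 gain over the baseline, in percent."""
    base50 = ABLATION["baseline"][0]
    return round(100.0 * (ABLATION[name][0] - base50) / base50, 1)


for variant in ("dyhead", "cafm", "wiouv3", "full"):
    print(variant, gain_points(variant), relative_gain_map50(variant))
```

Reporting absolute points alongside the relative figure avoids the ambiguity of a bare "%" when the metric itself is a percentage.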
3.3 Comparison with State-of-the-Art Detectors
We compare our improved model against YOLOv5n, YOLOv6n, YOLOv8n, YOLOv10n, and the original YOLOv11n on the same VisDrone2019 dataset. Results are in Table 2.
| Method | P (%) | R (%) | mAP50 (%) | mAP50-95 (%) | Params (M) | GFLOPs | FPS |
|---|---|---|---|---|---|---|---|
| YOLOv5n | 43.2 | 32.5 | 32.3 | 18.6 | 25.0 | 7.1 | 189.9 |
| YOLOv6n | 41.8 | 30.6 | 30.3 | 17.7 | 42.3 | 11.8 | 197.7 |
| YOLOv8n | 44.4 | 33.4 | 33.5 | 19.4 | 30.0 | 8.1 | 189.8 |
| YOLOv10n | 44.9 | 33.1 | 33.3 | 19.5 | 26.9 | 8.2 | 160.5 |
| YOLOv11n | 45.8 | 34.6 | 35.0 | 20.5 | 25.8 | 6.3 | 155.3 |
| Ours | 50.3 | 37.4 | 38.4 | 22.5 | 39.9 | 11.5 | 64.2 |
Our model outperforms all compared detectors in precision, recall, and both mAP metrics. Relative to YOLOv11n, precision is 4.5 percentage points higher and recall 2.8 points higher, with mAP50 and mAP50-95 improvements of 3.4 and 2.0 points, respectively. Although parameter count and inference time increase, the measured 64.2 FPS on the RTX 4060 still meets the real-time requirements of many UAV deployment scenarios.
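To make the speed trade-off concrete, the FPS column of Table 2 can be converted to per-frame latency and checked against a frame-rate target. The 30 FPS target below is an assumed example of a camera frame rate, not a requirement stated in the text, and the function names are hypothetical.

```python
# FPS values copied from Table 2 (measured on the RTX 4060).
FPS = {
    "YOLOv5n": 189.9,
    "YOLOv6n": 197.7,
    "YOLOv8n": 189.8,
    "YOLOv10n": 160.5,
    "YOLOv11n": 155.3,
    "Ours": 64.2,
}


def latency_ms(fps):
    """Average time per frame in milliseconds."""
    return 1000.0 / fps


def meets_target(fps, target_fps=30.0):
    """True if the detector keeps up with a stream at target_fps."""
    return fps >= target_fps


slowdown = FPS["YOLOv11n"] / FPS["Ours"]
for name, fps in FPS.items():
    print(f"{name}: {latency_ms(fps):.1f} ms/frame, meets 30 FPS: {meets_target(fps)}")
```

These figures describe a desktop GPU; throughput on an embedded onboard computer would need to be measured separately.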
3.4 Qualitative Evaluation
Visual inspection of the detection results indicates that our model reduces false negatives for small and occluded objects, such as pedestrians partially hidden by trees or vehicles in dense traffic. The model also remains robust under low-light and nighttime conditions, which are common in UAV operations. The improvements are particularly evident for categories such as "motor" and "bicycle", which are notoriously difficult to detect because of their small scale and frequent occlusion.
Discussion and Conclusion
In this work, we proposed an improved YOLOv11-based detector tailored for small object detection in UAV aerial images. By integrating the Convolutional Attention Fusion Module (CAFM) in the neck, the Dynamic Detection Head (DyHead), and the WIoUv3 loss function, our model gains 3.4 percentage points in mAP50 and 2.0 points in mAP50-95 over YOLOv11n while still running at 64.2 FPS. Experiments on the VisDrone2019 dataset validate the contribution of each component and show higher accuracy than the compared lightweight detectors. Future research directions include model compression techniques such as pruning and quantization to further reduce computational cost, enabling deployment on edge devices carried by UAVs. Additionally, exploring attention mechanisms specifically designed for ultra-small objects and integrating temporal information from video streams could further enhance robustness in dynamic scenes.
