An Improved YOLOv11-Based Detection Algorithm for UAV Aerial Images

This paper presents an improved YOLOv11 model designed to detect small objects in images captured by UAVs. Detecting small targets in aerial imagery is inherently difficult because of low pixel resolution, weak feature representation, frequent occlusion, and complex backgrounds. These issues often lead to high rates of missed detections and false positives in standard object detection frameworks.

Our work is motivated by the need for a more robust and accurate solution for UAVs in applications like traffic monitoring, agricultural inspection, and emergency response. We propose three key modifications to the baseline YOLOv11n model: the integration of a Convolution and Attention Fusion Module (CAFM) in the neck, the replacement of the standard detection head with a Dynamic Head (DyHead), and the adoption of WIoUv3 as the bounding box regression loss function. Together, these changes enhance the model’s ability to focus on small, occluded objects while maintaining real-time performance suitable for UAV deployment.

The effectiveness of our method is validated through extensive experiments on the VisDrone2019 dataset, which contains a wide variety of small objects such as pedestrians, vehicles, and bicycles. The results demonstrate that our improved model significantly outperforms the baseline and other state-of-the-art YOLO variants in terms of precision, recall, and mean average precision (mAP).

Introduction

The proliferation of UAVs has transformed many fields, including crop monitoring, traffic management, and disaster response. However, the unique perspective of aerial images presents significant challenges for standard object detection algorithms. Objects of interest often occupy only a few pixels in the image, are easily occluded by other objects or environmental structures, and are subject to varying illumination and scale. While the YOLO (You Only Look Once) family of detectors is known for its balance of speed and accuracy, it struggles with these small targets in the complex scenes captured by UAVs.

Target detection algorithms are broadly classified into two-stage and one-stage methods. Two-stage methods, such as R-CNN and Faster R-CNN, first generate region proposals and then classify them, achieving high accuracy at the cost of speed. One-stage methods like SSD and the YOLO series directly predict class probabilities and bounding box coordinates, offering superior real-time performance. However, for UAV aerial images, even one-stage methods require careful tailoring to handle the challenges of small object detection.

Recent research has focused on adapting YOLO models for this task. For instance, some works have incorporated deformable convolutions or attention mechanisms to improve feature extraction for UAV imagery. Others have redesigned the detection head or loss function to better handle scale variations and occlusions. Our work builds upon these ideas by combining three components: a feature fusion module that captures both local and global contexts, a dynamic detection head that adapts to scale and spatial information, and a loss function that focuses training on medium-quality anchor boxes.

YOLOv11 Overview

The YOLOv11 model builds upon the successes of its predecessors. It begins with a backbone network that uses the C3k2 module, an improvement over the C2f module in YOLOv8, to enhance feature extraction efficiency. The C2PSA module integrates multi-head self-attention with a feed-forward network to capture long-range dependencies. The neck of the network uses a Path Aggregation Network (PAN-FPN) structure to fuse multi-scale features, and the detection head employs depthwise separable convolutions (DWConv) to reduce parameter count and computational cost.
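To make the parameter saving from depthwise separable convolutions concrete, the following sketch compares weight counts for a standard convolution and a depthwise separable one. The channel widths (64 in, 64 out) and the 3×3 kernel are illustrative assumptions, not values taken from YOLOv11, and biases are ignored.

```python
def standard_conv_params(c_in: int, c_out: int, k: int) -> int:
    """Weights in a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in: int, c_out: int, k: int) -> int:
    """Weights in a k x k depthwise conv followed by a 1 x 1 pointwise conv."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 conv that mixes channels
    return depthwise + pointwise


# Illustrative sizes only (assumed, not taken from the paper).
c_in, c_out, k = 64, 64, 3
std = standard_conv_params(c_in, c_out, k)
dws = depthwise_separable_params(c_in, c_out, k)
print(f"standard: {std}, depthwise separable: {dws}")
```

For these assumed sizes the depthwise separable layer needs 4,672 weights against 36,864 for the standard layer, which is the kind of reduction the detection head relies on.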

The standard YOLOv11n, while efficient, is not optimized for the dense, small objects typical of UAV imagery. Its loss function and feature fusion mechanisms are insufficient for distinguishing closely packed objects or those with weak features. This forms the basis for our proposed improvements.

Proposed Methodology

To enhance YOLOv11n for object detection in UAV aerial images, we introduce three key modifications: the CAFM module, the DyHead, and the WIoUv3 loss function.

Dynamic Detection Head (DyHead)

The standard FPN-based detection head in YOLOv11 has limited cross-layer interaction, making it difficult to handle dense small targets. DyHead unifies scale-aware, space-aware, and task-aware attention mechanisms. It dynamically adjusts feature weights, focusing on the most relevant information for both classification and regression. The scale-aware attention adjusts feature weights for objects of different sizes. The space-aware attention highlights critical spatial regions while suppressing background noise. The task-aware attention modulates feature channels based on the specific needs of classification and regression tasks, alleviating task conflicts. The DyHead computation is defined as:

$$ W(F) = \pi_C(\pi_S(\pi_L(F) \times F) \times F) \times F $$

where $F$ is the multi-level feature tensor and $\pi_L$, $\pi_S$, and $\pi_C$ represent the scale-aware, space-aware, and task-aware attention functions, respectively.
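As an illustration of the innermost term, $\pi_L(F) \times F$, the sketch below applies a scale-aware gate to a toy feature tensor indexed as level × position × channel. It assumes, following the usual description of DyHead, that the gate is a hard sigmoid of each level's mean activation. The learned 1×1 convolution that normally precedes the hard sigmoid is replaced by the identity here, and the names `hard_sigmoid` and `scale_aware_attention` are hypothetical. The space-aware and task-aware steps are omitted.

```python
from typing import List

Tensor3 = List[List[List[float]]]  # indexed as [level][position][channel]


def hard_sigmoid(x: float) -> float:
    """Piecewise-linear sigmoid: max(0, min(1, (x + 1) / 2))."""
    return max(0.0, min(1.0, (x + 1.0) / 2.0))


def scale_aware_attention(feats: Tensor3) -> Tensor3:
    """Scale each pyramid level by a gate computed from its mean activation.

    Simplification (assumed): the learned linear map before the hard sigmoid
    is taken to be the identity.
    """
    out = []
    for level in feats:
        values = [v for position in level for v in position]
        gate = hard_sigmoid(sum(values) / len(values))
        out.append([[gate * v for v in position] for position in level])
    return out


# Two pyramid levels, two positions, two channels (toy values).
feats = [
    [[0.5, 0.5], [0.5, 0.5]],      # mean 0.5  -> gate 0.75
    [[-1.0, -1.0], [-1.0, -1.0]],  # mean -1.0 -> gate 0.0
]
gated = scale_aware_attention(feats)
print(gated)
```

In this toy input the first level is kept at 75% strength while the second is suppressed entirely, which is the per-level reweighting that the scale-aware step is meant to provide.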

Convolution and Attention Fusion Module (CAFM)

Small targets suffer from weak semantic information and are easily confused with the background. We introduce CAFM in the neck to fuse local details captured by CNNs with long-range dependencies modeled by Transformers. This module has two parallel branches:

Local Branch: Uses a 1×1 convolution for channel reduction, followed by channel shuffle and 3×3 depthwise separable convolution to extract local spatial features. This is formulated as:

$$ F_{conv} = W_{3\times 3\times 3}(CS(W_{1\times 1}(Y))) $$

Where $F_{conv}$ is the local branch output, $W_{1\times 1}$ and $W_{3\times 3\times 3}$ are convolutions, $Y$ is the input, and $CS$ is channel shuffle.

Global Branch: Uses 1×1 and 3×3 depthwise separable convolutions to generate Query (Q), Key (K), and Value (V) tensors. A softmax function is used to compute the attention map. The output is:

$$ F_{att} = W_{1\times 1}\,\mathrm{Attention}(\hat{Q}, \hat{K}, \hat{V}) + Y $$

where $\mathrm{Attention}(\hat{Q}, \hat{K}, \hat{V}) = \hat{V} \cdot \mathrm{Softmax}(\hat{Q}\hat{K}^T / \alpha)$, and $\alpha$ is a learnable scaling parameter.

The final output of the CAFM module is the element-wise sum of the two branches: $F_{out} = F_{att} + F_{conv}$. This fusion allows the model to exploit fine-grained local features and global context at the same time, improving detection of small objects in UAV images.
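The attention term of the global branch can be sketched in plain Python as follows. This is a minimal illustration rather than the module itself: it assumes $\hat{Q}$ and $\hat{K}$ are reshaped to (channels, positions) so that the attention map is channels × channels, that $\hat{V}$ is (positions, channels), and that the softmax is taken over each row. The 1×1 output projection, the residual $Y$, and the local branch are left out, and $\alpha$ is a fixed number here rather than a learned parameter.

```python
import math
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (m x n) and b (n x p)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a: Matrix) -> Matrix:
    return [list(col) for col in zip(*a)]


def softmax_rows(a: Matrix) -> Matrix:
    """Numerically stable softmax applied to each row."""
    out = []
    for row in a:
        m = max(row)
        exps = [math.exp(v - m) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def attention(q: Matrix, k: Matrix, v: Matrix, alpha: float) -> Matrix:
    """Compute V . Softmax(Q K^T / alpha).

    Assumed shapes: q and k are (C, N), v is (N, C); the result is (N, C).
    """
    scores = [[s / alpha for s in row] for row in matmul(q, transpose(k))]
    return matmul(v, softmax_rows(scores))


# Toy example: C = 2 channels, N = 3 spatial positions.
q = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
k = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
v = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
out = attention(q, k, v, alpha=1.0)
print(out)
```

Because the attention map is only channels × channels, its cost does not grow quadratically with the number of spatial positions, which matters for high-resolution aerial images.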

WIoUv3 Loss Function

YOLOv11 uses CIoU loss, which is less reliable for small targets and occluded scenes, where low-quality samples are common. WIoUv3 introduces a non-monotonic focusing mechanism that dynamically adjusts gradient weights based on the outlierness of anchor boxes, suppressing harmful gradients from low-quality samples. The loss function is built upon WIoUv1:

$$ L_{WIoUv1} = R_{WIoU} \cdot L_{IoU} $$

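where $R_{WIoU} = \exp\left(\frac{(x - x_{gt})^2 + (y - y_{gt})^2}{W_g^2 + H_g^2}\right)$ and $L_{IoU} = 1 - IoU$. Here $(x, y)$ and $(x_{gt}, y_{gt})$ are the centers of the predicted and ground-truth boxes, and $W_g$ and $H_g$ are the width and height of the smallest box enclosing both.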

WIoUv3 further introduces a focusing factor $r$ based on an outlier degree $\beta$:

$$ L_{WIoUv3} = r \cdot L_{WIoUv1} $$

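where $r = \frac{\beta}{\delta \alpha^{\beta - \delta}}$ and $\beta = \frac{L_{IoU}^*}{\overline{L_{IoU}}}$. Here $\alpha$ and $\delta$ are hyperparameters (this $\alpha$ is unrelated to the scaling parameter in CAFM), $L_{IoU}^*$ is the IoU loss treated as a constant during backpropagation, and $\overline{L_{IoU}}$ is its running mean. Because $r$ is non-monotonic in $\beta$, the model focuses training on medium-quality samples, improving localization robustness for small targets in UAV datasets.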
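The two loss definitions above can be worked through for a single box pair. The sketch below is a scalar illustration under stated assumptions: boxes are given as corner coordinates, the running mean of $L_{IoU}$ is passed in as a plain number, and the values $\alpha = 1.9$ and $\delta = 3$ are assumed defaults, since the text does not state which values were used. In real training the enclosing-box term and $\beta$ would be detached from the gradient.

```python
import math
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def wiou_v1(pred: Box, gt: Box) -> Tuple[float, float]:
    """Return (L_WIoUv1, L_IoU) for one predicted / ground-truth pair."""
    l_iou = 1.0 - iou(pred, gt)
    # Box centers.
    x, y = (pred[0] + pred[2]) / 2, (pred[1] + pred[3]) / 2
    x_gt, y_gt = (gt[0] + gt[2]) / 2, (gt[1] + gt[3]) / 2
    # Width and height of the smallest box enclosing both.
    w_g = max(pred[2], gt[2]) - min(pred[0], gt[0])
    h_g = max(pred[3], gt[3]) - min(pred[1], gt[1])
    r_wiou = math.exp(((x - x_gt) ** 2 + (y - y_gt) ** 2) / (w_g ** 2 + h_g ** 2))
    return r_wiou * l_iou, l_iou


def wiou_v3(pred: Box, gt: Box, mean_l_iou: float,
            alpha: float = 1.9, delta: float = 3.0) -> float:
    """L_WIoUv3 = r * L_WIoUv1 with r = beta / (delta * alpha ** (beta - delta)).

    alpha and delta are assumed values, not taken from the paper.
    """
    l_v1, l_iou = wiou_v1(pred, gt)
    beta = l_iou / mean_l_iou  # outlier degree
    r = beta / (delta * alpha ** (beta - delta))
    return r * l_v1


pred, gt = (0.0, 0.0, 2.0, 2.0), (1.0, 0.0, 3.0, 2.0)
l_v1, l_iou = wiou_v1(pred, gt)
print(l_iou, l_v1, wiou_v3(pred, gt, mean_l_iou=0.5))
```

Note that $r = 1$ when $\beta = \delta$, so at that outlier degree WIoUv3 reduces to WIoUv1. For much larger $\beta$ the factor shrinks again, which is how gradients from very poor boxes are damped.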

Experimental Results and Analysis

We evaluated our method on the VisDrone2019 dataset, which contains 8,629 images across 10 classes. The experimental environment was Windows 11 with an RTX 4060 GPU, PyTorch 2.0, and CUDA 11.8. We used an input size of 640×640, 300 epochs, a batch size of 8, and the SGD optimizer with an initial learning rate of 0.01. The evaluation metrics included Precision (P), Recall (R), mAP50, mAP50-95, Parameters (M), GFLOPs, and FPS.
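For readers who want to reproduce the baseline settings, the hyperparameters above can be collected as follows. The text does not name the training framework, so this sketch assumes the Ultralytics Python package. The weight file `yolo11n.pt` and the dataset file `VisDrone.yaml` are the names that package uses, and the function `train_baseline` is hypothetical. The sketch covers only the unmodified YOLOv11n baseline, not the proposed CAFM, DyHead, or WIoUv3 changes.

```python
# Hyperparameters as reported in the text. Argument names follow the
# Ultralytics training API (an assumption; the source names no framework).
TRAIN_ARGS = {
    "data": "VisDrone.yaml",  # dataset config name used by Ultralytics
    "imgsz": 640,             # 640 x 640 input
    "epochs": 300,
    "batch": 8,
    "optimizer": "SGD",
    "lr0": 0.01,              # initial learning rate
}


def train_baseline(weights: str = "yolo11n.pt"):
    """Train the unmodified YOLOv11n baseline (requires the ultralytics package)."""
    # Imported lazily so the settings can be inspected without the dependency.
    from ultralytics import YOLO

    model = YOLO(weights)
    return model.train(**TRAIN_ARGS)


print(TRAIN_ARGS)
```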

Ablation Study

We conducted a systematic ablation study to assess the contribution of each proposed component, using YOLOv11n as the baseline. The results are summarized in the table below:

Baseline	DyHead	CAFM	WIoUv3	P (%)	R (%)	mAP50 (%)	mAP50-95 (%)	Param (M)	GFLOPs	FPS
√				45.8	34.6	35.0	20.5	25.8	6.3	155.3
√	√			48.7	36.0	37.1	21.8	31.0	7.5	94.9
√		√		47.0	34.6	35.2	20.5	35.0	10.7	90.2
√			√	46.3	35.3	35.7	20.6	25.8	6.3	153.8
√	√	√	√	50.3	37.4	38.4	22.5	39.9	11.5	64.2

As shown in the ablation table, each component individually improves mAP50 over the baseline: DyHead by 2.1 percentage points, CAFM by 0.2, and WIoUv3 by 0.7. Combined, they yield gains of 3.4 points in mAP50 and 2.0 points in mAP50-95, slightly more than the 3.0-point sum of the individual mAP50 gains. The accuracy comes at a cost: parameters rise from 25.8 to 39.9, GFLOPs from 6.3 to 11.5, and FPS drops from 155.3 to 64.2. That frame rate is still above the roughly 30 FPS often treated as a real-time threshold, so the model remains usable for many UAV applications.
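The gains quoted from the ablation table can be recomputed directly from its rows. The short script below copies the mAP50, mAP50-95, and FPS values from the table; the dictionary layout and the helper name `gain` are illustrative choices.

```python
# Values copied from the ablation table: (mAP50, mAP50-95, FPS).
ablation = {
    "baseline": (35.0, 20.5, 155.3),
    "DyHead": (37.1, 21.8, 94.9),
    "CAFM": (35.2, 20.5, 90.2),
    "WIoUv3": (35.7, 20.6, 153.8),
    "all": (38.4, 22.5, 64.2),
}


def gain(name: str, column: int) -> float:
    """Difference to the baseline in percentage points, rounded to one decimal."""
    return round(ablation[name][column] - ablation["baseline"][column], 1)


individual = {name: gain(name, 0) for name in ("DyHead", "CAFM", "WIoUv3")}
sum_individual = round(sum(individual.values()), 1)
combined_map50 = gain("all", 0)
combined_map5095 = gain("all", 1)

print(individual, sum_individual, combined_map50, combined_map5095)
```

The combined mAP50 gain (3.4 points) exceeds the sum of the individual gains (3.0 points) by 0.4 points, which is the basis for describing the components as complementary.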

Comparison with Other YOLO Variants

We compared our improved YOLOv11n model (denoted Ours) with other YOLO versions. The results are presented in the following table:

Method	P (%)	R (%)	mAP50 (%)	mAP50-95 (%)	Param (M)	GFLOPs	FPS
YOLOv5n	43.2	32.5	32.3	18.6	25.0	7.1	189.9
YOLOv6n	41.8	30.6	30.3	17.7	42.3	11.8	197.7
YOLOv8n	44.4	33.4	33.5	19.4	30.0	8.1	189.8
YOLOv10n	44.9	33.1	33.3	19.5	26.9	8.2	160.5
YOLOv11n	45.8	34.6	35.0	20.5	25.8	6.3	155.3
Ours	50.3	37.4	38.4	22.5	39.9	11.5	64.2

The comparison table shows that our model achieves the highest precision, recall, mAP50, and mAP50-95 among all evaluated models. It outperforms the baseline YOLOv11n by 4.5 percentage points in precision and 2.8 in recall. It also has the second-largest parameter count and the lowest FPS of the models compared, so it is best suited to precision-critical UAV tasks where the accuracy gain outweighs the speed cost.
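The ranking claims can be checked against the comparison table with a few lines of Python. The values are copied from the table; the column order and the helper name `best` are illustrative choices.

```python
COLUMNS = ("P", "R", "mAP50", "mAP50-95", "Param", "GFLOPs", "FPS")

# Rows copied from the comparison table, in COLUMNS order.
results = {
    "YOLOv5n": (43.2, 32.5, 32.3, 18.6, 25.0, 7.1, 189.9),
    "YOLOv6n": (41.8, 30.6, 30.3, 17.7, 42.3, 11.8, 197.7),
    "YOLOv8n": (44.4, 33.4, 33.5, 19.4, 30.0, 8.1, 189.8),
    "YOLOv10n": (44.9, 33.1, 33.3, 19.5, 26.9, 8.2, 160.5),
    "YOLOv11n": (45.8, 34.6, 35.0, 20.5, 25.8, 6.3, 155.3),
    "Ours": (50.3, 37.4, 38.4, 22.5, 39.9, 11.5, 64.2),
}


def best(metric: str) -> str:
    """Name of the method with the highest value for the given column."""
    i = COLUMNS.index(metric)
    return max(results, key=lambda name: results[name][i])


accuracy_leaders = {m: best(m) for m in ("P", "R", "mAP50", "mAP50-95")}
fastest = best("FPS")
# Fraction of the baseline frame rate that the improved model retains.
fps_retained = round(results["Ours"][6] / results["YOLOv11n"][6], 2)

print(accuracy_leaders, fastest, fps_retained)
```

The improved model leads on all four accuracy metrics while retaining about 41% of the baseline frame rate, which quantifies the accuracy and speed trade-off discussed above.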

Conclusion

In this paper, we have proposed an improved YOLOv11n model for detecting small objects in UAV aerial images. Our method introduces three key enhancements: the CAFM module for fusing local and global features, the DyHead for dynamic scale and spatial attention, and the WIoUv3 loss function for robust bounding box regression. Extensive experiments on the challenging VisDrone2019 dataset confirm that our model outperforms existing YOLO variants in accuracy, improving on the baseline by 3.4 percentage points in mAP50 and 2.0 in mAP50-95. The visual results further demonstrate the model’s superior performance in handling occlusion and complex backgrounds, making it an effective solution for real-world UAV applications.

Future work will focus on optimizing the model for better real-time performance on embedded systems and extending the framework to handle other challenging detection scenarios often encountered by UAVs.