Enhancing Small Object Detection in UAV Aerial Imagery via Improved YOLOv11

The rapid proliferation of Unmanned Aerial Vehicles (UAVs), commonly known as drones, has revolutionized data acquisition across numerous fields. From disaster assessment and traffic monitoring to agricultural surveys and security patrols, UAVs provide a flexible, efficient, and safe platform for capturing high-resolution aerial imagery. However, the aerial viewpoint raises a significant technical challenge: the reliable detection of small objects. Targets such as pedestrians, vehicles, or infrastructure components often occupy only a few dozen pixels in an image captured from high altitude. This small size, combined with complex backgrounds, dense distributions, and occlusion, severely degrades the performance of standard object detection models, producing high rates of missed and false detections.

To tackle small object detection in UAV imagery, this work proposes a comprehensive enhancement of the YOLOv11s model. The YOLO (You Only Look Once) family of single-stage detectors offers an excellent balance between speed and accuracy, making it suitable for the real-time applications often required in UAV operations. The proposed improvements are threefold: the integration of a Coordinate Attention (CA) mechanism to sharpen spatial localization, the replacement of the original neck with the ASFYOLO (Attentional Scale Sequence Fusion YOLO) structure for stronger multi-scale feature fusion, and the augmentation of the core building block with Receptive Field Attention Convolution (RFAConv) to enhance detail extraction for blurred or occluded small objects. Experiments on the challenging VisDrone2019 benchmark show that the proposed model achieves a marked improvement in detection accuracy over the baseline and the other detectors compared, raising mAP@0.5 from 35.2% to 41.7%.

Architectural Overview of the Baseline YOLOv11 Model

The YOLOv11 architecture follows the canonical design of modern single-stage detectors, comprising four main components: Input, Backbone, Neck, and Head. The Backbone, typically a deep CNN like CSPDarknet, extracts hierarchical feature maps from the input image. These features are then processed and fused by the Neck module (often a Path Aggregation Network, or PANet, variant) to build a rich feature pyramid containing both high-level semantic information and low-level spatial details. Finally, the Detection Head predicts bounding boxes and class probabilities at multiple scales.
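To make the multi-scale feature pyramid concrete, the following minimal sketch computes how many feature-map cells a small object covers at each detection scale. The strides (8, 16, 32), the level names P3 to P5, and the 640×640 input size are assumptions based on common YOLO defaults; the article does not state them.

```python
# Illustrative arithmetic only. The strides and the 640x640 input size are
# assumed (common YOLO defaults), not values given in this article.
STRIDES = {"P3": 8, "P4": 16, "P5": 32}


def grid_size(input_size: int, stride: int) -> int:
    """Side length of the square feature map produced at a given stride."""
    return input_size // stride


def cells_covered(object_px: float, stride: int) -> float:
    """Number of feature-map cells spanned by an object `object_px` pixels wide."""
    return object_px / stride


if __name__ == "__main__":
    for level, stride in STRIDES.items():
        print(level, grid_size(640, stride), round(cells_covered(12, stride), 3))
```

Under these assumptions, a 12-pixel-wide target spans 1.5 cells at stride 8 but less than half a cell at stride 32, which is why the finest pyramid level matters most for small objects.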

The YOLOv11s model, the “small” variant in the family, was chosen as our baseline for its favorable trade-off between computational cost and performance, which is crucial for deployment on resource-limited systems, a common constraint in UAV applications. Its streamlined architecture provides a solid foundation for our targeted enhancements, which address its weaknesses on small objects.

Proposed Enhancements for UAV-Based Small Object Detection

The standard YOLOv11s model, while efficient, lacks specialized mechanisms for the challenges posed by UAV imagery. Our improvements are designed to add targeted capabilities for spatial awareness, multi-scale understanding, and fine-grained feature extraction.

1. Integration of Coordinate Attention (CA) Mechanism

Precise spatial localization is paramount for detecting minuscule objects that may be only a handful of pixels wide. Standard convolutions progressively lose fine-grained positional information through successive down-sampling operations, and channel-only attention mechanisms like SE (Squeeze-and-Excitation) discard it entirely by collapsing each channel to a single value through global pooling. To mitigate this, we incorporate the Coordinate Attention (CA) block at two points: at the end of the Backbone and within the enhanced Neck network. The CA mechanism embeds positional information by decomposing global pooling into two one-dimensional feature encoding operations, one along the horizontal direction and one along the vertical direction.

Given an intermediate feature map X with channel count C, height H, and width W, CA performs the following operations:

Coordinate Information Embedding: It uses one-dimensional average pooling to aggregate features along each spatial direction. For the c-th channel, the output for height h and width w is:
$$z_c^h(h) = \frac{1}{W} \sum_{0 \leq i < W} x_c(h, i)$$
$$z_c^w(w) = \frac{1}{H} \sum_{0 \leq j < H} x_c(j, w)$$

Coordinate Attention Generation: The embeddings z^h and z^w are concatenated along the spatial dimension and transformed via a shared 1×1 convolution F_1, batch normalization, and a non-linear activation δ (like ReLU or SiLU). The resulting tensor u is split back into two separate spatial descriptors u^h and u^w, which are processed by individual 1×1 convolutions (f_h and f_w) and Sigmoid functions σ to generate the final attention weights g^h and g^w:
$$u = \delta(F_1([z^h, z^w])), \quad [u^h, u^w] = \text{split}(u)$$
$$g^h = \sigma(f_h(u^h))$$
$$g^w = \sigma(f_w(u^w))$$

Re-weighting: The output feature Y is computed by applying the attention weights multiplicatively:
$$y_c(i, j) = x_c(i, j) \times g_c^h(i) \times g_c^w(j)$$

This process allows the network to focus on regions likely to contain small targets while suppressing irrelevant background clutter, a critical ability for analyzing complex scenes captured by UAVs.
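The three CA steps above (embedding, attention generation, re-weighting) can be sketched in plain Python on nested lists. This is a minimal illustration rather than the implementation used in the experiments: batch normalization is omitted, ReLU is chosen as the activation, and the helper names `conv1x1` and `coordinate_attention` as well as the weight-matrix arguments are hypothetical.

```python
import math


def conv1x1(weights, feats):
    """1x1 convolution: mixes channels independently at each position.

    weights: [C_out][C_in], feats: [C_in][L]  ->  [C_out][L]
    """
    length = len(feats[0])
    return [[sum(w * feats[c][p] for c, w in enumerate(row))
             for p in range(length)] for row in weights]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def coordinate_attention(x, w_shared, w_h, w_w):
    """x: [C][H][W]. Returns the re-weighted feature map with the same shape.

    w_shared: [C_mid][C], w_h and w_w: [C][C_mid] (hypothetical weights).
    """
    C, H, W = len(x), len(x[0]), len(x[0][0])
    # Coordinate information embedding: 1-D average pooling per direction.
    z_h = [[sum(x[c][h]) / W for h in range(H)] for c in range(C)]
    z_w = [[sum(x[c][h][w] for h in range(H)) / H for w in range(W)]
           for c in range(C)]
    # Concatenate along the spatial axis, shared 1x1 conv, ReLU (BN omitted).
    cat = [z_h[c] + z_w[c] for c in range(C)]
    mid = [[max(0.0, v) for v in row] for row in conv1x1(w_shared, cat)]
    # Split back into the two directions, then per-direction conv + sigmoid.
    mid_h = [row[:H] for row in mid]
    mid_w = [row[H:] for row in mid]
    g_h = [[sigmoid(v) for v in row] for row in conv1x1(w_h, mid_h)]
    g_w = [[sigmoid(v) for v in row] for row in conv1x1(w_w, mid_w)]
    # Re-weighting: y_c(i, j) = x_c(i, j) * g_h_c(i) * g_w_c(j).
    return [[[x[c][i][j] * g_h[c][i] * g_w[c][j] for j in range(W)]
             for i in range(H)] for c in range(C)]
```

A quick sanity check: with all-zero `w_h` and `w_w`, both gates equal sigmoid(0) = 0.5, so every output value is the input scaled by 0.25.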

2. Advanced Neck Design with ASFYOLO Framework

The scale variation of objects in UAV imagery is extreme, ranging from large structures to tiny, distant specks. The original PANet-style neck in YOLOv11 may not adequately integrate features across such a wide scale range. We replace it entirely with the ASFYOLO (Attentional Scale Sequence Fusion YOLO) neck, a framework explicitly designed for robust multi-scale detection, particularly for small objects.

The ASFYOLO framework introduces several key innovations. The Scale Sequence Feature Fusion (SSFF) module enhances the network’s capacity to extract and preserve information across different scales. The Triple Feature Encoder (TFE) module fuses feature maps from different levels to enrich detailed information. Crucially, a Channel and Position Attention Module (CPAM) integrates the detailed and multi-scale features from the SSFF and TFE paths. The CPAM jointly applies channel-wise and spatial attention, forcing the network to concentrate on informative channels and the precise spatial locations most relevant to small objects. This sophisticated fusion pathway ensures that weak but critical signals from small targets are amplified and effectively propagated to the detection heads.
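As a rough illustration of how a TFE-style module can bring three pyramid levels to a common resolution before fusion, the sketch below resamples single-channel maps and stacks them as channels. The specific resampling choices (a mix of max and average pooling for the larger map, nearest-neighbour upsampling for the smaller one) and all function names are assumptions made for this example; the article only states that TFE fuses feature maps from different levels.

```python
def downsample2x(fmap):
    """Halve the resolution of a 2-D map.

    Each output value is the mean of the 2x2 max and the 2x2 average,
    an assumed hybrid pooling used here only for illustration.
    """
    rows, cols = len(fmap), len(fmap[0])
    out = []
    for i in range(0, rows - 1, 2):
        row = []
        for j in range(0, cols - 1, 2):
            block = [fmap[i][j], fmap[i][j + 1],
                     fmap[i + 1][j], fmap[i + 1][j + 1]]
            row.append(0.5 * (max(block) + sum(block) / 4.0))
        out.append(row)
    return out


def upsample2x(fmap):
    """Nearest-neighbour upsampling by a factor of two."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def fuse_three_scales(large, medium, small):
    """Resample all maps to the medium resolution and stack them as channels."""
    fused = [downsample2x(large), medium, upsample2x(small)]
    shapes = {(len(m), len(m[0])) for m in fused}
    if len(shapes) != 1:
        raise ValueError("maps must differ by exactly a factor of two per level")
    return fused
```

In a full neck, the stacked channels would then pass through convolutions and the CPAM attention stage; that part is not sketched here.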

3. Enhanced Feature Extraction with RFAConv in C3K2 Blocks

Standard 3×3 convolutions apply a shared kernel across the entire spatial domain, which can be suboptimal for capturing the unique, often blurred or partial, patterns of a small object. To empower the network with adaptive, location-sensitive feature extraction, we modify the fundamental C3K2 module in the Backbone by integrating Receptive Field Attention Convolution (RFAConv).

RFAConv dynamically adjusts the convolution kernel’s spatial weights based on the local features within its receptive field. It generates a non-shared Receptive-field Attention (RFA) map for each sliding window position. For an input feature patch X_p within a kernel’s receptive field, the process can be summarized as:

Attention Generation: A lightweight sub-network (e.g., a small fully-connected layer) processes the flattened patch to produce an attention vector α_p:
$$\alpha_p = \phi(\text{Flatten}(X_p))$$
where φ denotes the attention generation function followed by a normalization like Softmax.

Attentive Convolution: The standard convolution weights W are modulated by this attention map. The output at a position is computed as:
$$Y(p) = \sum_{k} \alpha_p^{(k)} \, W^{(k)} \, X_p^{(k)}$$
where k indexes over the kernel’s spatial positions and channels.

By replacing the standard Conv layers within the Bottleneck of the C3K2 module with RFAConv, we create a C3K2-RFA block. This allows the network to weight the few most discriminative pixels of a small target more heavily, boosting its ability to discern fine details against cluttered UAV backgrounds.
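The attentive convolution described above can be sketched for a single channel and a 3×3 kernel. This is a simplified illustration under stated assumptions: φ is modeled as one linear layer followed by Softmax, padding is omitted, and the function names and the `attn_weights` argument are hypothetical. It is not the C3K2-RFA block itself.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def rfa_conv_single_channel(x, kernel, attn_weights):
    """Receptive-field-attention convolution, 3x3, stride 1, no padding.

    x: [H][W] input map.
    kernel: 9 shared convolution weights in row-major order.
    attn_weights: [9][9] linear layer standing in for phi (an assumption).
    """
    rows, cols = len(x), len(x[0])
    out = []
    for i in range(rows - 2):
        out_row = []
        for j in range(cols - 2):
            # Flatten the 3x3 patch X_p at this sliding-window position.
            patch = [x[i + di][j + dj] for di in range(3) for dj in range(3)]
            # Position-specific attention: alpha_p = softmax(phi(patch)).
            logits = [sum(w * p for w, p in zip(w_row, patch))
                      for w_row in attn_weights]
            alpha = softmax(logits)
            # Y(p) = sum_k alpha_p[k] * W[k] * X_p[k].
            out_row.append(sum(a * k * p
                               for a, k, p in zip(alpha, kernel, patch)))
        out.append(out_row)
    return out
```

With all-zero `attn_weights` the attention is uniform (1/9 per position), so the result reduces to a standard convolution scaled by 1/9; non-zero weights let each window emphasize different positions.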

Experimental Validation and Analysis

We evaluated our proposed model on the VisDrone2019 dataset, a large-scale benchmark collected by various UAVs under diverse conditions. It contains 10,209 static images annotated with ten common object categories (e.g., pedestrian, car, van) and is split into training (6,471), validation (548), test-dev (1,610), and test-challenge (1,580) sets. Performance is measured using standard metrics: mean Average Precision at IoU=0.5 (mAP@0.5), mAP averaged across IoU thresholds from 0.5 to 0.95 (mAP@0.5:0.95), Precision (P), Recall (R), and computational metrics such as parameter count and GFLOPs. The formulas for key metrics are:
$$\text{Precision } (P) = \frac{TP}{TP + FP}$$
$$\text{Recall } (R) = \frac{TP}{TP + FN}$$
$$AP = \int_{0}^{1} P(R) dR$$
$$mAP = \frac{1}{N} \sum_{i=1}^{N} AP_i$$
where TP, FP, and FN are True Positives, False Positives, and False Negatives, respectively, and N is the number of classes.
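The metric formulas above translate directly into code. The sketch below is illustrative: the article does not say how the integral in AP is approximated, so all-point interpolation over a monotone precision envelope is assumed here, and the function names are hypothetical.

```python
def precision(tp: int, fp: int) -> float:
    """P = TP / (TP + FP); defined as 0 when there are no detections."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    """R = TP / (TP + FN); defined as 0 when there are no ground truths."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def average_precision(recalls, precisions):
    """Area under the P(R) curve using all-point interpolation (assumed).

    `recalls` must be sorted in ascending order, with `precisions` aligned.
    """
    r = [0.0] + list(recalls) + [1.0]
    p = [0.0] + list(precisions) + [0.0]
    # Build the precision envelope: non-increasing as recall grows.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Sum rectangle areas between consecutive recall points.
    return sum((r[i + 1] - r[i]) * p[i + 1] for i in range(len(r) - 1))


def mean_average_precision(per_class_ap):
    """mAP = (1 / N) * sum of per-class AP values."""
    return sum(per_class_ap) / len(per_class_ap)
```

For example, a detector with precision 1.0 at recall 0.5 and precision 0.5 at recall 1.0 has an AP of 0.75 under this interpolation.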

Ablation Study

To validate the contribution of each component, we conducted an ablation study starting from the YOLOv11s baseline. The results are summarized in the table below.

Model Configuration	Params (M)	GFLOPs	mAP@0.5	mAP@0.5:0.95
YOLOv11s (Baseline)	9.4	21.3	0.352	0.213
+ CA Attention	10.8	26.5	0.371	0.225
+ ASFYOLO Neck	12.2	31.7	0.384	0.231
+ C3K2-RFAConv	10.8	26.9	0.401	0.239
Full Model (Ours)	12.4	32.5	0.417	0.241

The ablation study demonstrates the benefit of each modification when added to the baseline individually. Adding CA alone improves mAP@0.5 by 1.9 percentage points, supporting its role in better spatial localization. The ASFYOLO neck alone yields a larger gain of 3.2 points, highlighting the importance of advanced multi-scale fusion. The C3K2-RFAConv module shows the largest individual gain, 4.9 points in mAP@0.5, indicating the effectiveness of dynamic receptive field attention for detail extraction. Integrating all three components yields the best performance, with absolute increases of 6.5 and 2.8 points in mAP@0.5 and mAP@0.5:0.95 over the baseline, respectively. The combined gain is smaller than the sum of the individual gains (10.0 points in mAP@0.5), so the contributions partly overlap, but the full model still outperforms every single-module variant, which supports our design for UAV small object detection.
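The gains quoted for the ablation study can be recomputed from the table. The short script below copies the mAP values from the table and reports each configuration's improvement over the baseline in percentage points. It assumes each "+" row adds a single module to the baseline; the function name is hypothetical.

```python
# (mAP@0.5, mAP@0.5:0.95) copied from the ablation table.
ABLATION = {
    "YOLOv11s (Baseline)": (0.352, 0.213),
    "+ CA Attention": (0.371, 0.225),
    "+ ASFYOLO Neck": (0.384, 0.231),
    "+ C3K2-RFAConv": (0.401, 0.239),
    "Full Model (Ours)": (0.417, 0.241),
}


def gains_in_points(table, baseline="YOLOv11s (Baseline)"):
    """Absolute gain over the baseline, in percentage points, per configuration."""
    base_50, base_5095 = table[baseline]
    return {
        name: (round((m50 - base_50) * 100, 1),
               round((m5095 - base_5095) * 100, 1))
        for name, (m50, m5095) in table.items()
        if name != baseline
    }


if __name__ == "__main__":
    for name, (d50, d5095) in gains_in_points(ABLATION).items():
        print(f"{name}: +{d50} mAP@0.5, +{d5095} mAP@0.5:0.95")
```

Reporting these differences in percentage points avoids confusion with relative improvement: the full model's 6.5-point gain in mAP@0.5 corresponds to a relative improvement of roughly 18% over the baseline value of 0.352.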

Comparative Analysis with State-of-the-Art Methods

We compared our final model against a range of popular object detectors on the VisDrone2019 test set. The results, presented in the following table, underscore the superiority of our approach.

Algorithm	Params (M)	GFLOPs	mAP@0.5	mAP@0.5:0.95
Faster R-CNN	136.9	370.0	0.186	0.089
SSD	25.0	61.8	0.112	0.050
EfficientDet	3.8	4.8	0.122	0.071
CenterNet	32.7	70.2	0.210	0.104
YOLOv5s	7.0	15.8	0.324	0.174
YOLOv8s	11.1	28.5	0.313	0.181
YOLOv9s	7.2	26.7	0.323	0.188
YOLOv10s	7.2	21.4	0.308	0.177
YOLOv11s (Baseline)	9.4	21.3	0.352	0.213
Full Model (Ours)	12.4	32.5	0.417	0.241

Our model not only outperforms its direct baseline, YOLOv11s, by 6.5 points in mAP@0.5 but also surpasses other efficient YOLO variants (v5s, v8s, v9s, v10s) and classic detectors like Faster R-CNN and SSD. This indicates that the proposed enhancements address the core challenges specific to UAV imagery, giving the highest detection accuracy among the compared models, at the cost of more parameters and GFLOPs than the other YOLO "s" variants. A per-class analysis further reveals that our improvements are most pronounced for the smallest and most challenging categories like ‘pedestrian’ and ‘people’, where AP scores increased by over 20 percentage points compared to the baseline, which supports the model's suitability for practical UAV applications.

Conclusion and Future Work

In this work, we presented a targeted enhancement of the YOLOv11s model to address the challenge of small object detection in UAV aerial imagery. By integrating a Coordinate Attention mechanism for spatial sharpening, an ASFYOLO neck for robust multi-scale fusion, and RFAConv-augmented blocks for detail-aware feature extraction, we developed a model that improves detection accuracy for tiny, dense, and often occluded targets. Experiments on the demanding VisDrone2019 benchmark validate the effectiveness of each component and of their combination, achieving a mAP@0.5 of 41.7%, an absolute gain of 6.5 percentage points over the strong YOLOv11s baseline.

The proposed model offers a practical solution for real-world UAV vision tasks where accurate small object detection is critical. Future work will focus on reducing the computational footprint of the ASFYOLO and RFAConv modules to improve suitability for deployment on edge devices with stringent power and latency constraints, such as those found on board UAVs. Exploring knowledge distillation or neural architecture search could yield a more efficient variant without compromising the achieved accuracy gains.