In recent years, the rapid proliferation of China drone technology has transformed a wide array of civilian and industrial applications, including precision agriculture, urban traffic management, emergency rescue operations, and infrastructure inspection. Aerial images captured by China drone platforms have become indispensable for real-time monitoring and intelligent decision-making. However, detecting small objects in these aerial images remains difficult because of their low pixel occupancy, weak feature representation, frequent occlusions, and complex background interference. Although the YOLO series of object detectors balances speed and accuracy well, its performance degrades significantly on small object detection in China drone aerial imagery. To address these limitations, this study presents an enhancement of the YOLOv11 model tailored to China drone aerial image detection. The proposed framework integrates three components: the Convolutional Attention Fusion Module (CAFM) for enriched multi-scale feature representation, the Dynamic Detection Head (DyHead) for adaptive multi-dimensional attention, and the WIoUv3 loss function for robust bounding box regression. Experiments on the VisDrone2019 benchmark dataset show that the improved model raises precision from 45.8% to 50.3% and recall from 34.6% to 37.4% over the YOLOv11n baseline while running at 64.2 FPS, which makes it suitable for real-world China drone deployment scenarios.
Introduction and Motivation
The deployment of China drone systems has experienced explosive growth across diverse sectors. In agricultural monitoring, China drone platforms equipped with high-resolution cameras enable precise crop health assessment and pest detection. In traffic management, China drone aerial imagery facilitates real-time vehicle counting and congestion analysis. During emergency rescue missions, China drone operators rely on accurate object detection to locate survivors and assess disaster damage. For infrastructure inspection, China drone technology provides cost-effective and efficient inspection capabilities for power lines, bridges, and pipelines. Despite these promising applications, the accurate detection of small objects (such as pedestrians, vehicles, and traffic signs) in China drone aerial images remains a bottleneck that hinders full automation.
Small objects in aerial images typically occupy fewer than 32×32 pixels, resulting in extremely limited discriminative features. Furthermore, objects in China drone imagery often exhibit significant scale variations, arbitrary orientations, and dense spatial distributions. Occlusions caused by trees, buildings, or other objects further complicate the detection process. Conventional object detectors, including both two-stage methods like Faster R-CNN and one-stage methods like SSD, struggle to maintain satisfactory performance under these challenging conditions. The YOLO family of detectors, known for its excellent trade-off between inference speed and detection accuracy, has become the de facto standard for real-time applications. However, even the latest YOLOv11 model exhibits notable weaknesses when processing China drone aerial images, particularly in terms of missed detections and false positives for small and occluded targets.
To overcome these challenges, this work proposes a series of targeted improvements to the YOLOv11 architecture. First, we introduce the Convolutional Attention Fusion Module (CAFM) into the neck network to synergistically combine local convolutional features with global Transformer-based attention, thereby enhancing the model’s ability to capture weak small-object features. Second, we replace the standard detection head with the Dynamic Detection Head (DyHead), which employs a unified multi-dimensional attention mechanism to dynamically reweight features across scales, spatial locations, and tasks. Third, we adopt the WIoUv3 loss function, which utilizes a non-monotonic focusing mechanism to suppress harmful gradients from low-quality anchor boxes while focusing training on medium-quality samples. These innovations collectively yield significant improvements in detection performance for China drone aerial imagery.
Related Work and Background
Object Detection Paradigms
Object detection algorithms generally fall into two categories: two-stage detectors and one-stage detectors. Two-stage methods, such as R-CNN, Fast R-CNN, Faster R-CNN, and Mask R-CNN, first generate region proposals through a region proposal network and then perform classification and bounding box regression on these proposals. While two-stage detectors achieve high accuracy, their computational complexity often precludes real-time deployment on resource-constrained China drone platforms. One-stage detectors, including the SSD and YOLO families, directly predict object categories and locations from input images without explicit region proposal generation. These methods offer superior inference speed, making them ideal for real-time China drone applications, although they traditionally lag behind two-stage methods in accuracy for small object detection.
Evolution of the YOLO Series
The YOLO (You Only Look Once) series has undergone continuous evolution since its inception. YOLOv1 introduced the concept of treating object detection as a single regression problem. Subsequent versions, including YOLOv2, YOLOv3, YOLOv4, and YOLOv5, progressively improved accuracy and speed through architectural innovations such as batch normalization, anchor boxes, feature pyramid networks, and CSPNet backbones. YOLOv6 and YOLOv7 further optimized the network design for industrial applications. YOLOv8 introduced the C2f module and improved the neck structure. YOLOv9 and YOLOv10 pushed the boundaries of real-time detection with advanced training strategies and end-to-end optimization. YOLOv11, the latest iteration, replaces the C2f module with the C3k2 module for enhanced feature extraction, incorporates the C2PSA module with multi-head attention and feed-forward networks, and adopts depthwise separable convolutions (DWConv) in the detection head to reduce parameters and computational cost. However, despite these advancements, YOLOv11 still exhibits limitations when applied to China drone aerial image detection, particularly for small and occluded objects.
Small Object Detection in China Drone Imagery
Small object detection has emerged as a critical research focus within the computer vision community. In the context of China drone aerial imagery, small objects are characterized by low pixel counts, weak semantic information, and susceptibility to background clutter. Researchers have explored various strategies to address these challenges, including multi-scale feature fusion, attention mechanisms, data augmentation, and specialized loss functions. Several recent studies have proposed YOLO-based improvements for China drone detection tasks. For instance, deformable convolution has been introduced to adaptively sample features for small objects. Coordinate attention mechanisms have been employed to suppress background interference while enhancing feature extraction. Dynamic detection heads have been adopted to improve spatial and semantic understanding. Multi-scale convolutional attention and asymptotic feature pyramid networks have been combined to strengthen multi-scale fusion. Despite these efforts, there remains substantial room for improvement in both accuracy and robustness for China drone aerial image detection. Our work builds upon these foundations and introduces enhancements that specifically target the challenges posed by China drone imagery.
Methodology: Enhanced YOLOv11 Architecture
Baseline YOLOv11 Architecture
The standard YOLOv11 architecture comprises three main components: a backbone network for feature extraction, a neck network for multi-scale feature fusion, and a detection head for final prediction. The backbone utilizes the C3k2 module, which replaces the C2f module from YOLOv8, to improve feature extraction efficiency. The C2PSA module integrates multi-head self-attention with feed-forward networks to enhance non-linear feature representation. The neck network employs a feature pyramid network (FPN) combined with a path aggregation network (PAN) for multi-scale feature integration. The detection head incorporates depthwise separable convolutions to reduce parameter count and computational complexity while maintaining detection performance.
Despite its strong performance on general object detection benchmarks, YOLOv11 faces three major challenges when applied to China drone aerial image detection: (1) small objects with weak features are easily overwhelmed by background noise during feature extraction; (2) dense object distributions lead to frequent occlusions and missed detections; (3) the standard loss function is insufficiently sensitive to small object localization errors. The following subsections detail our proposed improvements to address these limitations.
Convolutional Attention Fusion Module (CAFM)
To enhance the model’s ability to capture weak small-object features while maintaining global contextual awareness, we introduce the Convolutional Attention Fusion Module (CAFM) into the neck network of YOLOv11. The CAFM module synergistically combines the strengths of convolutional neural networks (CNNs) for local detail extraction and Transformers for global dependency modeling. This dual-branch architecture enables the model to simultaneously capture fine-grained local patterns and long-range contextual relationships, which is particularly beneficial for detecting small and occluded objects in China drone aerial images.
The CAFM module consists of two parallel branches: a local branch and a global branch. In the local branch, we first apply a 1×1 convolution to adjust the channel dimension, reducing computational overhead while concentrating information. The compressed feature is then divided into several groups along the channel dimension, and channel shuffling is performed to enable cross-group information exchange. Subsequently, a 3×3 depthwise separable convolution is applied to extract local spatial features. The local branch output can be expressed as:
$$
F_{conv} = \mathcal{W}_{3\times3\times3} \left( \text{CS} \left( \mathcal{W}_{1\times1} (Y) \right) \right)
$$
where \(F_{conv}\) denotes the output of the local branch, \(\mathcal{W}_{1\times1}\) and \(\mathcal{W}_{3\times3\times3}\) represent 1×1 and 3×3×3 convolutions respectively, \(Y\) is the input feature, and \(\text{CS}\) indicates the channel shuffle operation.
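The channel shuffle step \(\text{CS}\) in the local branch can be illustrated with a small NumPy sketch. It assumes a channel-first `(C, H, W)` layout and a hypothetical group count of 2; neither is specified in the text, and the function name `channel_shuffle` is chosen here for illustration only.

```python
import numpy as np

def channel_shuffle(x: np.ndarray, groups: int) -> np.ndarray:
    """Interleave channels across groups for a (C, H, W) feature map."""
    c, h, w = x.shape
    if c % groups != 0:
        raise ValueError("channel count must be divisible by groups")
    # (C, H, W) -> (groups, C // groups, H, W)
    x = x.reshape(groups, c // groups, h, w)
    # Swap the group axis with the per-group channel axis, then flatten back.
    x = x.transpose(1, 0, 2, 3)
    return x.reshape(c, h, w)

# Tag every channel with its own index so the new ordering is visible.
features = np.arange(6, dtype=np.float32).reshape(6, 1, 1) * np.ones((6, 2, 2), dtype=np.float32)
shuffled = channel_shuffle(features, groups=2)
order = shuffled[:, 0, 0].astype(int).tolist()
print(order)
```

After the shuffle, neighbouring channels come from different groups, so a subsequent grouped or depthwise convolution sees information that originated in every group.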
In the global branch, we first generate query (Q), key (K), and value (V) tensors through 1×1 convolution and 3×3 depthwise separable convolution, each with dimensions \(\hat{H} \times \hat{W} \times \hat{C}\). The Q, K, and V tensors are then reshaped into \(\hat{Q}\), \(\hat{K}\), and \(\hat{V}\) with dimensions \(\hat{H}\hat{W} \times \hat{C}\). The attention map is computed from the dot product of \(\hat{Q}\) and \(\hat{K}\), scaled by \(\alpha\) and normalized with Softmax. The global branch output \(F_{att}\) is given by:
$$
F_{att} = \mathcal{W}_{1\times1} \text{Attention}(\hat{Q}; \hat{K}; \hat{V}) + Y
$$
$$
\text{Attention}(\hat{Q}; \hat{K}; \hat{V}) = \hat{V} \cdot \text{Softmax}(\hat{Q}\hat{K}^T / \alpha)
$$
where \(\alpha\) is a learnable scaling parameter that controls the magnitude of the dot product before Softmax normalization. The final output of the CAFM module is obtained by element-wise addition of the local and global branch outputs:
$$
F_{out} = F_{att} + F_{conv}
$$
By integrating the CAFM module into the neck network, our model achieves enhanced feature representation that is both locally precise and globally coherent, significantly improving detection accuracy for small objects in China drone aerial images.
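The global branch and the final fusion can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation. It assumes a channel-first layout, so the flattened tensors are the transposes of \(\hat{Q}\), \(\hat{K}\), and \(\hat{V}\), and it assumes the attention map is channel-wise (\(\hat{C} \times \hat{C}\)), which is the reading under which the product with \(\hat{V}\) is dimensionally consistent. The 1×1 convolution is modelled as a plain matrix `w_out`, \(\alpha\) is passed as a fixed scalar rather than learned, and the local branch output is a random stand-in.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def global_branch(q, k, v, y, w_out, alpha):
    """q, k, v, y: (C, H, W) arrays; w_out: (C, C) matrix acting as a 1x1 conv."""
    c, h, w = q.shape
    q_hat = q.reshape(c, h * w)  # (C, HW)
    k_hat = k.reshape(c, h * w)
    v_hat = v.reshape(c, h * w)
    # Channel-wise attention map, scaled by alpha before Softmax.
    attn = softmax(q_hat @ k_hat.T / alpha, axis=-1)  # (C, C)
    out = attn @ v_hat                                # (C, HW)
    out = w_out @ out                                 # 1x1 conv: per-pixel channel mixing
    f_att = out.reshape(c, h, w) + y                  # residual connection with the input
    return f_att, attn

rng = np.random.default_rng(0)
C, H, W = 4, 3, 3
y = rng.standard_normal((C, H, W))
q, k, v = (rng.standard_normal((C, H, W)) for _ in range(3))
f_conv = rng.standard_normal((C, H, W))  # stand-in for the local branch output
w_out = rng.standard_normal((C, C))

f_att, attn = global_branch(q, k, v, y, w_out, alpha=1.0)
f_out = f_att + f_conv                   # element-wise fusion of both branches
print(f_out.shape, attn.shape)
```

With a channel-wise map the attention cost grows linearly with the number of pixels, which matters for the high-resolution neck features used for small objects.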
Dynamic Detection Head (DyHead)
The standard detection head in YOLOv11 employs a feature pyramid structure with limited cross-layer interaction, which is insufficient for handling the diverse scale variations and dense distributions of objects in China drone aerial imagery. To address this limitation, we replace the conventional detection head with the Dynamic Detection Head (DyHead), which unifies scale-aware, spatial-aware, and task-aware attention mechanisms through a multi-dimensional attention framework.
DyHead operates through three cascaded attention modules that dynamically adjust feature weights based on different perspectives:
Scale-aware attention dynamically adjusts feature weights for objects at different scales, enabling the model to emphasize appropriate feature levels for small, medium, and large objects. This is particularly important for China drone imagery, where object scales vary dramatically due to varying flight altitudes and camera perspectives.
Spatial-aware attention highlights critical spatial regions while suppressing background noise, effectively guiding the model to focus on areas with high object density or complex occlusion patterns. In China drone aerial scenes, this helps distinguish occluded objects from cluttered backgrounds.
Task-aware attention dynamically adjusts feature focus based on the specific requirements of classification and regression tasks, alleviating task conflicts and improving overall detection performance. This ensures that the model simultaneously optimizes for accurate category prediction and precise localization.
The overall operation of DyHead can be formulated as:
$$
W(F) = \pi_C \left( \pi_S \left( \pi_L (F) \times F \right) \times F \right) \times F
$$
where \(\pi_L\), \(\pi_S\), and \(\pi_C\) represent scale-aware, spatial-aware, and task-aware attention functions respectively, and \(F\) denotes the input feature tensor. By stacking these three attention mechanisms in a sequential manner, DyHead effectively reweights multi-scale features to enhance small object perception, suppress background interference, and balance task-specific requirements. The integration of DyHead into our YOLOv11 model significantly boosts detection accuracy for China drone aerial images without substantially increasing inference latency.
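The cascaded structure of this formula can be sketched with simplified stand-in gates. The sketch assumes the feature tensor is arranged as `(L, S, C)` (levels, flattened spatial positions, channels) and replaces each attention function with a hard-sigmoid gate over the corresponding axis. The actual DyHead modules are learned and considerably more elaborate, so this only illustrates the order of the three reweighting steps.

```python
import numpy as np

def hard_sigmoid(x: np.ndarray) -> np.ndarray:
    return np.clip((x + 1.0) / 2.0, 0.0, 1.0)

def scale_gate(f: np.ndarray) -> np.ndarray:
    """Stand-in for pi_L: one weight per pyramid level."""
    return hard_sigmoid(f.mean(axis=(1, 2), keepdims=True))  # (L, 1, 1)

def spatial_gate(f: np.ndarray) -> np.ndarray:
    """Stand-in for pi_S: one weight per spatial position."""
    return hard_sigmoid(f.mean(axis=(0, 2), keepdims=True))  # (1, S, 1)

def task_gate(f: np.ndarray) -> np.ndarray:
    """Stand-in for pi_C: one weight per channel."""
    return hard_sigmoid(f.mean(axis=(0, 1), keepdims=True))  # (1, 1, C)

def dyhead_block(f: np.ndarray) -> np.ndarray:
    f = scale_gate(f) * f    # scale-aware reweighting
    f = spatial_gate(f) * f  # spatial-aware reweighting
    f = task_gate(f) * f     # task-aware reweighting
    return f

rng = np.random.default_rng(0)
feats = rng.standard_normal((3, 16, 8))  # 3 levels, 16 positions, 8 channels
out = dyhead_block(feats)
print(out.shape)
```

Because each gate lies in [0, 1], every step can only attenuate features, never amplify them; the learned versions decide which levels, positions, and channels to keep.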
WIoUv3 Loss Function for Robust Regression
Bounding box regression is a critical component of object detection, and the choice of loss function substantially affects localization accuracy. The standard YOLOv11 employs the CIoU (Complete IoU) loss, which considers overlap area, center point distance, and aspect ratio. However, CIoU exhibits limited sensitivity to small objects and occlusion scenarios commonly encountered in China drone aerial imagery. To address this, we replace CIoU with the WIoUv3 (Wise-IoU version 3) loss function, which introduces a non-monotonic focusing mechanism to dynamically reweight the contribution of each anchor box based on its quality.
WIoUv1 serves as the foundation for WIoUv2 and WIoUv3. It incorporates distance-based attention weighting, where the geometric penalty is automatically reduced when the anchor box already achieves high IoU with the target, preventing excessive intervention in well-trained samples. The WIoUv1 loss is defined as:
$$
L_{WIoUv1} = R_{WIoU} \cdot L_{IoU}
$$
$$
R_{WIoU} = \exp \left( \frac{(x - x_{gt})^2 + (y - y_{gt})^2}{W_g^2 + H_g^2} \right)
$$
$$
L_{IoU} = 1 - IoU
$$
where \(x\) and \(y\) are the coordinates of the anchor box center, \(x_{gt}\) and \(y_{gt}\) are the coordinates of the target box center, and \(W_g\) and \(H_g\) represent the width and height of the minimum enclosing bounding box.
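The WIoUv1 definition can be checked on a single pair of boxes with a short pure-Python sketch. Boxes are assumed to be given as `(cx, cy, w, h)`; this format and the function name `wiou_v1` are illustrative choices, and the example boxes are made up.

```python
import math

def wiou_v1(box, gt):
    """Boxes are (cx, cy, w, h). Returns (loss, iou, r_wiou)."""
    bx1, by1 = box[0] - box[2] / 2, box[1] - box[3] / 2
    bx2, by2 = box[0] + box[2] / 2, box[1] + box[3] / 2
    gx1, gy1 = gt[0] - gt[2] / 2, gt[1] - gt[3] / 2
    gx2, gy2 = gt[0] + gt[2] / 2, gt[1] + gt[3] / 2

    # Intersection over union.
    inter_w = max(0.0, min(bx2, gx2) - max(bx1, gx1))
    inter_h = max(0.0, min(by2, gy2) - max(by1, gy1))
    inter = inter_w * inter_h
    union = box[2] * box[3] + gt[2] * gt[3] - inter
    iou = inter / union

    # Width and height of the minimum enclosing box (W_g, H_g).
    wg = max(bx2, gx2) - min(bx1, gx1)
    hg = max(by2, gy2) - min(by1, gy1)

    # Distance-based attention R_WIoU, then L_WIoUv1 = R_WIoU * (1 - IoU).
    dist2 = (box[0] - gt[0]) ** 2 + (box[1] - gt[1]) ** 2
    r_wiou = math.exp(dist2 / (wg ** 2 + hg ** 2))
    return r_wiou * (1.0 - iou), iou, r_wiou

loss, iou, r_wiou = wiou_v1((5.0, 5.0, 4.0, 4.0), (6.0, 5.0, 4.0, 4.0))
perfect_loss, _, _ = wiou_v1((5.0, 5.0, 4.0, 4.0), (5.0, 5.0, 4.0, 4.0))
print(round(iou, 3), round(r_wiou, 4), round(loss, 4))
```

Since \(R_{WIoU} \ge 1\), the factor amplifies the IoU loss in proportion to the center offset, and both the offset and the IoU loss vanish for a perfectly aligned box.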
WIoUv2 extends WIoUv1 by introducing a monotonic focusing coefficient \(L_{IoU}^*\) that reduces the weight of easy samples, forcing the model to concentrate on hard samples. To prevent the focusing coefficient from drifting during training, normalization is applied using the mean of \(L_{IoU}\):
$$
L_{WIoUv2} = \left( \frac{L_{IoU}^*}{\overline{L_{IoU}}} \right)^{\gamma} \cdot L_{WIoUv1}, \quad \gamma > 0
$$
WIoUv3 further refines the focusing mechanism by using an outlier degree \(\beta\) to evaluate anchor box quality. The outlier degree is defined as the ratio of \(L_{IoU}^*\) to the mean \(\overline{L_{IoU}}\):
$$
\beta = \frac{L_{IoU}^*}{\overline{L_{IoU}}} \in [0, +\infty)
$$
A non-monotonic focusing factor \(r\) is then constructed based on \(\beta\):
$$
r = \frac{\beta}{\delta \alpha^{\beta - \delta}}
$$
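Here \(\alpha\) and \(\delta\) are hyperparameters (this \(\alpha\) is unrelated to the scaling parameter used in CAFM), and \(r = 1\) when \(\beta = \delta\). The final WIoUv3 loss is given by: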
$$
L_{WIoUv3} = r \cdot L_{WIoUv1}
$$
The non-monotonic focusing mechanism assigns small gradient gains to both high-quality anchor boxes (small \(\beta\)) and extremely low-quality anchor boxes (large \(\beta\)), while focusing the training on medium-quality samples. This strategy effectively suppresses harmful gradients from outliers and stabilizes the training process, leading to improved localization accuracy for small and occluded objects in China drone aerial images. By adopting WIoUv3 as the bounding box regression loss, our model achieves more robust and precise localization, particularly under challenging conditions such as dense crowds, partial occlusions, and complex backgrounds.
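The non-monotonic shape of the focusing factor can be seen by evaluating \(r\) at a few outlier degrees. The sketch below is pure Python; the values `alpha=1.9` and `delta=3.0` are assumed for illustration only, since the text does not state which hyperparameters were used. Setting the derivative of \(r\) with respect to \(\beta\) to zero places the maximum at \(\beta = 1/\ln\alpha\).

```python
import math

def focusing_factor(beta: float, alpha: float = 1.9, delta: float = 3.0) -> float:
    """r = beta / (delta * alpha ** (beta - delta)); alpha and delta are assumed values."""
    return beta / (delta * alpha ** (beta - delta))

def wiou_v3(l_wiou_v1: float, l_iou: float, mean_l_iou: float,
            alpha: float = 1.9, delta: float = 3.0) -> float:
    """Scale a WIoUv1 loss by the focusing factor of its outlier degree."""
    # In training, l_iou and mean_l_iou act as constants here (no gradient),
    # and mean_l_iou would be a running mean over recent batches.
    beta = l_iou / mean_l_iou
    return focusing_factor(beta, alpha, delta) * l_wiou_v1

peak_beta = 1.0 / math.log(1.9)       # stationary point of r for alpha = 1.9
r_high_quality = focusing_factor(0.1)  # small beta: already well-fitted box
r_peak = focusing_factor(peak_beta)    # medium-quality box
r_outlier = focusing_factor(10.0)      # large beta: low-quality outlier
print(r_high_quality < r_peak, r_outlier < r_peak)
```

Both ends of the quality range receive a smaller gradient gain than the middle, which is the behaviour described above.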
Experimental Setup and Evaluation
Dataset: VisDrone2019
To evaluate the effectiveness of our proposed improvements, we conduct experiments on the VisDrone2019 benchmark dataset, which is specifically designed for China drone aerial image analysis. The dataset contains 8,629 images captured by China drone platforms under various environmental conditions, including different times of day, weather conditions, and illumination levels. The images cover a wide range of urban and suburban scenes, with annotations for 10 object categories: pedestrian, people, bicycle, car, van, truck, tricycle, awning-tricycle, bus, and motor. The dataset is divided into training (6,471 images), validation (548 images), and test (1,610 images) sets following the standard protocol. The proportion of small objects in VisDrone2019 is exceptionally high, making it a suitable benchmark for evaluating small object detection algorithms for China drone applications.
Evaluation Metrics
We employ the following standard evaluation metrics to comprehensively assess detection performance:
| Metric | Symbol | Definition |
|---|---|---|
| Precision | P | TP / (TP + FP) |
| Recall | R | TP / (TP + FN) |
| Mean Average Precision (IoU=0.5) | mAP50 | Average of AP across all classes at IoU threshold 0.5 |
| Mean Average Precision (IoU=0.5:0.95) | mAP50-95 | Average of AP across all classes at IoU thresholds from 0.5 to 0.95 |
| Number of Parameters | Params | Total number of trainable parameters (in millions) |
| Giga Floating-Point Operations | GFLOPs | Computational complexity in billions of floating-point operations |
| Frames Per Second | FPS | Inference speed in frames per second |
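The first rows of the table translate directly into code. The sketch below uses hypothetical detection counts and hypothetical per-class AP values purely to show the arithmetic; it is not tied to any evaluation toolkit.

```python
def precision(tp: int, fp: int) -> float:
    """P = TP / (TP + FP); defined as 0 when there are no detections."""
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp: int, fn: int) -> float:
    """R = TP / (TP + FN); defined as 0 when there are no ground-truth objects."""
    return tp / (tp + fn) if (tp + fn) else 0.0

def mean_average_precision(ap_per_class: list) -> float:
    """mAP at one IoU threshold: the mean of the per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)

# IoU thresholds averaged by mAP50-95: 0.50, 0.55, ..., 0.95.
iou_thresholds = [round(0.5 + 0.05 * i, 2) for i in range(10)]

# Hypothetical counts: 80 correct detections, 20 false alarms, 40 missed objects.
p = precision(tp=80, fp=20)
r = recall(tp=80, fn=40)
map_example = mean_average_precision([0.30, 0.50, 0.40])  # hypothetical per-class APs
print(p, round(r, 3), round(map_example, 3), iou_thresholds)
```

mAP50-95 repeats the per-class averaging at each of the ten thresholds and then averages the ten results.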
Implementation Details
All experiments are conducted on a system running Windows 11 with an NVIDIA RTX 4060 GPU, PyTorch 2.0, and CUDA 11.8. Input images are resized to 640×640 pixels. We train for 300 epochs with a batch size of 8. The SGD optimizer is employed with an initial learning rate of 0.01. Standard data augmentation techniques, including random flips, mosaic augmentation, and color jittering, are applied during training.
Experimental Results and Analysis
Ablation Studies
To systematically evaluate the contribution of each proposed component, we conduct ablation experiments on the VisDrone2019 dataset. The baseline model is YOLOv11n; we add CAFM, DyHead, and WIoUv3 individually and then all together to measure their individual and combined impacts. The results are summarized in Table 2.
| Baseline | DyHead | CAFM | WIoUv3 | P (%) | R (%) | mAP50 (%) | mAP50-95 (%) | Params (M) | GFLOPs | FPS |
|---|---|---|---|---|---|---|---|---|---|---|
| ✓ | | | | 45.8 | 34.6 | 35.0 | 20.5 | 2.58 | 6.3 | 155.3 |
| ✓ | ✓ | | | 48.7 | 36.0 | 37.1 | 21.8 | 3.10 | 7.5 | 94.9 |
| ✓ | | ✓ | | 47.0 | 34.6 | 35.2 | 20.5 | 3.50 | 10.7 | 90.2 |
| ✓ | | | ✓ | 46.3 | 35.3 | 35.7 | 20.6 | 2.58 | 6.3 | 153.8 |
| ✓ | ✓ | ✓ | ✓ | 50.3 | 37.4 | 38.4 | 22.5 | 3.99 | 11.5 | 64.2 |
The ablation results show the contribution of each proposed component. Adding DyHead alone improves mAP50 by 2.1 percentage points (from 35.0% to 37.1%) and mAP50-95 by 1.3 points, confirming that the multi-dimensional attention mechanism enhances feature reweighting for China drone aerial objects. The CAFM module alone yields a modest improvement of 0.2 points in mAP50, which suggests that its benefit emerges mainly in combination with the other components. The WIoUv3 loss function alone improves mAP50 by 0.7 points and mAP50-95 by 0.1 points without changing the parameter count or GFLOPs, supporting the use of the non-monotonic focusing mechanism for robust regression. When all three components are combined, our model gains 3.4 points in mAP50 and 2.0 points in mAP50-95 over the baseline YOLOv11n while maintaining a real-time inference speed of 64.2 FPS. The parameter count increases by about 55%, GFLOPs increase by 5.2 (from 6.3 to 11.5), and throughput falls from 155.3 to 64.2 FPS; we consider this an acceptable trade-off for China drone detection applications where accuracy is paramount.
Comparison with State-of-the-Art Methods
We compare our enhanced YOLOv11 model with several state-of-the-art object detectors on the VisDrone2019 dataset, including YOLOv5n, YOLOv6n, YOLOv8n, YOLOv10n, and the baseline YOLOv11n. The results are presented in Table 3.
| Method | P (%) | R (%) | mAP50 (%) | mAP50-95 (%) | Params (M) | GFLOPs | FPS |
|---|---|---|---|---|---|---|---|
| YOLOv5n | 43.2 | 32.5 | 32.3 | 18.6 | 2.50 | 7.1 | 189.9 |
| YOLOv6n | 41.8 | 30.6 | 30.3 | 17.7 | 4.23 | 11.8 | 197.7 |
| YOLOv8n | 44.4 | 33.4 | 33.5 | 19.4 | 3.00 | 8.1 | 189.8 |
| YOLOv10n | 44.9 | 33.1 | 33.3 | 19.5 | 2.69 | 8.2 | 160.5 |
| YOLOv11n | 45.8 | 34.6 | 35.0 | 20.5 | 2.58 | 6.3 | 155.3 |
| Ours | 50.3 | 37.4 | 38.4 | 22.5 | 3.99 | 11.5 | 64.2 |
The comparison shows that our enhanced YOLOv11 model outperforms all compared methods on every accuracy metric. Specifically, our model achieves a mAP50 of 38.4%, which is 3.4 percentage points higher than the baseline YOLOv11n and 6.1 points higher than YOLOv5n. The mAP50-95 of 22.5% represents improvements of 2.0 points over YOLOv11n and 3.9 points over YOLOv5n. Precision and recall also improve, reaching 50.3% and 37.4% respectively. Our model has a higher parameter count and GFLOPs (11.5) than the lighter models and is the slowest of the compared detectors, but its inference speed of 64.2 FPS remains well above the real-time threshold of 30 FPS, making it suitable for practical China drone deployment. We attribute the accuracy gains to the combination of CAFM, DyHead, and WIoUv3, which together improve feature representation, adaptive attention, and bounding box regression for small object detection in China drone aerial imagery.
Per-Class Performance Analysis
To provide deeper insights into the model’s detection capabilities, we analyze the per-class average precision (AP) for all 10 categories in the VisDrone2019 dataset. Table 4 presents the comparison between our enhanced model and the baseline YOLOv11n.
| Category | YOLOv11n (%) | Ours (%) | Improvement (%) |
|---|---|---|---|
| Pedestrian | 28.4 | 32.1 | +3.7 |
| People | 24.6 | 27.8 | +3.2 |
| Bicycle | 18.9 | 21.5 | +2.6 |
| Car | 52.3 | 55.6 | +3.3 |
| Van | 40.1 | 43.2 | +3.1 |
| Truck | 36.7 | 39.4 | +2.7 |
| Tricycle | 30.2 | 33.8 | +3.6 |
| Awning-tricycle | 27.5 | 30.9 | +3.4 |
| Bus | 51.8 | 54.7 | +2.9 |
| Motor | 35.1 | 38.3 | +3.2 |
The per-class analysis reveals consistent improvements across all categories, with particularly notable gains for small and frequently occluded objects such as pedestrians (+3.7%), tricycles (+3.6%), and awning-tricycles (+3.4%). These improvements validate the effectiveness of our proposed enhancements in addressing the core challenges of small object detection and occlusion handling in China drone aerial imagery. The smallest improvement is observed for bicycles (+2.6%), which may be attributed to their highly variable shapes and poses. Nevertheless, the uniform positive gains across all categories demonstrate the robustness and generalizability of our approach.
The integration of CAFM in the neck network enables better fusion of local and global features, which is particularly beneficial for small objects that rely on contextual information for accurate detection. The DyHead’s multi-dimensional attention mechanism dynamically emphasizes spatial regions containing small objects while suppressing background clutter, leading to fewer false positives. The WIoUv3 loss function improves localization precision for occluded objects by focusing training on medium-quality anchor boxes and reducing the influence of outliers. Collectively, these changes enable our model to outperform the compared nano-scale YOLO detectors on the VisDrone2019 benchmark, making it a compelling choice for real-world China drone detection applications.
Inference Speed Analysis
For practical deployment on China drone platforms, inference speed is a critical consideration alongside detection accuracy. Table 5 presents the computational efficiency metrics of our model compared to the baseline.
| Model | Params (M) | GFLOPs | FPS | mAP50 (%) |
|---|---|---|---|---|
| YOLOv11n | 2.58 | 6.3 | 155.3 | 35.0 |
| YOLOv11n + CAFM | 3.50 | 10.7 | 90.2 | 35.2 |
| YOLOv11n + DyHead | 3.10 | 7.5 | 94.9 | 37.1 |
| YOLOv11n + WIoUv3 | 2.58 | 6.3 | 153.8 | 35.7 |
| Ours (Full) | 3.99 | 11.5 | 64.2 | 38.4 |
Although our full model exhibits a 55% increase in parameter count and an 83% increase in GFLOPs compared to YOLOv11n, and its throughput falls from 155.3 to 64.2 FPS, the inference speed still comfortably exceeds the real-time requirement of 30 FPS for most China drone applications. We consider the trade-off favorable, as the 3.4-point gain in mAP50 represents a substantial enhancement in detection capability. For China drone deployment scenarios where computational resources are constrained, the individual components offer a lighter alternative: WIoUv3 adds 0.7 points of mAP50 at no parameter or GFLOPs cost, and DyHead adds 2.1 points for 1.2 additional GFLOPs, as shown in Table 5.
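The trade-off figures can be reproduced from the GFLOPs, FPS, and mAP50 columns of Table 5 with a few lines of Python. The helper name `relative_change` is chosen here for illustration.

```python
def relative_change(old: float, new: float) -> float:
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0

# Values for YOLOv11n (baseline) and the full model, taken from Table 5.
gflops_change = relative_change(6.3, 11.5)    # increase in computational cost
fps_change = relative_change(155.3, 64.2)     # change in throughput (negative = slower)
map50_gain_points = 38.4 - 35.0               # absolute gain in percentage points
map50_gain_relative = relative_change(35.0, 38.4)

# Per-frame latency implied by the FPS figures, against a 30 FPS budget.
latency_full_ms = 1000.0 / 64.2
budget_30fps_ms = 1000.0 / 30.0

print(round(gflops_change), round(fps_change), round(map50_gain_points, 1))
print(round(latency_full_ms, 1), round(budget_30fps_ms, 1))
```

The implied per-frame latency of the full model is roughly half of the 30 FPS frame budget, although the figure applies to the desktop GPU used in the experiments rather than to onboard hardware.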
Qualitative Analysis and Discussion
Beyond quantitative evaluation, we analyze the qualitative improvements achieved by our enhanced model. In challenging scenarios commonly encountered in China drone aerial imagery—such as nighttime conditions, dense crowds, and severe occlusions—our model demonstrates notably fewer missed detections and false positives compared to the baseline YOLOv11n. The CAFM module’s ability to capture both local details and global context is particularly beneficial for distinguishing objects from complex backgrounds. The DyHead’s spatial-aware attention mechanism effectively highlights object-dense regions while suppressing irrelevant areas. The WIoUv3 loss function contributes to tighter bounding box alignment, especially for partially occluded objects.
The improvements are most pronounced for small-scale objects that occupy fewer than 32×32 pixels in the image. In such cases, the baseline YOLOv11n frequently fails to detect the object or produces inaccurate bounding boxes. Our enhanced model, through the synergistic combination of CAFM, DyHead, and WIoUv3, achieves reliable detection with precise localization. This is critical for practical China drone applications such as pedestrian detection in crowded streets, vehicle counting in dense traffic, and small object identification in complex environments.
Computational Complexity Analysis
To provide a theoretical understanding of the computational requirements, we analyze the complexity of each proposed module. For an input feature tensor of dimensions \(C \times H \times W\), the CAFM module has a computational complexity of approximately:
$$
O_{CAFM} = O\left(C^2 \cdot H \cdot W + C \cdot H \cdot W \cdot k^2\right)
$$
where \(k=3\) is the kernel size of the depthwise separable convolution. The first term corresponds to the 1×1 convolution and attention operations, while the second term represents the depthwise convolution.
The DyHead complexity is dominated by the three attention modules:
$$
O_{DyHead} = O\left(C^2 \cdot H \cdot W \cdot S\right)
$$
where \(S\) is the number of scales in the feature pyramid. The WIoUv3 loss function introduces negligible computational overhead during training and no additional cost during inference.
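To get a feel for the magnitudes, the two expressions can be evaluated for a concrete feature map. The sketch drops all constant factors, so the results are leading-order operation counts rather than exact FLOPs, and the feature size used (64 channels at 80×80 with 3 pyramid scales) is a hypothetical example, not a configuration reported in the text.

```python
def cafm_ops(c: int, h: int, w: int, k: int = 3) -> int:
    """Leading-order count: C^2*H*W (1x1 conv, attention) + C*H*W*k^2 (depthwise conv)."""
    return c * c * h * w + c * h * w * k * k

def dyhead_ops(c: int, h: int, w: int, s: int) -> int:
    """Leading-order count: C^2*H*W*S over S pyramid scales."""
    return c * c * h * w * s

# Hypothetical neck feature map: 64 channels, 80x80 resolution, 3 scales.
cafm = cafm_ops(64, 80, 80)
dyhead = dyhead_ops(64, 80, 80, s=3)
pointwise_share = (64 * 64 * 80 * 80) / cafm  # share of the C^2*H*W term in CAFM

print(cafm, dyhead, round(pointwise_share, 3))
```

For this example the \(C^2 \cdot H \cdot W\) term dominates the CAFM count, so channel width drives the cost much more than the depthwise kernel size.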
The total computational complexity of our enhanced model remains manageable for real-time deployment on modern GPU-equipped China drone platforms, as confirmed by our empirical FPS measurements.
Limitations and Future Work
Despite the significant improvements achieved, our enhanced YOLOv11 model has several limitations that warrant further investigation. First, the increased parameter count and computational complexity may pose challenges for deployment on resource-constrained edge devices commonly used in China drone systems. Future work will focus on model compression techniques such as knowledge distillation, network pruning, and quantization to reduce the model footprint while preserving accuracy. Second, the current improvements are primarily focused on the neck and head components of the network; further gains may be achievable by redesigning the backbone specifically for small object detection in China drone imagery. Third, the model’s performance under extreme weather conditions (e.g., heavy rain, fog, snow) has not been systematically evaluated, and domain adaptation techniques may be required to ensure robust operation in diverse environmental conditions. Fourth, the computational overhead of the CAFM module, particularly the global attention branch, could be reduced through efficient attention mechanisms such as linear attention or window-based attention. Addressing these limitations will be the focus of our ongoing research.
Conclusion
In this work, we have presented a comprehensive enhancement to the YOLOv11 model specifically designed for China drone aerial image detection. By integrating three novel components—the Convolutional Attention Fusion Module (CAFM) for synergistic local-global feature extraction, the Dynamic Detection Head (DyHead) for multi-dimensional adaptive attention, and the WIoUv3 loss function for robust bounding box regression with non-monotonic focusing—we have achieved substantial improvements in detection accuracy for small and occluded objects. Extensive experiments on the VisDrone2019 benchmark dataset demonstrate that our enhanced model outperforms the baseline YOLOv11n by 3.4% in mAP50 and 2.0% in mAP50-95, while maintaining real-time inference speed of 64.2 FPS. Per-class analysis reveals consistent gains across all 10 object categories, with particularly notable improvements for small and frequently occluded objects. The proposed framework represents a significant step toward reliable and accurate object detection for China drone applications, with direct implications for agriculture, traffic management, emergency response, and infrastructure inspection. Future research will explore model lightweighting techniques and domain adaptation strategies to further enhance the practical deployment of our approach on resource-constrained China drone platforms.

