The rapid advancement of artificial intelligence and aerospace technology has made low-altitude autonomous perception a pivotal component of intelligent unmanned platforms. In this context, unmanned aerial vehicle (UAV) technology, characterized by high mobility, cost-effectiveness, and adaptability to diverse scenarios, has found extensive application in information perception tasks in highly dynamic settings such as disaster assessment, intelligent security, precision agriculture, and traffic monitoring. Among these tasks, small object detection stands as a core challenge for visual perception systems on UAV platforms, playing a crucial role in the system’s ability to discover and respond to critical targets in complex environments.

In typical UAV perspectives, such as low-altitude remote sensing or oblique photography, objects often exhibit small scale, blurred contours, strong background interference, and dense overlaps. These factors pose significant challenges to the representational capacity of traditional object detection models. A primary issue is the loss of spatial information during the down-sampling prevalent in mainstream detection frameworks, which leads to a marked decline in localization and classification accuracy for small-scale targets and, in turn, to frequent false positives and missed detections. Therefore, balancing detection accuracy against model efficiency, while maintaining a compact model size suitable for deployment on resource-constrained UAV platforms, has emerged as a critical research focus and a persistent challenge in UAV vision.
Object detection has seen substantial progress in recent years, with mainstream algorithms broadly categorized into two types: two-stage and one-stage methods. Two-stage methods, represented by the R-CNN series, first generate region proposals via heuristic methods or a CNN, then perform classification and regression on these proposals. While they often achieve high accuracy, their speed generally falls short of real-time requirements. In contrast, one-stage methods, exemplified by the YOLO (You Only Look Once) series, offer significantly faster inference: they perform a single pass over the image, using a CNN to extract features and directly predict bounding boxes and class probabilities. Although one-stage methods may slightly lag their two-stage counterparts in precision, their lightweight and efficient nature makes them better suited to the computational constraints typical of UAV applications, where real-time performance, energy consumption, and limited onboard processing power are paramount. Among one-stage detectors, the YOLO series stands out for its efficient, end-to-end architecture. Its continuous evolution through versions such as YOLOv5, YOLOv8, and YOLOv11 has not only improved detection accuracy but also gradually enhanced its capability to perceive small objects, making it a strong foundation for UAV aerial image analysis.
However, in UAV small object detection, targets exhibit drastic scale variations, complex and cluttered backgrounds, and diverse imaging conditions. These factors make it difficult for existing methods to guarantee high recognition rates and detection precision. To tackle these challenges, we propose HMD-YOLO, an enhanced small object detection algorithm based on the YOLOv11s architecture. The contributions of our work are fourfold:
- We design a novel High-Resolution Multi-Scale Convolutional Attention (HR-MSCA) module. It optimizes the original three-head detection structure by adding a shallower, higher-resolution detection head (P2) and embedding a Multi-Scale Convolutional Attention (MSCA) mechanism. This design enhances the retention of small object details while maintaining model efficiency, improving detection capability in complex UAV scenarios.
- We employ a lightweight, efficient dynamic upsampler, termed Litesample, to replace the conventional upsampling module in the model’s neck. This significantly improves feature recovery and edge modeling for small targets in UAV imagery while conserving computational resources.
- We introduce the Wise-IoU (WIoU) loss function, based on a dynamic non-monotonic focusing mechanism, as a replacement for the standard bounding box regression loss. It improves training efficiency and final detection performance, particularly on the difficult samples common in UAV datasets.
- We integrate a Dynamic Detection Head (DyHead) to unify and enhance the detection heads, further strengthening the model’s ability to localize and identify small objects, leading to higher precision in small target detection.
Network Architecture of the Proposed HMD-YOLO Model
The overall architecture of our proposed HMD-YOLO model is built upon the YOLOv11s backbone, with core modifications strategically integrated to address the specific difficulties of UAV small object detection. The primary enhancements are the HR-MSCA module in the backbone/neck for better feature extraction, the Litesample upsampler for efficient feature map resolution enhancement, the Wise-IoU loss function for robust bounding box regression, and the Dynamic Head for adaptive feature refinement. These components work synergistically to boost performance on the small, dense objects against cluttered backgrounds typical of UAV operations.
1. HR-MSCA Module
A major reason for poor small object detection is the limited pixel count of such targets, which leaves insufficient feature information for the model to learn effectively. The substantial down-sampling stride in YOLOv11 produces deep feature maps with low spatial resolution, making it difficult to learn discriminative features for tiny objects. Furthermore, the standard convolutional operations in YOLOv11 handle spatial features relatively coarsely, which can lead to missed detections and low accuracy for the minuscule targets prevalent in UAV imagery. To mitigate these issues, we designed the HR-MSCA module, which incorporates two key modifications.
First, we modify the detection head structure. The original YOLOv11s employs a three-head design: P3 (80×80), P4 (40×40), and P5 (20×20) for detecting small, medium, and large objects, respectively. While the P3 head can theoretically detect objects down to approximately 8×8 pixels, many targets in UAV views are significantly smaller. Moreover, large objects are relatively rare from a high-altitude drone perspective, making the P5 head less critical. Therefore, we introduce a new, higher-resolution P2 detection head with a 160×160 feature map and remove the P5 head. This shifts the model’s focus towards smaller objects and increases the granularity of spatial information available for detection.
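The stride arithmetic behind these head resolutions can be sketched in a few lines; strides of 4, 8, 16, and 32 for P2 through P5 are assumed here, following the usual YOLO convention:

```python
# Feature-map resolution per YOLO detection head, assuming the standard
# stride convention (P2=4, P3=8, P4=16, P5=32) and a 640x640 input.
# Each head cell covers roughly stride x stride input pixels, which is
# why a P2 head helps with targets smaller than ~8x8 pixels.
def head_resolutions(input_size=640, strides=(4, 8, 16, 32)):
    return {f"P{i}": input_size // s for i, s in enumerate(strides, start=2)}

print(head_resolutions())  # {'P2': 160, 'P3': 80, 'P4': 40, 'P5': 20}
```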
Second, we incorporate a Multi-Scale Convolutional Attention (MSCA) mechanism to combat false positives and missed detections caused by complex backgrounds. MSCA captures multi-scale contextual information through multiple branches with different kernel sizes, significantly boosting the model’s sensitivity to small and scale-varying targets. It utilizes depthwise and strip convolutions to maintain performance while reducing computational complexity. Crucially, MSCA generates an attention map that is applied to the features via element-wise multiplication, allowing the model to adaptively focus on target regions and suppress background interference, a vital capability for UAV image analysis. The MSCA structure can be mathematically represented as follows:
$$ M = \text{DWConv}_{5\times5}(\text{Input}) $$
$$ M_1 = \text{DWConv}_{7\times1}(\text{DWConv}_{1\times7}(M)) $$
$$ M_2 = \text{DWConv}_{11\times1}(\text{DWConv}_{1\times11}(M)) $$
$$ M_3 = \text{DWConv}_{21\times1}(\text{DWConv}_{1\times21}(M)) $$
$$ \text{Output} = \text{Conv}_{1\times1}\left(M + \sum_{i=1}^{3} M_i\right) \otimes \text{Input} $$
Here, Input is the input feature map, DWConv$_{i\times j}$ denotes a depthwise convolution with an $i\times j$ kernel, and $M_1$, $M_2$, $M_3$ represent features processed by different effective receptive fields (equivalent to 7×7, 11×11, 21×21) achieved via cascaded strip convolutions. The final step fuses the multi-scale features, applies a 1×1 convolution for channel integration to produce an attention map, and performs element-wise multiplication with the original input. This emphasizes key regions while preserving the original feature information, enhancing the model’s expression and recognition capability for objects of varying scales in UAV scenes.
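A minimal single-channel sketch of this attention computation in plain Python illustrates the multi-branch structure; as simplifying assumptions, kernel weights are uniform placeholders, the 5×5 depthwise convolution is approximated by separable 5×1 and 1×5 strips, and the 1×1 channel-integration convolution is omitted (it is an identity in a one-channel toy):

```python
# Toy single-channel MSCA: strip-convolution branches are fused into an
# attention map that gates the input elementwise. Uniform averaging
# kernels stand in for learned weights.

def conv1d_same(row, k):
    """1-D 'same' convolution with a uniform averaging kernel of size k."""
    pad = k // 2
    padded = [0.0] * pad + row + [0.0] * pad
    return [sum(padded[i:i + k]) / k for i in range(len(row))]

def strip_conv(x, k, axis):
    """1 x k (axis=1) or k x 1 (axis=0) strip convolution on a 2-D map."""
    if axis == 1:                                    # 1 x k: along rows
        return [conv1d_same(row, k) for row in x]
    cols = [list(c) for c in zip(*x)]                # k x 1: along columns
    return [list(c) for c in zip(*[conv1d_same(col, k) for col in cols])]

def msca(x):
    m = strip_conv(strip_conv(x, 5, 1), 5, 0)        # stand-in for 5x5 DWConv
    branches = [strip_conv(strip_conv(m, k, 1), k, 0)  # cascaded 1xk, kx1
                for k in (7, 11, 21)]                  # strip convolutions
    h, w = len(x), len(x[0])
    attn = [[m[i][j] + sum(b[i][j] for b in branches)  # fuse branches
             for j in range(w)] for i in range(h)]
    return [[attn[i][j] * x[i][j] for j in range(w)]   # elementwise gating
            for i in range(h)]

out = msca([[1.0] * 8 for _ in range(8)])
```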
2. Litesample Dynamic Upsampler
The upsampling module is crucial in object detection, especially for small targets, as it restores low-resolution feature maps to higher resolutions. However, YOLOv11 employs traditional nearest-neighbor interpolation, which relies solely on pixel position without any semantic understanding or reasoning. This limits the model’s ability to capture fine details, often producing coarse, blurred prediction boundaries for small UAV targets. While existing dynamic-convolution-based upsamplers such as CARAFE, FADE, and SAPA offer improved accuracy, their reliance on computationally heavy dynamic convolutions and additional sub-networks introduces significant latency, making them less suitable for real-time operation on UAV platforms.
To address this, we adopt the Litesample upsampler. Its core idea is to replace dynamic convolution with dynamic sampling points, simplifying the process and improving efficiency. Let $X \in \mathbb{R}^{C \times H \times W}$ be the low-resolution input feature map. Litesample first generates a dynamic range factor $\alpha$ via a linear layer and Sigmoid activation to control the sampling range:
$$ \alpha = \sigma(\text{linear}_1(X)) \in [0, 1] $$
This is then scaled to constrain the range to $[0, 0.5]$. Another linear layer projects $X$ to initial offsets $O_{\text{init}} \in \mathbb{R}^{2s^2 \times H \times W}$, where $s$ is the upsampling scale factor. The dynamic range factor modulates these initial offsets:
$$ O_{\text{dyn}} = \text{PixelShuffle}(\alpha \cdot O_{\text{init}}) \in \mathbb{R}^{2 \times sH \times sW} $$
The dynamic offsets $O_{\text{dyn}}$ are added to a standard bilinear grid $G$ to form the dynamic sampling coordinates $S$:
$$ S = O_{\text{dyn}} + G $$
Finally, the high-resolution output feature map $X' \in \mathbb{R}^{C \times sH \times sW}$ is obtained by sampling from $X$ at the coordinates $S$:
$$ X' = \text{grid\_sample}(X, S) $$
Litesample captures the benefits of dynamic upsampling while avoiding its high complexity. By learning sampling point locations instead of convolutional kernels, it provides an effective and efficient upsampling solution well suited to the computational constraints of UAV systems.
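The sampling step can be illustrated with a toy single-channel version in plain Python. As an assumption for illustration, the offset-generating linear layers are replaced by fixed zero offsets, so the sketch degenerates to plain bilinear upsampling; a learned offset field would shift each sample point within the range set by $\alpha$:

```python
# Toy single-channel Litesample sampling: offsets perturb a bilinear grid
# and the output is gathered with bilinear interpolation (a hand-rolled
# stand-in for grid_sample).

def bilinear_sample(x, yi, xi):
    """Bilinearly sample 2-D list x at fractional coordinates (yi, xi)."""
    h, w = len(x), len(x[0])
    y0, x0 = int(yi), int(xi)
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = yi - y0, xi - x0
    top = x[y0][x0] * (1 - dx) + x[y0][x1] * dx
    bot = x[y1][x0] * (1 - dx) + x[y1][x1] * dx
    return top * (1 - dy) + bot * dy

def litesample(x, s=2, offsets=None):
    """Upsample 2-D map x by factor s; offsets[i][j] = (oy, ox) or None."""
    h, w = len(x), len(x[0])
    out = [[0.0] * (s * w) for _ in range(s * h)]
    for i in range(s * h):
        for j in range(s * w):
            gy = (i + 0.5) / s - 0.5          # base bilinear grid G
            gx = (j + 0.5) / s - 0.5          # (align_corners=False style)
            oy, ox = (0.0, 0.0) if offsets is None else offsets[i][j]
            yi = min(max(gy + oy, 0.0), h - 1)  # S = O_dyn + G, clamped
            xi = min(max(gx + ox, 0.0), w - 1)
            out[i][j] = bilinear_sample(x, yi, xi)
    return out

up = litesample([[0.0, 1.0], [2.0, 3.0]], s=2)
```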
3. Wise-IoU (WIoU) Loss Function
The design of the loss function is critical to detection performance. Traditional bounding box regression losses such as CIoU, EIoU, and SIoU, used in YOLOv11, assume high-quality training examples and focus on improving regression fitness. However, UAV datasets often contain a high proportion of low-quality examples: tiny, blurred, occluded, or densely packed objects. Forcing the model to regress heavily on such low-quality examples can degrade its generalization. To address this, we replace the default loss with Wise-IoU (WIoU) v3, which employs a dynamic non-monotonic focusing mechanism. It uses an “outlier degree” instead of IoU to assess anchor box quality, allocating gradient gains more reasonably. This strategy reduces the competitive advantage of high-quality anchors and minimizes the harmful influence of low-quality examples, allowing the model to focus on ordinary-quality anchors and improving overall robustness for UAV detection. The formulation is as follows:
$$ \mathcal{L}_{\text{IoU}} = 1 - \text{IoU} $$
$$ R_{\text{WIoU}} = \exp\left(\frac{(x – x_{gt})^2 + (y – y_{gt})^2}{(W_g^2 + H_g^2)^*}\right) $$
$$ \mathcal{L}_{\text{WIoU v1}} = R_{\text{WIoU}} \mathcal{L}_{\text{IoU}} $$
$$ \beta = \frac{\mathcal{L}_{\text{IoU}}^*}{\overline{\mathcal{L}_{\text{IoU}}}} $$
$$ r = \frac{\beta}{\delta \alpha^{\beta – \delta}} $$
$$ \mathcal{L}_{\text{WIoU v3}} = r \mathcal{L}_{\text{WIoU v1}} $$
Here, $\beta$ represents the outlier degree, $\alpha$ and $\delta$ are hyperparameters, $(x, y)$ and $(x_{gt}, y_{gt})$ are the predicted and ground-truth bounding box center coordinates, $W_g$ and $H_g$ are the width and height of the smallest enclosing box, $r$ is the non-monotonic weighting factor, $*$ denotes a detach operation to stop gradient flow, $R_{\text{WIoU}}$ is a distance-aware attention term, $\mathcal{L}_{\text{IoU}}$ is the base IoU loss, and $\overline{\mathcal{L}_{\text{IoU}}}$ is a running mean of the IoU loss. WIoU’s dynamic focusing mechanism suppresses the adverse gradient effects of both very high-quality and very low-quality samples, directing optimization towards medium-quality samples, which yields better generalization and more precise bounding box regression for challenging UAV targets.
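A numerical sketch of the WIoU v3 computation for axis-aligned boxes $(x_1, y_1, x_2, y_2)$, following the equations above; the values $\alpha = 1.9$ and $\delta = 3$ and the fixed running mean passed in are illustrative assumptions, not tuned settings:

```python
# WIoU v3 for a single predicted/ground-truth box pair. In a real trainer
# mean_liou would be an exponential running mean and the starred terms
# would be detached from the autograd graph.
import math

def iou(a, b):
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def wiou_v3(pred, gt, mean_liou, alpha=1.9, delta=3.0):
    l_iou = 1.0 - iou(pred, gt)
    # distance-aware attention R_WIoU (enclosing-box diagonal, detached)
    cxp, cyp = (pred[0] + pred[2]) / 2, (pred[1] + pred[3]) / 2
    cxg, cyg = (gt[0] + gt[2]) / 2, (gt[1] + gt[3]) / 2
    wg = max(pred[2], gt[2]) - min(pred[0], gt[0])
    hg = max(pred[3], gt[3]) - min(pred[1], gt[1])
    r_wiou = math.exp(((cxp - cxg) ** 2 + (cyp - cyg) ** 2)
                      / (wg ** 2 + hg ** 2))
    beta = l_iou / mean_liou                       # outlier degree
    r = beta / (delta * alpha ** (beta - delta))   # non-monotonic gain
    return r * r_wiou * l_iou                      # L_WIoU v3

loss = wiou_v3((0, 0, 2, 2), (1, 1, 3, 3), mean_liou=0.5)
```

Note how a perfectly regressed box ($\beta = 0$) receives zero gradient gain, while a moderately misaligned box dominates the loss.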
4. Dynamic Detection Head (DyHead)
The detection head’s feature modeling capability is paramount for final performance. The standard YOLOv11 head uses a fixed convolutional stack with a local receptive field, processing objects of all sizes uniformly. While efficient, this is suboptimal for the extreme scale variations and background clutter in UAV imagery. We therefore integrate the Dynamic Head (DyHead) to replace the native detection heads. DyHead jointly models attention across three dimensions: scale, space, and channel. For an input feature tensor $F \in \mathbb{R}^{L \times S \times C}$ (Level, Space, Channel), the DyHead operation is:
$$ W(F) = \pi_C(\pi_S(\pi_L(F) \cdot F) \cdot F) \cdot F $$
where $\pi_L$, $\pi_S$, and $\pi_C$ represent scale-aware, spatial-aware, and task-aware attention, respectively.
- Scale-aware attention $\pi_L$ adaptively fuses multi-scale features (e.g., from P2, P3, P4) via global average pooling and a small conv layer.
- Spatial-aware attention $\pi_S$ focuses on discriminative spatial locations through deformable convolution and learned offsets.
- Task-aware attention $\pi_C$ dynamically adjusts channel-wise features using a dynamic gate mechanism.
We stack three DyHead blocks, a configuration found to balance accuracy gains and computational cost effectively for UAV detection tasks.
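The sequential gating in $W(F)$ can be illustrated with a toy block in plain Python. The real scale-aware, spatial-aware, and task-aware modules use learned convolutions, deformable sampling, and dynamic gates; here each is deliberately reduced to a simple sigmoid gate over its dimension, purely to show the data flow of the nested attention:

```python
# Toy DyHead block: apply one multiplicative gate per level (pi_L), per
# spatial position (pi_S), and per channel (pi_C), in that order, to a
# tensor F of shape (L, S, C) stored as nested lists.
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def mean(vals):
    vals = list(vals)
    return sum(vals) / len(vals)

def dyhead_block(F):
    L, S, C = len(F), len(F[0]), len(F[0][0])
    # pi_L: one gate per level (pooled over space and channel)
    gl = [sigmoid(mean(F[l][s][c] for s in range(S) for c in range(C)))
          for l in range(L)]
    F = [[[F[l][s][c] * gl[l] for c in range(C)]
          for s in range(S)] for l in range(L)]
    # pi_S: one gate per spatial position (pooled over level and channel)
    gs = [sigmoid(mean(F[l][s][c] for l in range(L) for c in range(C)))
          for s in range(S)]
    F = [[[F[l][s][c] * gs[s] for c in range(C)]
          for s in range(S)] for l in range(L)]
    # pi_C: one gate per channel (pooled over level and space)
    gc = [sigmoid(mean(F[l][s][c] for l in range(L) for s in range(S)))
          for c in range(C)]
    return [[[F[l][s][c] * gc[c] for c in range(C)]
             for s in range(S)] for l in range(L)]

out = dyhead_block([[[1.0, -1.0]] * 4] * 3)   # L=3, S=4, C=2
```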
Experiments and Analysis
1. Experimental Setup and Datasets
Our experiments are based on PyTorch 2.6.0, using YOLOv11s as the baseline. The primary dataset is VisDrone2019, a large-scale benchmark collected by the AISKYEYE team using UAV platforms. It contains 8,599 static images captured in various urban and suburban environments: 6,471 for training, 548 for validation, and 1,610 for testing. It covers 10 object categories common in UAV surveillance: pedestrian, people, bicycle, car, van, truck, tricycle, awning-tricycle, bus, and motor. To validate generalization, we also test on the TinyPerson dataset, which focuses on extremely small, distant human targets.
The training parameters are summarized below:
| Parameter | Value |
|---|---|
| Input Size | 640 × 640 |
| Batch Size | 4 |
| Initial Learning Rate (lr0) | 0.01 |
| Final Learning Rate (lrf) | 0.01 |
| Optimizer | SGD |
| Momentum | 0.937 |
| Epochs | 200 |
| Workers | 8 |
2. Evaluation Metrics
We use standard object detection metrics: Precision (P), Recall (R), mean Average Precision at IoU threshold 0.5 (mAP@0.5), and mean Average Precision averaged over IoU thresholds from 0.5 to 0.95 with a 0.05 step (mAP@0.5:0.95). We also report model size via Parameters (Params) and computational complexity via Giga Floating Point Operations (GFLOPs).
$$ \text{Precision} = \frac{TP}{TP + FP} $$
$$ \text{Recall} = \frac{TP}{TP + FN} $$
$$ \text{AP} = \int_{0}^{1} P(R) \, dR $$
$$ \text{mAP} = \frac{1}{N} \sum_{i=1}^{N} \text{AP}_i $$
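These definitions can be checked numerically for a hypothetical ranked detection list (1 = true positive, 0 = false positive); IoU-based matching and the multi-point interpolation used by real evaluators are omitted for brevity:

```python
# Precision/recall curve and AP for a ranked detection list, integrating
# precision over recall with simple rectangles (non-interpolated AP).
def precision_recall_curve(ranked_tp, n_gt):
    tp = fp = 0
    points = []
    for is_tp in ranked_tp:
        tp += is_tp
        fp += 1 - is_tp
        points.append((tp / (tp + fp), tp / n_gt))   # (precision, recall)
    return points

def average_precision(points):
    ap, prev_r = 0.0, 0.0
    for p, r in points:
        ap += p * (r - prev_r)                       # rectangle integration
        prev_r = r
    return ap

pts = precision_recall_curve([1, 1, 0, 1, 0], n_gt=4)
ap = average_precision(pts)   # mAP would average this over all classes
```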
3. Comparative Experimental Analysis
We compare HMD-YOLO against several state-of-the-art detectors on the VisDrone2019 dataset. The results, detailed in the table below, demonstrate the effectiveness of our proposed model for UAV-based detection.
| Model | P (%) | R (%) | mAP@0.5 (%) | mAP@0.5:0.95 (%) | Params (M) | GFLOPs (G) |
|---|---|---|---|---|---|---|
| Fast R-CNN | 31.2 | 23.6 | 21.6 | 11.9 | 41.5 | 201.3 |
| RetinaNet | 20.5 | 10.2 | 29.3 | 18.6 | 18.9 | 92.8 |
| YOLOv5s | 45.3 | 35.1 | 37.2 | 21.9 | 9.21 | 23.2 |
| YOLOv7-tiny | 43.1 | 33.4 | 35.6 | 19.2 | 5.96 | 12.8 |
| YOLOv8n | 42.4 | 31.2 | 31.6 | 17.5 | 3.12 | 8.3 |
| YOLOv8s | 47.8 | 38.9 | 38.6 | 23.5 | 11.21 | 28.9 |
| YOLOv11n | 43.2 | 33.2 | 33.1 | 20.3 | 2.59 | 6.2 |
| YOLOv11s (Baseline) | 48.0 | 37.2 | 37.8 | 22.5 | 9.4 | 21.3 |
| WT-YOLO | 54.4 | – | 46.2 | 28.2 | 6.9 | 27.6 |
| HM-YOLO | 56.5 | 43.4 | 46.2 | 28.4 | 8.3 | 35.0 |
| HMD-YOLO (Ours) | 57.9 | 47.1 | 49.9 | 30.7 | 8.7 | 26.7 |
Our HMD-YOLO achieves the highest Precision, Recall, mAP@0.5, and mAP@0.5:0.95 among the compared models. Notably, it surpasses the baseline YOLOv11s by significant margins of +12.1 percentage points in mAP@0.5 and +8.2 points in mAP@0.5:0.95, while slightly reducing the parameter count. It also outperforms other recent UAV-specific models such as WT-YOLO and HM-YOLO on accuracy metrics while maintaining a lower computational cost than HM-YOLO, demonstrating an excellent balance of performance and efficiency for deployment on UAV platforms.
4. Ablation Study
We conduct ablation studies to validate the contribution of each proposed component. The results are systematically presented in the following table.
| Configuration | P (%) | R (%) | mAP@0.5 (%) | mAP@0.5:0.95 (%) | Params (M) |
|---|---|---|---|---|---|
| YOLOv11s (Baseline) | 48.0 | 37.2 | 37.8 | 22.5 | 9.4 |
| + HR-MSCA | 52.5 | 41.2 | 43.0 | 26.1 | 7.6 |
| + Litesample | 48.6 | 37.6 | 38.9 | 23.3 | 9.7 |
| + WIoU | 49.7 | 38.1 | 38.9 | 23.2 | 9.4 |
| + DyHead | 51.7 | 38.7 | 40.7 | 24.6 | 8.6 |
| + HR-MSCA + WIoU | 53.0 | 41.5 | 43.4 | 26.1 | 7.6 |
| + HR-MSCA + WIoU + Litesample | 53.2 | 41.5 | 43.8 | 26.5 | 7.7 |
| HMD-YOLO (All) | 57.9 | 47.1 | 49.9 | 30.7 | 8.7 |
The ablation study clearly shows the incremental benefit of each module. The HR-MSCA module provides the most substantial individual boost (+5.2 points in mAP@0.5) while also reducing the parameter count, validating its importance for small object detection in UAV imagery. The Litesample and WIoU modules offer further precision and recall improvements, and DyHead contributes notably to accuracy. Combined in HMD-YOLO, the components work synergistically to deliver the best overall performance: 57.9% Precision, 47.1% Recall, 49.9% mAP@0.5, and 30.7% mAP@0.5:0.95, with a model size of only 8.7M parameters.
5. Generalization Validation on TinyPerson
To verify the robustness and general applicability of our approach beyond the VisDrone2019 dataset, we evaluate HMD-YOLO on the TinyPerson dataset. The results confirm strong generalization capability.
| Model | P (%) | R (%) | mAP@0.5 (%) | mAP@0.5:0.95 (%) |
|---|---|---|---|---|
| YOLOv11s | 33.6 | 21.6 | 20.3 | 6.3 |
| HMD-YOLO | 37.5 | 30.1 | 26.2 | 8.7 |
HMD-YOLO consistently outperforms the baseline on this challenging dataset of extremely small persons, demonstrating that the improvements are not dataset-specific and remain effective for the core problem of tiny object detection relevant to long-range UAV surveillance.
Conclusion and Future Work
In this paper, we presented HMD-YOLO, an enhanced object detection model designed to address the specific challenges of small, dense, and occluded object detection in UAV aerial imagery. By integrating a novel HR-MSCA module for multi-scale high-resolution feature enhancement, an efficient Litesample upsampler, a robust Wise-IoU loss function, and a powerful Dynamic Detection Head, the proposed model achieves a significant performance boost over the strong YOLOv11s baseline and other state-of-the-art methods. Comprehensive experiments on the VisDrone2019 benchmark and the TinyPerson dataset validate the effectiveness and generalization ability of our approach. The model maintains a compact size, making it suitable for deployment on resource-constrained UAV platforms where real-time performance is essential.
Future work will focus on further lightening the architecture without compromising accuracy, exploring knowledge distillation and neural architecture search to develop even more efficient variants tailored for edge deployment on UAV systems. Additionally, extending the framework to video sequences, tracking small objects across frames in UAV footage, is a promising direction for enhancing situational awareness in dynamic scenarios.
