In recent years, the widespread deployment of unmanned aerial vehicles (UAVs) in industrial inspection, agricultural monitoring, power line patrol, emergency rescue, and military reconnaissance has generated massive amounts of aerial imagery. However, detecting small objects in such complex scenes remains challenging because of their low pixel resolution, dense distribution, strong background interference, and the limited computational resources available on UAV platforms. To address these issues, we propose YOLO-DoS, a lightweight detection algorithm built upon the YOLO11 framework. Our approach integrates a multi-scale nonlinear feature collaborative attention module, a decoupled fully connected progressive semantic injection fusion network, a lightweight detection head, a shape-normalized Wasserstein distance loss, and a knowledge distillation strategy. Experiments on the VisDrone2019 dataset show that YOLO-DoS reaches 36.2% mAP50 and 21.6% mAP50:95, compared with 33.5% and 19.5% for YOLO11n, with 8.06 M parameters and 14.9 GFLOPs, a model size intended for real-time inference on UAVs.

1. Introduction
The rapid growth of UAV applications has led to the generation of large-scale aerial image datasets. Unlike general object detection scenarios, UAV images are characterized by small object scales, severe occlusion, cluttered backgrounds, and variable lighting conditions. Traditional deep learning-based detectors, such as the two-stage Faster R-CNN and single-stage YOLO variants, often struggle to extract discriminative features from tiny targets under these constraints. Moreover, the stringent storage and power limits of embedded UAV systems demand lightweight models that do not sacrifice accuracy.
To bridge this gap, we propose YOLO-DoS, which incorporates the following key innovations:
- A Ghost-MSNFCA (Multi-Scale Nonlinear Feature Coordinate Attention) module that enhances target contour features and suppresses background noise through multi-scale nonlinear enhancement and dual-domain collaborative attention.
- A DFC-PSIENet (Decoupled Fully Connected Progressive Semantic Injection Fusion Network) that strengthens semantic expression for small and occluded objects during feature fusion.
- A lightweight and efficient detection head (LAED) using Ghost modules and grouped convolutions to reduce computational cost.
- A Shape-NWD loss function that combines normalized Wasserstein distance with shape-aware weighting for more accurate bounding box regression.
- A knowledge distillation framework that leverages a larger teacher model to further improve the student network’s performance without increasing inference overhead.
2. Related Work
Object detection in UAV imagery has attracted considerable attention. Recent work modifies YOLO architectures by introducing attention mechanisms, multi-scale feature pyramids, or lightweight backbones. For instance, some researchers replaced traditional convolutions with Ghost modules or depthwise separable convolutions to reduce parameters, while others added extra high-resolution detection layers to better handle small objects. However, most existing methods either extract insufficient features in cluttered backgrounds or impose computational demands unsuitable for edge deployment. Our work aims for a balanced trade-off by designing a set of complementary components that collectively enhance small object detection on UAVs.
3. Methodology
3.1 Ghost-MSNFCA Module
The backbone network employed in YOLO-DoS is based on GhostNetV3, which offers an excellent compromise between efficiency and representational power. To further improve its capability in capturing weak and tiny objects, we replace the standard Ghost blocks with our proposed Ghost-MSNFCA module. The core of this module consists of two sub-modules: the Multi-Scale Nonlinear Feature Enhancement (MSNFE) and the Dual-domain Coordinate Attention Mechanism (DCAM).
The MSNFE first splits the input feature map into four branches, each processed by grouped convolutions with kernel sizes of 3, 5, 7, and 9 respectively. This multi-scale design allows the network to capture both fine local details and broader contextual information. Each branch then passes through an Enhancement Unit that applies a 1×1 pointwise convolution (with ReLU) followed by another 1×1 convolution with sigmoid activation to generate nonlinear attention weights. The enhanced feature is obtained by element-wise multiplication and residual addition:
$$ F'_i = F_i + F_i \odot F_i^w, \quad i=0,1,2,3 $$
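The residual gating in the equation above can be illustrated with a minimal sketch. It assumes the attention weights are the sigmoid of some pre-activation values; the names `enhance` and `gate_logits` and the toy numbers are hypothetical and stand in for the outputs of the two 1×1 convolutions, which are not modelled here.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def enhance(features, gate_logits):
    """Return F' = F + F * sigmoid(logit), element by element."""
    if len(features) != len(gate_logits):
        raise ValueError("features and gate_logits must have the same length")
    return [f + f * sigmoid(z) for f, z in zip(features, gate_logits)]


# Toy branch activations and hypothetical pre-sigmoid gate values.
branch = [0.5, -1.0, 2.0]
gate_logits = [0.0, 4.0, -4.0]
enhanced = enhance(branch, gate_logits)
```

Because the sigmoid weight lies in (0, 1), each value is scaled by a factor between 1 and 2, so the residual path can amplify a feature but does not push it below its original magnitude.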
After concatenating all branches, the combined feature is fed into the DCAM, which operates in parallel through two attention paths: the Spatial Attention Module (SAM) and the Frequency Attention Module (FAM). SAM aggregates global average and max pooling results, then applies a 7×7 convolution and sigmoid to produce a spatial attention map. FAM extracts high-frequency components by subtracting a low-frequency version (obtained via average pooling) from the original feature, emphasizing object boundaries. The final output of DCAM is computed as:
$$ Y_i = \text{DConv}_3(F_i) \otimes P_i \otimes H_i, \quad i=0,1,2,3 $$
where \(\otimes\) denotes element-wise multiplication, \(P_i\) is the spatial attention weight, and \(H_i\) is the frequency attention weight. The multi-scale features from all branches are then concatenated, followed by a 1×1 convolution to produce the final enhanced feature map. This design strengthens target contour information while suppressing irrelevant background, which is critical for detecting small objects in UAV images.
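As a rough illustration of the two DCAM inputs, the sketch below computes a high-frequency residual (original minus an average-pooled copy) and the per-pixel mean and max descriptors. Several details are assumptions rather than the paper's settings: the 3×3 pooling window, the border handling, and the reading of "global average and max pooling" as pooling across channels. The learned 7×7 convolution and the final weights \(P_i\) and \(H_i\) are not modelled.

```python
def box_blur(channel, radius=1):
    """Average each pixel with its neighbours (window clipped at the borders)."""
    h, w = len(channel), len(channel[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            window = [channel[j][i]
                      for j in range(max(0, y - radius), min(h, y + radius + 1))
                      for i in range(max(0, x - radius), min(w, x + radius + 1))]
            row.append(sum(window) / len(window))
        out.append(row)
    return out


def high_frequency(channel, radius=1):
    """FAM-style residual: original minus its low-pass (average-pooled) copy."""
    low = box_blur(channel, radius)
    return [[v - l for v, l in zip(row, low_row)]
            for row, low_row in zip(channel, low)]


def channel_pool(feature):
    """SAM-style descriptors: per-pixel mean and max taken across channels."""
    c = len(feature)
    h, w = len(feature[0]), len(feature[0][0])
    mean_map = [[sum(feature[k][y][x] for k in range(c)) / c for x in range(w)]
                for y in range(h)]
    max_map = [[max(feature[k][y][x] for k in range(c)) for x in range(w)]
               for y in range(h)]
    return mean_map, max_map


# Toy input: one flat channel and one channel with a vertical step edge.
flat = [[0.0] * 4 for _ in range(4)]
step = [[0.0, 0.0, 1.0, 1.0] for _ in range(4)]
edges = high_frequency(step)
mean_map, max_map = channel_pool([flat, step])
```

On the step-edge channel the residual is non-zero only in the two columns adjacent to the edge, which is the boundary-emphasizing behaviour the text attributes to FAM.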
3.2 DFC-PSIENet Feature Fusion Network
Traditional PANet-based feature fusion in YOLO11 often loses discriminative details for small objects due to simple concatenation and downsampling operations. To overcome this, we construct DFC-PSIENet, which replaces the original 80×80 detection layer with a high-resolution 160×160 layer, preserving more spatial details for tiny targets. The neck network sequentially processes features using the Decoupled Fully Connected (DFC) attention mechanism and a Progressive Semantic Infusion Enhancement Network (PSIENet).
In the DFC module, the input feature is first downsampled by a factor of 2 using average pooling to reduce computational cost. Then, depthwise separable convolutions with 1×9 and 9×1 kernels are applied along the horizontal and vertical directions to capture long-range spatial dependencies. After applying a sigmoid activation, the weights are upsampled via bilinear interpolation and multiplied with the original input to produce the reweighted feature:
$$ F_w = \text{BL}(A) \otimes f_{in} $$
The PSIENet employs a cross-scale CspModule, SPD convolutions, and bilinear interpolation to progressively inject high-level semantic information into low-level features. For a given target layer with resolution \((H_i, W_i)\), features from adjacent layers are resized to match its dimensions using either SPD (for coarser layers) or bilinear interpolation (for finer layers). After applying a 3×3 smooth convolution for refinement, all aligned feature maps are fused via Hadamard product:
$$ f_i^4 = H\left( \left[ f_{j \rightarrow i}^3, f_{j \rightarrow i}^3, f_{j \rightarrow i}^3 \right] \right) $$
This mechanism enhances the semantic expressiveness of shallow features while preserving high-resolution spatial information, significantly improving the detection of small and occluded objects in complex UAV scenes.
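Two of the operations named above are easy to show in isolation: the space-to-depth rearrangement that underlies SPD downsampling, and Hadamard (element-wise) fusion of aligned maps. The sketch below is a minimal illustration; the channel ordering produced by `space_to_depth` is an arbitrary choice, and the convolution that normally follows the rearrangement in an SPD layer is omitted.

```python
def space_to_depth(feature, scale=2):
    """Rearrange a C x H x W feature into (C*scale^2) x H/scale x W/scale.

    No values are discarded: each spatial offset (dy, dx) inside a
    scale x scale cell becomes its own group of output channels.
    """
    h, w = len(feature[0]), len(feature[0][0])
    if h % scale or w % scale:
        raise ValueError("H and W must be divisible by scale")
    out = []
    for dy in range(scale):
        for dx in range(scale):
            for channel in feature:
                out.append([row[dx::scale] for row in channel[dy::scale]])
    return out


def hadamard(maps):
    """Element-wise product of equally shaped H x W maps."""
    out = [row[:] for row in maps[0]]
    for m in maps[1:]:
        out = [[a * b for a, b in zip(r1, r2)] for r1, r2 in zip(out, m)]
    return out


# One 4 x 4 channel holding the values 0..15.
grid = [[[4 * r + c for c in range(4)] for r in range(4)]]
packed = space_to_depth(grid)
fused = hadamard([[[1, 2]], [[3, 4]], [[2, 2]]])
```

Unlike strided convolution or pooling, the rearrangement halves the resolution without dropping any pixel, which is why it suits layers that must keep small-object detail.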
3.3 Lightweight and Efficient Detection Head (LAED)
The original YOLO11 decoupled head uses standard 3×3 convolutions in both the classification and regression branches, leading to high parameter counts. We observe that the regression branch focuses on geometric structure and can tolerate more redundant information, while the classification branch relies more on deep semantics. Therefore, we replace the standard convolutions accordingly: in the regression branch, we use a Ghost module that first halves the channels via a 1×1 convolution and then applies cheap 3×3 convolution operations to generate additional feature maps; in the classification branch, we employ grouped convolutions with a group size of 16. The parameter counts for each branch are:
$$ P_{\text{ghost}} = C_{in} \times \frac{C_{out}}{2} + C_{in} \times \frac{C_{out}}{4} \times 9 $$
$$ P_{\text{group}} = \frac{C_{in} \times C_{out} \times 9}{g} $$
This design reduces the detection head’s parameters by approximately 30% while maintaining accuracy, making it well suited to deployment on resource-constrained UAVs.
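For a sense of scale, the two formulas can be evaluated for a single layer and compared with a standard 3×3 convolution, whose weight count is \(C_{in} \times C_{out} \times 9\). The sketch below simply transcribes the expressions as written, ignoring biases and normalization; the channel width of 64 is an arbitrary example, not a value taken from the paper.

```python
def standard_conv_params(c_in: int, c_out: int, k: int = 3) -> int:
    """Weights of a dense k x k convolution (no bias)."""
    return c_in * c_out * k * k


def ghost_params(c_in: int, c_out: int) -> int:
    """P_ghost as written: 1x1 primary conv plus the 3x3 cheap-operation term."""
    return c_in * (c_out // 2) + c_in * (c_out // 4) * 9


def grouped_conv_params(c_in: int, c_out: int, groups: int = 16) -> int:
    """P_group as written: a 3x3 convolution divided by g."""
    return c_in * c_out * 9 // groups


c_in = c_out = 64  # hypothetical layer width
dense = standard_conv_params(c_in, c_out)
ghost = ghost_params(c_in, c_out)
grouped = grouped_conv_params(c_in, c_out)
print(f"dense={dense}, ghost={ghost} ({ghost / dense:.1%}), "
      f"grouped={grouped} ({grouped / dense:.1%})")
```

These are per-layer figures only. The overall reduction reported for the head depends on every layer it contains, so this arithmetic is not expected to reproduce the 30% figure.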
3.4 Shape-NWD Loss Function
Conventional IoU-based loss functions are sensitive to the shape and scale of bounding boxes, especially for tiny objects. We propose Shape-NWD, which combines the normalized Wasserstein distance with shape-aware weighting inspired by Shape-IoU. The key idea is to assign higher weight to the longer side of the ground-truth box, reflecting the physical intuition that long, thin objects are more sensitive to displacement along their major axis. The loss is defined as follows:
$$ hh = \frac{2 \cdot (h_{gt} / \text{ratio})}{(h_{gt} / \text{ratio}) + (w_{gt} / \text{ratio})} $$
$$ ww = \frac{2 \cdot (w_{gt} / \text{ratio})}{(h_{gt} / \text{ratio}) + (w_{gt} / \text{ratio})} $$
$$ b = \frac{(w - w_{gt})^2 + (h - h_{gt})^2}{2 \times \text{weight}^2}, \quad \text{weight}=2 $$
$$ d = hh \times \left( \frac{x_c - x_c^{gt}}{c} \right)^2 + ww \times \left( \frac{y_c - y_c^{gt}}{c} \right)^2 + b $$
$$ \text{Shape\_NWD} = \exp\left( -\frac{d}{c} \right) $$
$$ L_{\text{Shape-NWD}} = 1 - \text{Shape\_NWD} $$
where \((x_c, y_c, h, w)\) and \((x_c^{gt}, y_c^{gt}, h_{gt}, w_{gt})\) denote the predicted and ground-truth bounding box parameters respectively, and \(c\) is a dataset-dependent constant (we set \(c=300\) for VisDrone). This loss provides smoother gradients for small objects, leading to more accurate regression.
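The loss can be transcribed directly from the equations for a single box pair, as in the sketch below. It follows the formulas as written; the default `ratio=1.0` is an assumption, since the text gives no value for it, and the function name and box values are illustrative only.

```python
import math


def shape_nwd_loss(pred, gt, c=300.0, ratio=1.0, weight=2.0):
    """L = 1 - exp(-d / c) for boxes given as (x_c, y_c, h, w)."""
    xc, yc, h, w = pred
    xg, yg, hg, wg = gt
    hs, ws = hg / ratio, wg / ratio
    hh = 2.0 * hs / (hs + ws)          # weight on the x-offset term
    ww = 2.0 * ws / (hs + ws)          # weight on the y-offset term
    b = ((w - wg) ** 2 + (h - hg) ** 2) / (2.0 * weight ** 2)
    d = hh * ((xc - xg) / c) ** 2 + ww * ((yc - yg) / c) ** 2 + b
    return 1.0 - math.exp(-d / c)


gt_box = (100.0, 100.0, 20.0, 10.0)
perfect = shape_nwd_loss(gt_box, gt_box)
size_off = shape_nwd_loss((100.0, 100.0, 22.0, 12.0), gt_box)
size_off_more = shape_nwd_loss((100.0, 100.0, 26.0, 16.0), gt_box)
```

The loss is zero for a perfect match and stays strictly between 0 and 1 otherwise, growing smoothly with the size and centre errors rather than dropping to a flat region when boxes stop overlapping, as IoU-based losses do.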
3.5 Knowledge Distillation
To further boost performance without increasing inference complexity, we apply masked generative distillation (MGD) to the backbone of YOLO-DoS. A larger teacher model (YOLO11m) is used to guide the student during training. The distillation loss is added to the total objective with a weight \(\alpha\). We select \(\alpha=0.3\) based on ablation experiments. The distillation focuses on four feature maps from the backbone, effectively transferring richer semantic knowledge to the compact student model.
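The way the distillation term enters the total objective can be sketched as follows. This is a deliberate simplification: a plain mean-squared error between student and teacher feature maps stands in for the MGD feature loss, whose random masking and generation block are not modelled, and the function names and toy values are hypothetical.

```python
def mse(a, b):
    """Mean squared error between two equally sized flat feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def total_loss(det_loss, student_feats, teacher_feats, alpha=0.3):
    """Detection loss plus alpha times the summed per-layer feature loss."""
    if len(student_feats) != len(teacher_feats):
        raise ValueError("student and teacher must expose the same layers")
    distill = sum(mse(s, t) for s, t in zip(student_feats, teacher_feats))
    return det_loss + alpha * distill


# Four toy backbone feature maps, flattened; only the first one differs.
teacher = [[1.0, 1.0], [0.5, 0.5], [0.0, 0.0], [2.0, 2.0]]
student = [[0.0, 0.0], [0.5, 0.5], [0.0, 0.0], [2.0, 2.0]]
loss_with_gap = total_loss(2.0, student, teacher)
loss_matched = total_loss(2.0, teacher, teacher)
```

Since the teacher is used only to compute this training-time term, the deployed student has the same architecture and cost with or without distillation.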
4. Experiments
4.1 Dataset and Settings
All experiments are conducted on the VisDrone2019 dataset, which contains 10,209 images (6,471 for training, 548 for validation, 3,190 for testing) annotated with 10 categories. Images have resolutions of 960×540 and 1360×765. We use an NVIDIA RTX A2000 GPU (8GB) with PyTorch 2.3.0, CUDA 11.8, and Python 3.10. Training hyperparameters: epochs=100, img_size=640, batch_size=16, optimizer=SGD, workers=6.
4.2 Ablation Studies
We conduct single-module and multi-module ablation experiments to verify each component’s contribution. Table 1 reports the results for single-module improvements over the baseline YOLO11n with GhostNetV3 backbone.
| Ghost-MSNFCA | DFC-PSIENet | LAED | Shape-NWD | mAP50/% | mAP50:95/% | Params (M) | GFLOPs |
|---|---|---|---|---|---|---|---|
| √ | | | | 28.9 | 16.5 | 7.3 | 11.5 |
| | √ | | | 31.5 | 18.2 | 8.1 | 14.6 |
| | | √ | | 31.3 | 17.9 | 7.95 | 12.4 |
| | | | √ | 31.1 | 17.6 | 7.23 | 11.3 |
Table 2 shows the effect of different numbers of branches in the MSNFE module.
| Branches | Kernel sizes | mAP50/% | mAP50:95/% | Params (M) |
|---|---|---|---|---|
| 3 | 3,5,7 | 27.7 | 14.4 | 6.9 |
| 4 | 3,5,7,9 | 28.9 | 16.5 | 7.1 |
| 5 | 3,5,7,9,11 | 28.3 | 15.5 | 7.2 |
Table 3 compares different fusion strategies for DCAM.
| Method | mAP50/% | mAP50:95/% | Params (M) |
|---|---|---|---|
| SAM only | 26.9 | 14.1 | 7.1 |
| FAM only | 27.3 | 14.3 | 7.2 |
| Cascade | 28.4 | 15.4 | 7.2 |
| Parallel (ours) | 28.9 | 16.5 | 7.3 |
Table 4 evaluates the impact of group number in the classification head.
| Group size (k) | mAP50/% | mAP50:95/% | Params (M) | GFLOPs |
|---|---|---|---|---|
| 8 | 31.7 | 17.8 | 2.41 | 5.9 |
| 16 | 31.6 | 17.9 | 2.47 | 6.2 |
| 32 | 31.3 | 17.5 | 2.39 | 5.7 |
Multi-module ablation results are shown in Table 5.
| Ghost-MSNFCA | DFC-PSIENet | LAED | Shape-NWD | mAP50/% | mAP50:95/% | Params (M) | GFLOPs |
|---|---|---|---|---|---|---|---|
| √ | √ | 31.6 | 18.2 | 8.1 | 14.6 | ||
| √ | √ | 31.3 | 17.9 | 7.95 | 12.4 | ||
| √ | √ | √ | 33.0 | 18.2 | 8.06 | 15.5 | |
| √ | √ | √ | √ | 33.1 | 18.7 | 8.06 | 14.9 |
4.3 Loss Function Comparison
We compare several loss functions in Table 6. Shape-NWD achieves the best results, exceeding CIoU by 1.8 points in mAP50 and 2.0 points in mAP50:95.
| Loss | mAP50/% | mAP50:95/% |
|---|---|---|
| CIoU | 31.3 | 16.7 |
| Inner-CIoU | 31.5 | 17.7 |
| 0.7×ShapeIoU+0.3×DotDistance | 31.5 | 17.5 |
| Shape-NWD (ours) | 33.1 | 18.7 |
4.4 Knowledge Distillation
We select YOLO11m as the teacher model after comparing YOLO11s, YOLO11m, and YOLO11l. Results with different distillation weights are shown in Table 7.
| Distillation weight α | mAP50/% | mAP50:95/% | GFLOPs |
|---|---|---|---|
| 0.3 | 36.2 | 21.6 | 14.9 |
| 0.5 | 31.5 | 18.9 | 14.5 |
4.5 Comparison with State-of-the-Art
Table 8 compares YOLO-DoS with several mainstream detectors on VisDrone2019.
| Model | mAP50/% | mAP50:95/% | Params (M) | GFLOPs |
|---|---|---|---|---|
| Faster R-CNN | 22.3 | 16.3 | 41.39 | – |
| RetinaNet | 24.1 | 16.9 | 36.59 | – |
| YOLOv5s | 29.7 | 16.2 | 7.04 | – |
| YOLOv8s | 29.0 | 16.6 | 3.0 | – |
| YOLO11n | 33.5 | 19.5 | 2.6 | – |
| YOLO-DoS (ours) | 36.2 | 21.6 | 8.06 | 14.9 |
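YOLO-DoS achieves the highest mAP50 (36.2%) and mAP50:95 (21.6%) among the compared detectors, 2.7 and 2.1 points above YOLO11n respectively. At 8.06 M parameters it is larger than YOLO11n (2.6 M) but far smaller than Faster R-CNN and RetinaNet, a moderate model size aimed at real-time UAV applications.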
4.6 Visualization Analysis
We visualize gradient distributions and heatmaps to understand the improvements. The gradient analysis shows that YOLO-DoS maintains more stable and uniform gradient propagation throughout the network, avoiding the vanishing gradients observed in the baseline. Heatmap comparisons reveal that our model produces more concentrated and continuous activation on target regions while suppressing background responses. In challenging scenarios such as occlusion, dense crowds, low-light conditions, and cluttered backgrounds, YOLO-DoS produces fewer false negatives and false positives than YOLO11n.
5. Conclusion
In this work, we presented YOLO-DoS, a lightweight and accurate detection algorithm tailored for small objects in complex UAV images. By integrating the Ghost-MSNFCA module for enhanced feature extraction, the DFC-PSIENet for improved semantic fusion, the LAED head for computational efficiency, and the Shape-NWD loss for precise regression, our model achieves the best accuracy among the compared detectors on the VisDrone2019 dataset. Knowledge distillation further boosts accuracy without extra inference cost. The ablation and comparison experiments support the contribution of each component. Future work will focus on extending the distillation to the neck and detection head, and on exploring more efficient backbone designs to further reduce model size while maintaining high precision for UAV-based detection tasks.
