MSF-YOLO: A Feature Fusion-Enhanced Approach for Small Target Detection in UAV Imagery

The rapid proliferation and falling cost of unmanned aerial vehicle (UAV) technology have driven its widespread adoption across diverse sectors in China and worldwide, including precision agriculture, urban planning, traffic monitoring, and disaster assessment. UAVs equipped with high-resolution cameras capture vast amounts of aerial imagery, creating unprecedented demand for efficient and accurate automated analysis, at the heart of which lies object detection. Detecting objects in UAV imagery, however, poses challenges distinct from generic object detection. Targets often appear extremely small because of high flying altitudes, leaving only limited, sparse pixel-level features. Scenes are typically complex, with cluttered backgrounds (e.g., dense urban landscapes, varied terrain) that can obscure or mimic small objects. Targets such as vehicles and pedestrians may also be densely distributed, leading to severe occlusion. Together, these factors cause frequent missed detections and false alarms, hindering the practical deployment of autonomous systems that rely on UAV data.

To address these challenges, we propose MSF-YOLO, an enhanced small-target detection algorithm built on the YOLOv11n architecture. Our methodology focuses on feature fusion enhancement and strategic sampling optimization to strengthen the model's ability to discern and localize minuscule objects against complex backgrounds. The core innovations are: a PixelUnshuffle module placed after convolution to preserve spatial detail during downsampling; a DySample dynamic upsampling module for adaptive feature reconstruction; a Dual Input Feature Merge (DIFM) module that replaces standard residual connections for more effective feature integration; and an Inter Layer Feature Fusion (ILFF) module that enriches the neck network's input with multi-scale contextual information. Comprehensive evaluations on the VisDrone dataset, a benchmark of challenging UAV-captured scenes, show that MSF-YOLO significantly outperforms the YOLOv11n baseline, improving mAP@50 by 9.5 and mAP@50:95 by 7.1 percentage points. The algorithm also generalizes well to other aerial image datasets (RSOD, DOTA, and HRSC2016), confirming its robustness and practical utility for UAV imagery applications.

Related Work

Small target detection in UAV imagery has spurred extensive research, primarily along two intertwined directions: feature fusion enhancement and sampling strategy optimization.

Feature Fusion Enhancement: The core objective is to amplify target-specific information within the feature maps. Early approaches centered on feature pyramid networks (FPNs) to handle scale variation. The path aggregation network (PAN) introduced bi-directional pathways, and the bidirectional feature pyramid network (BiFPN) further reinforced information flow with additional lateral connections. However, these methods often rely on simple concatenation, where the weak signals of small targets can be overwhelmed by stronger features from larger objects or the background. Subsequent work turned to attention mechanisms, which allocate weights to highlight salient features. The squeeze-and-excitation network (SENet) introduced channel-wise attention; the convolutional block attention module (CBAM) combined channel and spatial attention; and coordinate attention (CA) incorporated positional information along the horizontal and vertical axes. The ACmix module proposed a hybrid model integrating self-attention and convolution operations. Although attention methods enhance feature discriminability, many introduce considerable parameter overhead with diminishing returns. Contemporary research explores cross-layer feature interaction and multi-module synergy. For instance, the CF-YOLO framework proposed a cross-scale feature pyramid network (CS-FPN) to mitigate information decay, together with a dual-module cooperative mechanism for spatial alignment and adaptive fusion. Nevertheless, existing methods often lack synergistic adaptation across scales and layers, and stacking multiple modules can introduce feature redundancy and reduce inference speed, which is critical for real-time processing on UAV platforms.

Sampling Strategy Optimization: Sampling strategies critically affect the preservation and recovery of the fine-grained information essential for small targets. Traditional downsampling via strided convolution readily discards the scarce features of tiny objects. While advanced methods such as adaptive downsampling (ADown) in YOLOv9 or the Context-Guided Block attempt to preserve details, their effect on detection accuracy can be inconsistent. For upsampling, classical methods such as nearest-neighbor and bilinear interpolation are computationally efficient but poor at recovering textures. The content-aware reassembly of features (CARAFE) module aggregates contextual information dynamically, but at the cost of significant parameter overhead and slower operation. The DySample upsampler, which reconstructs upsampling from a point-sampling perspective, dynamically adjusts sampling locations and weights without requiring high-resolution guidance features, offering a compelling balance between detail recovery and lightweight design for UAV applications.

Methodology: The MSF-YOLO Framework

Our work is built upon YOLOv11n, the most lightweight model in the YOLOv11 series, chosen for its suitability for real-time detection on resource-constrained UAV platforms. We introduce targeted modifications to its backbone, neck, and core building blocks to enhance small target detection. The overall MSF-YOLO architecture integrates the proposed PixelUnshuffle, DySample, C3k2_DIFM, and ILFF modules, described below.

1. PixelUnshuffle for Detail-Preserving Downsampling

The standard convolutional layers with stride in YOLOv11n’s backbone perform feature extraction and spatial reduction simultaneously. This downsampling, while necessary, inevitably discards fine-grained information, to which the already sparse features of small targets are particularly vulnerable. To mitigate this, we decouple the downsampling operation from the convolution. We modify the convolutional strides to eliminate their downsampling effect and reduce their output channels to control parameters. Subsequently, we employ a PixelUnshuffle operation. This operation rearranges pixels from the spatial dimensions into the channel dimension, effectively preserving all original information while reducing spatial resolution. For an input feature map $\mathbf{X}$ with dimensions $(C, H, W)$, PixelUnshuffle splits it into $s^2$ sub-feature maps (for a scale factor $s=2$) and concatenates them along the channel dimension:

$$ \mathbf{X}_{0,0} = \mathbf{X}[0:H:s, 0:W:s], $$
$$ \mathbf{X}_{0,1} = \mathbf{X}[0:H:s, 1:W:s], $$
$$ \mathbf{X}_{1,0} = \mathbf{X}[1:H:s, 0:W:s], $$
$$ \mathbf{X}_{1,1} = \mathbf{X}[1:H:s, 1:W:s], $$
$$ \mathbf{Y} = \text{Concat}(\mathbf{X}_{0,0}, \mathbf{X}_{0,1}, \mathbf{X}_{1,0}, \mathbf{X}_{1,1}) \rightarrow (4C, H/2, W/2). $$

This transformation converts spatial information into channel information, providing subsequent layers with richer, more detailed feature maps, which is crucial for detecting small objects in UAV imagery.
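To make the rearrangement concrete, here is a minimal NumPy sketch of the operation (equivalent in effect to PyTorch's `nn.PixelUnshuffle`; the function name and the toy input are ours, not part of the model code):

```python
import numpy as np

def pixel_unshuffle(x: np.ndarray, s: int = 2) -> np.ndarray:
    """Rearrange a (C, H, W) map into (C*s*s, H/s, W/s) without discarding any pixels."""
    C, H, W = x.shape
    assert H % s == 0 and W % s == 0, "spatial dims must be divisible by s"
    # Take the s*s interleaved sub-grids, in the same order as the equations above,
    # and stack them along the channel dimension.
    subs = [x[:, i::s, j::s] for i in range(s) for j in range(s)]
    return np.concatenate(subs, axis=0)

x = np.arange(2 * 4 * 4, dtype=np.float32).reshape(2, 4, 4)
y = pixel_unshuffle(x)  # shape (8, 2, 2): spatial detail moved into channels
```

Because the operation is a pure rearrangement, it is lossless: every input value survives, unlike strided convolution, which aggregates and discards.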

2. DySample for Adaptive Feature Upsampling

YOLOv11n employs nearest-neighbor interpolation for upsampling, which is fast but often produces jagged edges and fails to recover nuanced details. We replace it with DySample, a lightweight, content-aware dynamic upsampler. DySample dynamically adjusts sampling positions based on the input feature content, allowing for precise reconstruction of edges and textures critical for small targets. Its operation can be summarized in two phases: generating a dynamic sampling point set $\mathcal{S}$ and performing the sampling. First, a base sampling grid $\mathbf{G}$ is created based on the input size and scale factor $s$. Then, offset maps $\mathbf{O}$ are generated via learnable linear layers, combining a static scaling factor (0.25) and a dynamic scaling factor learned from the input features:

$$ \mathbf{O}_{\text{static}} = 0.25 \cdot \text{Linear}(\mathbf{X}), $$
$$ \mathbf{O}_{\text{dynamic}} = 0.5 \cdot \sigma(\text{Linear}_2(\text{Linear}_1(\mathbf{X}))) \cdot \text{Linear}_1(\mathbf{X}), $$
$$ \mathbf{O} = \mathbf{O}_{\text{static}} + \mathbf{O}_{\text{dynamic}}. $$

The final sampling points are $\mathcal{S} = \mathbf{G} + \mathbf{O}$. A `grid_sample` operation with bilinear interpolation then produces the upsampled feature map from $\mathcal{S}$. This adaptive mechanism allocates more sampling capacity to intricate areas such as small targets while simplifying sampling in homogeneous regions, enhancing detail recovery without a heavy computational burden, a vital property when processing video streams on UAV platforms.
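The sampling phase can be sketched in NumPy as follows. This is an illustrative stand-in for `grid_sample` with bilinear interpolation; the learned offset layers are omitted and replaced by a fixed constant, purely for clarity:

```python
import numpy as np

def bilinear_sample(feat: np.ndarray, ys: np.ndarray, xs: np.ndarray) -> np.ndarray:
    """Sample a (H, W) feature map at fractional points S = G + O via bilinear interpolation."""
    H, W = feat.shape
    ys = np.clip(ys, 0, H - 1)
    xs = np.clip(xs, 0, W - 1)
    y0 = np.floor(ys).astype(int); x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, H - 1); x1 = np.minimum(x0 + 1, W - 1)
    wy = ys - y0; wx = xs - x0
    return ((1 - wy) * (1 - wx) * feat[y0, x0] + (1 - wy) * wx * feat[y0, x1]
            + wy * (1 - wx) * feat[y1, x0] + wy * wx * feat[y1, x1])

feat = np.array([[0.0, 1.0], [2.0, 3.0]])
# Base sampling grid G for a 2x2 -> 4x4 upsampling, in input coordinates.
ys, xs = np.meshgrid(np.linspace(0, 1, 4), np.linspace(0, 1, 4), indexing="ij")
offset = 0.1  # stands in for the learned offset O; constant here for illustration
up = bilinear_sample(feat, ys + offset, xs + offset)
```

In DySample proper, `offset` would vary per location as a function of the input features, shifting sampling points toward informative content instead of uniformly.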

3. DIFM: Dual Input Feature Merge Module

The bottleneck structure within the C3k2 module, a core feature extractor in YOLOv11n, uses residual connections to alleviate gradient vanishing. However, the simple addition in a residual connection can dilute the weak features of small targets with stronger background features. We redesign this connection as a Dual Input Feature Merge (DIFM) module. DIFM first concatenates the two input feature maps and applies a pointwise convolution for channel compression and fusion. It then processes the fused features through parallel pathways: a channel attention module (CAM) that highlights relevant channels, and a multi-scale spatial branch employing $3\times3$, $5\times5$, and $7\times7$ depthwise convolutions followed by a spatial attention module (SAM) that emphasizes target regions. The outputs of the two attention-enhanced pathways are summed and projected to the desired output dimension by another pointwise convolution. This yields a more refined integration of multi-scale spatial and channel-wise information, actively strengthening small target features while suppressing the irrelevant background clutter common in UAV scenes.
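As an illustrative sketch of one ingredient of DIFM, the channel attention step can be written in NumPy as a squeeze-and-excitation-style gate. The weights `W1` and `W2` are hypothetical stand-ins for the learned layers, and the parallel multi-scale depthwise/SAM branch is omitted:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(x: np.ndarray, W1: np.ndarray, W2: np.ndarray) -> np.ndarray:
    """Reweight channels of a (C, H, W) map: squeeze (global avg pool) -> MLP -> sigmoid -> scale."""
    squeeze = x.mean(axis=(1, 2))                           # (C,) per-channel statistics
    gate = sigmoid(W2 @ np.maximum(W1 @ squeeze, 0.0))      # (C,) gates in (0, 1)
    return x * gate[:, None, None]                          # broadcast over H, W

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4, 4))
W1 = rng.standard_normal((2, 8)) * 0.1   # bottleneck MLP, reduction ratio 4 (assumed)
W2 = rng.standard_normal((8, 2)) * 0.1
y = channel_attention(x, W1, W2)
```

Because the gates lie in (0, 1), the module can only attenuate channels it deems irrelevant, which is what lets DIFM suppress background-dominated channels rather than blindly summing them as a residual addition would.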

4. ILFF: Inter Layer Feature Fusion Module

The backbone network produces feature maps at different depths, each carrying distinct semantic and detail information. Deeper maps have larger receptive fields but lower resolution, often leaving too few pixels to represent a small target and leading to missed detections; shallower maps retain higher resolution and finer detail but lack semantic strength. To harness their complementary strengths, we introduce the Inter Layer Feature Fusion (ILFF) module between the backbone and the neck network. ILFF takes a deeper feature map and a shallower one: it upsamples the deeper map for spatial alignment, concatenates it with the shallower map, passes the result through sequential channel and spatial attention modules (CAM and SAM) for adaptive recalibration, and finally compresses the channels with a pointwise convolution to remove redundancy. This transfers and blends semantic context from deep layers into the detail-rich shallow layers, providing the neck with feature maps that are both semantically meaningful and rich in spatial detail, thereby significantly boosting small target detection accuracy in UAV imagery.
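A minimal NumPy sketch of the ILFF data flow makes the shape bookkeeping explicit; nearest-neighbor upsampling and a plain matrix multiply stand in for the module's learned upsampler and pointwise convolution, and the attention stages are omitted:

```python
import numpy as np

def ilff_sketch(deep: np.ndarray, shallow: np.ndarray, Wp: np.ndarray) -> np.ndarray:
    """Align a deep (C2, H/2, W/2) map with a shallow (C1, H, W) map, concat, 1x1-project."""
    up = deep.repeat(2, axis=1).repeat(2, axis=2)   # 2x nearest-neighbor upsample
    fused = np.concatenate([up, shallow], axis=0)   # (C1 + C2, H, W)
    # A pointwise (1x1) convolution is a matrix multiply over the channel axis.
    C, H, W = fused.shape
    return (Wp @ fused.reshape(C, H * W)).reshape(-1, H, W)

deep = np.ones((16, 4, 4))       # deeper, semantically strong map
shallow = np.ones((8, 8, 8))     # shallower, detail-rich map
Wp = np.full((8, 24), 1.0 / 24)  # hypothetical 1x1-conv weights, 24 -> 8 channels
out = ilff_sketch(deep, shallow, Wp)  # (8, 8, 8): shallow resolution, fused content
```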

Experiments and Analysis

Datasets: We primarily evaluate our model on the VisDrone2019-DET dataset, a large-scale benchmark collected by UAV platforms over various urban and rural areas of China, covering 10 object categories such as pedestrian, car, and bicycle. Its high density of small and tiny targets makes it ideal for validating small target detection algorithms. To assess generalization, we also test on the RSOD (aircraft, playground, etc.), DOTA (large-scale aerial images with 16 categories), and HRSC2016 (ship detection) datasets.

Implementation Details: Experiments were conducted on a Tesla T4 GPU. Models were trained for 400 epochs with an image size of $640 \times 640$, a batch size of 9, using SGD optimizer with an initial learning rate of 0.01, momentum of 0.937, and weight decay of 0.0005.

Evaluation Metrics: We use standard object detection metrics: Precision (P), Recall (R), mean Average Precision at IoU=0.5 (mAP@50), and mAP averaged over IoU from 0.5 to 0.95 with a 0.05 step (mAP@50:95). We also report model size via Parameters (M), computational complexity via GFLOPs, inference speed via FPS (Frames Per Second), and GPU memory usage during inference.
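For reference, the IoU underlying mAP@50 and mAP@50:95 can be computed as below (axis-aligned boxes in (x1, y1, x2, y2) format assumed):

```python
def iou(a, b):
    """Intersection-over-union of two axis-aligned (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

# A prediction counts as a true positive for mAP@50 when its IoU with a
# ground-truth box is at least 0.5; mAP@50:95 averages AP over ten thresholds.
thresholds = [0.5 + 0.05 * k for k in range(10)]  # 0.5, 0.55, ..., 0.95
```

Stricter thresholds penalize loose localization, which is why mAP@50:95 is a harder metric for small targets whose boxes span only a few pixels.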

Ablation Studies

To understand the contribution of each proposed component, we conduct ablation studies on the VisDrone validation set, starting from the YOLOv11n baseline. Results are summarized in the following tables.

Single-Module Ablation: This experiment adds one proposed module at a time to the baseline.

| Configuration | P(%) | R(%) | mAP50(%) | mAP50-95(%) | Params(M) | GFLOPs | FPS | Mem(MB) |
|---|---|---|---|---|---|---|---|---|
| YOLOv11n (baseline) | 46.5 | 34.6 | 35.1 | 20.6 | 2.6 | 6.3 | 214 | 94.5 |
| + DySample | 44.1 | 34.1 | 33.7 | 19.7 | 2.6 | 6.3 | 213 | 94.5 |
| + PixelUnshuffle | 47.9 | 36.0 | 37.0 | 21.9 | 2.6 | 10.7 | 181 | 106.3 |
| + C3k2_DIFM | 45.4 | 34.1 | 34.4 | 20.0 | 2.7 | 7.6 | 205 | 96.6 |
| + ILFF | 52.6 | 40.2 | 41.6 | 25.5 | 2.7 | 8.5 | 197 | 150.5 |

Analysis: PixelUnshuffle provides a clear boost in mAP50 (+1.9 points), confirming its role in preserving detail. ILFF yields the largest improvement (+6.5 points mAP50), underscoring the value of cross-layer feature fusion for small targets in UAV data. DySample and C3k2_DIFM alone slightly hurt performance, indicating that they need to be combined with the other components to be effective.

Multi-Module Ablation: We progressively combine modules to find the optimal configuration.

| Configuration | P(%) | R(%) | mAP50(%) | mAP50-95(%) | Params(M) | GFLOPs | FPS | Mem(MB) |
|---|---|---|---|---|---|---|---|---|
| DySample + PixelUnshuffle | 47.9 | 36.1 | 36.7 | 21.9 | 2.6 | 10.7 | 187 | 106.3 |
| DySample + C3k2_DIFM | 45.8 | 35.2 | 35.2 | 20.6 | 2.7 | 7.6 | 202 | 96.6 |
| DySample + ILFF | 52.9 | 39.9 | 41.7 | 25.6 | 2.7 | 8.5 | 195 | 150.5 |
| DySample + PixelUnshuffle + C3k2_DIFM | 48.8 | 36.7 | 37.3 | 22.3 | 2.7 | 12.0 | 181 | 105.2 |
| DySample + PixelUnshuffle + ILFF | 52.2 | 42.1 | 43.5 | 26.3 | 2.8 | 13.2 | 199 | 152.7 |
| DySample + C3k2_DIFM + ILFF | 54.1 | 41.8 | 43.7 | 26.9 | 2.7 | 12.6 | 181 | 167.7 |
| All four modules (MSF-YOLO) | 53.3 | 43.4 | 44.6 | 27.7 | 2.8 | 17.5 | 172 | 180.5 |

Analysis: Combining ILFF with either C3k2_DIFM or DySample yields significant gains over ILFF alone. The full combination of all four modules (MSF-YOLO) performs best: 44.6% mAP50, a 9.5-point absolute improvement over the YOLOv11n baseline, with only a 0.2M parameter increase. FPS remains high at 172, showing that our enhancements tackle the small target detection challenge in UAV imagery while remaining real-time.

Comparative Experiments

We compare MSF-YOLO against other YOLO variants and recent state-of-the-art methods on the VisDrone dataset.

| Model | P(%) | R(%) | mAP50(%) | mAP50-95(%) | Params(M) | GFLOPs | FPS | Mem(MB) |
|---|---|---|---|---|---|---|---|---|
| YOLOv11n (baseline) | 46.5 | 34.6 | 35.1 | 20.6 | 2.6 | 6.3 | 214 | 94.5 |
| YOLOv11s | 53.9 | 39.7 | 41.2 | 24.9 | 9.4 | 21.3 | 156 | 155.8 |
| YOLOv8n | 44.3 | 35.6 | 35.2 | 20.6 | 3.0 | 8.1 | 212 | 96.6 |
| YOLOv10n | 44.3 | 33.3 | 33.7 | 19.7 | 2.7 | 8.2 | 213 | 96.6 |
| RTMDet-Tiny | 49.5 | 36.3 | 37.8 | 22.7 | 4.9 | 8.0 | – | – |
| RT-DETR-R18 | 55.4 | 37.6 | 36.6 | 21.0 | 19.9 | 57.3 | 60 | 223 |
| MSF-YOLO (ours) | 53.3 | 43.4 | 44.6 | 27.7 | 2.8 | 17.5 | 172 | 180.5 |

Analysis: MSF-YOLO surpasses comparable lightweight models (YOLOv8n, YOLOv10n, RTMDet-Tiny) by a large margin in mAP50. It even outperforms the larger YOLOv11s while using roughly 70% fewer parameters. Compared with transformer-based detectors such as RT-DETR, it achieves higher accuracy at far lower computational cost and much higher speed, making it markedly better suited to deployment on resource-constrained UAV platforms.

Per-Class Performance: The improvement in per-class mAP@50 is consistent across all 10 VisDrone categories, and is especially pronounced for challenging small classes such as “person” and “bicycle”.

| Model | Pedestrian | Person | Bicycle | Car | Van | Truck | Tricycle | Awning-tricycle | Bus | Motor |
|---|---|---|---|---|---|---|---|---|---|---|
| YOLOv11n | 37.4 | 29.1 | 10.2 | 76.9 | 40.1 | 32.4 | 23.0 | 12.4 | 49.7 | 39.8 |
| YOLOv11s | 44.5 | 34.5 | 15.4 | 80.2 | 45.6 | 38.0 | 29.2 | 16.9 | 60.1 | 45.7 |
| MSF-YOLO | 53.1 | 42.0 | 17.3 | 84.6 | 50.6 | 38.9 | 31.8 | 17.8 | 62.9 | 52.7 |

Generalization Experiments

To verify robustness, we evaluate MSF-YOLO on three additional aerial image datasets. The consistent improvements demonstrate its strong generalization capability beyond the VisDrone domain.

| Dataset | Model | P(%) | R(%) | mAP50(%) | mAP50-95(%) |
|---|---|---|---|---|---|
| RSOD | YOLOv8n | 91.2 | 90.7 | 92.7 | 65.4 |
| RSOD | YOLOv11n | 91.4 | 90.8 | 93.1 | 65.2 |
| RSOD | MSF-YOLO | 92.0 | 91.4 | 93.7 | 65.8 |
| DOTA | YOLOv8n | 67.7 | 53.4 | 56.6 | 34.2 |
| DOTA | YOLOv11n | 68.9 | 55.9 | 59.3 | 35.8 |
| DOTA | MSF-YOLO | 71.1 | 56.2 | 61.2 | 38.9 |
| HRSC2016 | YOLOv8n | 87.5 | 78.1 | 86.2 | 68.6 |
| HRSC2016 | YOLOv11n | 84.9 | 83.0 | 85.4 | 71.2 |
| HRSC2016 | MSF-YOLO | 87.7 | 81.4 | 87.5 | 73.7 |

Conclusion

In this work, we tackled the persistent challenges of small target detection in UAV imagery: sparse features, complex backgrounds, and dense target distributions. We presented MSF-YOLO, an enhanced algorithm based on YOLOv11n that strategically improves feature fusion and sampling. The PixelUnshuffle and DySample modules better preserve and reconstruct fine-grained detail, while the novel DIFM and ILFF modules work in concert to strengthen multi-scale target features and suppress background clutter through effective cross-layer fusion. Extensive experiments on the VisDrone dataset confirm that MSF-YOLO achieves state-of-the-art performance among lightweight detectors, with a 9.5-point gain in mAP@50 over the baseline while maintaining real-time inference speed. Its effectiveness on multiple aerial image benchmarks underscores strong generalization ability. These properties make MSF-YOLO a promising solution for practical UAV applications, including surveillance, logistics, and environmental monitoring. Future work will focus on further architectural optimization to reduce model complexity and ease deployment on UAV edge devices.
