In recent years, Unmanned Aerial Vehicle (UAV) technology has advanced rapidly, becoming integral to fields such as agricultural monitoring, urban planning, and disaster assessment thanks to its cost-effectiveness and versatility. However, detecting small targets in UAV-captured aerial imagery remains a significant challenge: such targets typically occupy few pixels, carry sparse features, sit against complex backgrounds, and appear in dense distributions, leading to high rates of missed detections and false alarms. Traditional object detection algorithms struggle under these conditions, frequently losing fine-grained details during processing, especially for tiny objects such as pedestrians or vehicles in cluttered environments. In this paper, we address these challenges with an enhanced algorithm based on YOLOv11n, termed MSF-YOLO, which integrates feature fusion and sampling optimizations to improve small target detection in UAV applications.
The core of our approach lies in mitigating information loss and enhancing feature representation for small targets. We introduce several key modifications. First, we incorporate a PixelUnshuffle module after convolutional layers to preserve spatial details by shifting information from the spatial to the channel dimension, preventing degradation during downsampling. Second, we employ a dynamic upsampling module, DySample, which adaptively learns sampling positions and weights to better reconstruct small target details. Third, we replace the residual connections in the C3k2 module with a Dual Input Feature Merge (DIFM) module, leveraging multi-scale convolution and attention mechanisms to amplify target features while suppressing irrelevant background noise. Finally, we design an Inter Layer Feature Fusion (ILFF) module to integrate features from different layers of the backbone network, enriching the input to the neck network with multi-scale contextual information. These innovations collectively enhance the model’s ability to detect small targets in complex UAV scenarios, as demonstrated by our experiments on the VisDrone dataset, where we achieve significant improvements in mean Average Precision (mAP) and recall.

Our work builds upon existing research in feature fusion and sampling strategies for object detection. Feature fusion techniques, such as Feature Pyramid Networks (FPN) and attention mechanisms like Squeeze-and-Excitation Networks (SENet), aim to combine multi-scale features but often fall short in handling the extreme sparsity of small targets. Similarly, simple sampling methods like nearest-neighbor interpolation can introduce artifacts, whereas advanced approaches like Content-Aware Reassembly of Features (CARAFE) improve detail recovery at the cost of increased computational complexity. In contrast, our MSF-YOLO algorithm balances efficiency and accuracy by integrating lightweight yet effective modules tailored for UAV imagery. For instance, the PixelUnshuffle operation transforms spatial information into channel data without loss, as described by the following process: given an input feature map $X$ with dimensions $(C, H, W)$, it is divided into sub-features along the spatial dimensions, resulting in four sub-maps of size $(C, H/2, W/2)$, which are then concatenated to produce an output of $(4C, H/2, W/2)$. This can be expressed mathematically as:
$$f_{0,0} = X[0:H:2, 0:W:2]$$
$$f_{0,1} = X[0:H:2, 1:W:2]$$
$$f_{1,0} = X[1:H:2, 0:W:2]$$
$$f_{1,1} = X[1:H:2, 1:W:2]$$
where the output is the concatenation along the channel dimension: $[f_{0,0}, f_{0,1}, f_{1,0}, f_{1,1}]$. This ensures that fine details critical for small targets in UAV images are retained. Similarly, the DySample module dynamically adjusts sampling based on the input features, using a point-sampling perspective to compute offsets. The sampling points $S$ are derived from a base grid $G$ and an offset $O$, formulated as:
$$G_{i,j} = \left( \frac{2i - (\text{scale} - 1)}{2 \cdot \text{scale}}, \frac{2j - (\text{scale} - 1)}{2 \cdot \text{scale}} \right)$$
$$O = 0.25 \cdot \text{linear}(X) \quad \text{or} \quad O = 0.5 \cdot \sigma(\text{linear}_1(X)) \cdot \text{linear}_2(X)$$
$$S = G + O$$
where $\text{linear}$ denotes a learned linear projection, and $\sigma$ is the sigmoid function used for dynamic scaling. This adaptability allows DySample to focus on regions containing small targets, improving reconstruction quality without substantial computational overhead.
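As a concrete illustration, both sampling operations above can be sketched in NumPy. This is a minimal sketch, not the actual implementations: the PixelUnshuffle rearrangement is exact, but DySample's learned linear offset branch is replaced here by a fixed offset array.

```python
import numpy as np

def pixel_unshuffle(x, r=2):
    """Rearrange a (C, H, W) map into (r*r*C, H/r, W/r) without losing data."""
    # Gather the r*r spatial sub-grids f_{i,j} and stack them along channels.
    subs = [x[:, i::r, j::r] for i in range(r) for j in range(r)]
    return np.concatenate(subs, axis=0)

def base_grid(scale):
    """Normalized offsets G_{i,j} of the scale*scale upsampled points."""
    idx = np.arange(scale)
    g = (2 * idx - (scale - 1)) / (2 * scale)
    return np.stack(np.meshgrid(g, g, indexing="ij"), axis=-1)  # (scale, scale, 2)

def sampling_points(grid, offsets):
    """S = G + O: shift each base grid point by its offset."""
    return grid + offsets

x = np.arange(2 * 4 * 4, dtype=np.float32).reshape(2, 4, 4)
y = pixel_unshuffle(x)                   # shape (8, 2, 2); a permutation of x
S = sampling_points(base_grid(2), 0.1)   # constant 0.1 stands in for linear(X)
print(y.shape, S.shape)
```

Because PixelUnshuffle is a pure rearrangement, every input value survives into the output, which is exactly why it preserves the fine detail that strided downsampling would discard.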
The DIFM module further refines feature integration by combining multi-scale depthwise convolutions (3×3, 5×5, and 7×7 kernels) with channel and spatial attention mechanisms. Given two input features, they are concatenated and processed through a pointwise (1×1) convolution for dimensionality reduction, followed by attention modules that emphasize relevant features. The output is computed as:
$$F_{\text{concat}} = \text{Concat}(X_1, X_2)$$
$$F_{\text{reduced}} = \text{Conv}_{1\times 1}(F_{\text{concat}})$$
$$F_{\text{channel}} = \text{CAM}(F_{\text{reduced}})$$
$$F_{\text{spatial}} = \text{SAM}(\text{DWConv}_{3\times 3}(F_{\text{reduced}}) + \text{DWConv}_{5\times 5}(F_{\text{reduced}}) + \text{DWConv}_{7\times 7}(F_{\text{reduced}}))$$
$$F_{\text{output}} = \text{Conv}_{1\times 1}(F_{\text{channel}} + F_{\text{spatial}})$$
where CAM and SAM denote channel and spatial attention modules, respectively. This design strengthens small-target features while minimizing background interference, which is crucial in UAV scenarios where targets are often partially obscured. Additionally, the ILFF module fuses features from different backbone layers by upsampling the lower-resolution maps, concatenating them with higher-resolution ones, and applying attention mechanisms for normalization. The process involves:
$$F_{\text{upsampled}} = \text{Upsample}(F_{\text{low-res}})$$
$$F_{\text{merged}} = \text{Concat}(F_{\text{upsampled}}, F_{\text{high-res}})$$
$$F_{\text{enhanced}} = \text{Conv}_{1\times 1}(\text{CAM}(F_{\text{merged}}) + \text{SAM}(F_{\text{merged}}))$$
This enriches the neck network with diverse multi-scale features, improving detection accuracy for small targets in UAV imagery.
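The attention-gated fusion pattern shared by DIFM and ILFF can be sketched as follows. This is a simplified NumPy sketch under stated assumptions: the learned attention sub-networks and the final 1×1 convolution are replaced by parameter-free sigmoid gates, and nearest-neighbour repetition stands in for the upsampler.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(f):
    # Simplified CAM: gate each channel by its squashed global average
    # (the real module would pass this through a learned MLP).
    w = sigmoid(f.mean(axis=(1, 2)))      # (C,)
    return f * w[:, None, None]

def spatial_attention(f):
    # Simplified SAM: gate each location by its squashed cross-channel mean.
    w = sigmoid(f.mean(axis=0))           # (H, W)
    return f * w[None, :, :]

def ilff_fuse(low_res, high_res):
    # Upsample the low-resolution map, concatenate along channels, then
    # apply the additive CAM + SAM combination from the equations above.
    up = low_res.repeat(2, axis=1).repeat(2, axis=2)
    merged = np.concatenate([up, high_res], axis=0)
    return channel_attention(merged) + spatial_attention(merged)

lo = np.random.rand(4, 8, 8).astype(np.float32)    # deeper backbone feature
hi = np.random.rand(4, 16, 16).astype(np.float32)  # shallower backbone feature
print(ilff_fuse(lo, hi).shape)  # (8, 16, 16)
```

The key design choice the sketch illustrates is that the two attention branches run in parallel on the merged feature and are summed, so channel-wise and location-wise emphasis are combined before the final projection.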
To evaluate our MSF-YOLO algorithm, we conducted extensive experiments on the VisDrone dataset, which includes 6471 training images, 548 validation images, and 1610 test images across 10 categories such as pedestrians, cars, and bicycles. We compared our method against baseline models such as YOLOv11n and other state-of-the-art detectors, using precision (P), recall (R), mAP at IoU=0.5 (mAP50), and mAP averaged over IoU thresholds from 0.5 to 0.95 (mAP50-95). Our experimental setup used a Tesla T4 GPU and an Intel Platinum 8255C CPU, with training parameters detailed in Table 1. The results, summarized in Table 2 and Table 3, show that MSF-YOLO outperforms the baseline: mAP50 increases by 9.5 percentage points and mAP50-95 by 7.1, while precision and recall improve by 6.8 and 8.8 points, respectively. Notably, the parameter count grows by only 0.2 million, keeping the model efficient enough for UAV deployment.
| Parameter | Value |
|---|---|
| Epochs | 400 |
| Batch Size | 9 |
| Image Size | 640×640 |
| Workers | 4 |
| Optimizer | SGD |
| Learning Rate | 0.01 |
| Momentum | 0.937 |
| Weight Decay | 0.0005 |
| Patience | 30 |
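For orientation, the Table 1 settings map onto the Ultralytics training interface roughly as follows. This is a sketch, not our exact training script: the dataset YAML path and pretrained weight file are placeholders, and the custom MSF-YOLO modules would require a modified model definition before this call reproduces our results.

```python
from ultralytics import YOLO

# Placeholder baseline weights; MSF-YOLO itself needs a custom model YAML.
model = YOLO("yolo11n.pt")
model.train(
    data="VisDrone.yaml",   # placeholder dataset config path
    epochs=400,
    batch=9,
    imgsz=640,
    workers=4,
    optimizer="SGD",
    lr0=0.01,
    momentum=0.937,
    weight_decay=0.0005,
    patience=30,            # early stopping after 30 stagnant epochs
)
```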
In ablation studies, we analyzed the impact of individual modules. For example, adding PixelUnshuffle alone improved precision and recall by 1.4 percentage points each and mAP50 by 1.9 points, despite the increased computational load. DySample alone slightly reduced accuracy (mAP50 fell by 1.4 points), but it contributed consistent gains once combined with the other modules, and the full integration of all components yielded the best results. The ILFF module alone boosted mAP50 by 6.5 points, highlighting the importance of leveraging shallow features for small target detection. These findings are detailed in Table 2 and Table 3, which compare single- and multi-module configurations. Precision and recall are defined as:
$$P = \frac{TP}{TP + FP}$$
$$R = \frac{TP}{TP + FN}$$
where $TP$ represents true positives, $FP$ false positives, and $FN$ false negatives. The average precision (AP) is computed as the area under the precision-recall curve, and mAP is the mean over all classes:
$$AP = \int_0^1 P(R) dR$$
$$mAP = \frac{1}{n} \sum_{i=1}^n AP(i)$$
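The metric definitions above can be checked numerically with a short sketch. The counts and P-R points below are made-up illustrative values, and the trapezoidal rule is used as a simple stand-in for the interpolated AP that detection toolkits typically compute.

```python
import numpy as np

def precision_recall(tp, fp, fn):
    # P = TP / (TP + FP), R = TP / (TP + FN)
    return tp / (tp + fp), tp / (tp + fn)

def average_precision(recalls, precisions):
    # AP as the area under the P-R curve (trapezoidal approximation).
    r = np.asarray(recalls, dtype=float)
    p = np.asarray(precisions, dtype=float)
    order = np.argsort(r)
    return float(np.trapz(p[order], r[order]))

# Illustrative counts: 80 correct detections, 20 spurious, 40 missed.
p, r = precision_recall(tp=80, fp=20, fn=40)
ap = average_precision([0.0, 0.5, 1.0], [1.0, 0.8, 0.5])
print(round(p, 3), round(r, 3), round(ap, 3))
```

mAP then follows by averaging such AP values over the ten VisDrone classes.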
Our method’s superiority is further evidenced by per-category performance on VisDrone, shown in Table 4, where improvements are consistent across all classes and most pronounced for challenging categories like “person” and “bicycle”. For instance, mAP50 for “person” increased from 29.1% with YOLOv11n to 42.0% with MSF-YOLO, demonstrating enhanced feature extraction for small, sparse targets in UAV imagery.
| Modules | P (%) | R (%) | mAP50 (%) | mAP50-95 (%) | Params (M) | GFLOPs | FPS | Memory (MB) |
|---|---|---|---|---|---|---|---|---|
| YOLOv11n | 46.5 | 34.6 | 35.1 | 20.6 | 2.6 | 6.3 | 214 | 94.49 |
| + DySample | 44.1 | 34.1 | 33.7 | 19.7 | 2.6 | 6.3 | 213 | 94.49 |
| + PixelUnshuffle | 47.9 | 36.0 | 37.0 | 21.9 | 2.6 | 10.7 | 181 | 106.3 |
| + C3k2_DIFM | 45.4 | 34.1 | 34.4 | 20.0 | 2.7 | 7.6 | 205 | 96.64 |
| + ILFF | 52.6 | 40.2 | 41.6 | 25.5 | 2.7 | 8.5 | 197 | 150.5 |
| Modules | P (%) | R (%) | mAP50 (%) | mAP50-95 (%) | Params (M) | GFLOPs | FPS | Memory (MB) |
|---|---|---|---|---|---|---|---|---|
| DySample + PixelUnshuffle | 47.9 | 36.1 | 36.7 | 21.9 | 2.6 | 10.7 | 187 | 106.3 |
| DySample + C3k2_DIFM | 45.8 | 35.2 | 35.2 | 20.6 | 2.7 | 7.6 | 202 | 96.64 |
| DySample + ILFF | 52.9 | 39.9 | 41.7 | 25.6 | 2.7 | 8.5 | 195 | 150.5 |
| PixelUnshuffle + C3k2_DIFM + ILFF | 48.8 | 36.7 | 37.3 | 22.3 | 2.7 | 12.0 | 181 | 105.23 |
| DySample + C3k2_DIFM + ILFF | 52.2 | 42.1 | 43.5 | 26.3 | 2.8 | 13.2 | 199 | 152.7 |
| PixelUnshuffle + DySample + ILFF | 54.1 | 41.8 | 43.7 | 26.9 | 2.7 | 12.6 | 181 | 167.65 |
| All Modules | 53.3 | 43.4 | 44.6 | 27.7 | 2.8 | 17.5 | 172 | 180.5 |
To assess generalization, we tested MSF-YOLO on additional datasets, including RSOD, DOTA, and HRSC2016, which cover diverse aerial scenarios. The results, presented in Table 6, confirm that our algorithm maintains high performance across different target types and environments. For example, on RSOD, mAP50 reached 93.7%, outperforming YOLOv8n and YOLOv11n by 1.0 and 0.6 percentage points, respectively. On DOTA, which contains densely packed small targets, mAP50 improved by 4.6 points over YOLOv8n, demonstrating robustness in complex scenes. Similarly, on HRSC2016, a ship detection benchmark, MSF-YOLO achieved an mAP50 of 87.5%, surpassing both baselines. These outcomes underscore the algorithm’s adaptability and effectiveness for real-world UAV applications.
| Model | P (%) | R (%) | mAP50 (%) | mAP50-95 (%) | Params (M) | GFLOPs | FPS | Memory (MB) |
|---|---|---|---|---|---|---|---|---|
| YOLOv11n | 46.5 | 34.6 | 35.1 | 20.6 | 2.6 | 6.3 | 214 | 94.49 |
| YOLOv11s | 53.9 | 39.7 | 41.2 | 24.9 | 9.4 | 21.3 | 156 | 155.76 |
| YOLOv8n | 44.3 | 35.6 | 35.2 | 20.6 | 3.0 | 8.1 | 212 | 96.64 |
| YOLOv10n | 44.3 | 33.3 | 33.7 | 19.7 | 2.7 | 8.2 | 213 | 96.64 |
| MSF-YOLO (Ours) | 53.3 | 43.4 | 44.6 | 27.7 | 2.8 | 17.5 | 172 | 180.5 |
| Model | Pedestrian | Person | Bicycle | Car | Van | Truck | Tricycle | Awning-Tricycle | Bus | Motor |
|---|---|---|---|---|---|---|---|---|---|---|
| YOLOv11n | 37.4 | 29.1 | 10.2 | 76.9 | 40.1 | 32.4 | 23.0 | 12.4 | 49.7 | 39.8 |
| YOLOv11s | 44.5 | 34.5 | 15.4 | 80.2 | 45.6 | 38.0 | 29.2 | 16.9 | 60.1 | 45.7 |
| MSF-YOLO (Ours) | 53.1 | 42.0 | 17.3 | 84.6 | 50.6 | 38.9 | 31.8 | 17.8 | 62.9 | 52.7 |
| Dataset | Model | P (%) | R (%) | mAP50 (%) | mAP50-95 (%) |
|---|---|---|---|---|---|
| RSOD | YOLOv8n | 91.2 | 90.7 | 92.7 | 65.4 |
| RSOD | YOLOv11n | 91.4 | 90.8 | 93.1 | 65.2 |
| RSOD | MSF-YOLO (Ours) | 92.0 | 91.4 | 93.7 | 65.8 |
| DOTA | YOLOv8n | 67.7 | 53.4 | 56.6 | 34.2 |
| DOTA | YOLOv11n | 68.9 | 55.9 | 59.3 | 35.8 |
| DOTA | MSF-YOLO (Ours) | 71.1 | 56.2 | 61.2 | 38.9 |
| HRSC2016 | YOLOv8n | 87.5 | 78.1 | 86.2 | 68.6 |
| HRSC2016 | YOLOv11n | 84.9 | 83.0 | 85.4 | 71.2 |
| HRSC2016 | MSF-YOLO (Ours) | 87.7 | 81.4 | 87.5 | 73.7 |
Visual analysis of detection results on VisDrone further validates our approach. In scenarios such as distant overhead views, uneven lighting, and nighttime conditions, MSF-YOLO consistently identified small targets that the baseline models missed, such as occluded vehicles or pedestrians in low-light environments. In urban street scenes with heavy occlusion, our algorithm reduced false negatives and improved localization accuracy, highlighting the practical benefits of the feature fusion enhancements for UAV operations. These comparisons show that MSF-YOLO maintains robust performance across diverse conditions, making it suitable for real-time UAV applications.
In conclusion, we have presented MSF-YOLO, an improved YOLOv11n-based algorithm for small target detection in UAV imagery. By integrating the PixelUnshuffle, DySample, DIFM, and ILFF modules, it effectively addresses feature sparsity and information loss, achieving significant gains in detection accuracy without substantial computational overhead. Our experiments on multiple datasets confirm the algorithm’s accuracy and generalization capability. Future work will focus on further optimizing the model architecture to reduce deployment complexity, enabling efficient implementation on resource-constrained UAV devices. We believe these contributions will advance aerial object detection and support broader adoption of UAV technology in critical domains.
