Research on UAV Small Target Detection Using Multi-Scale Attention Mechanism

In the domains of agricultural monitoring, urban planning, and infrastructure inspection, Unmanned Aerial Vehicles (UAVs or drones) have become indispensable tools, capturing vast and detailed imagery from aerial perspectives. However, the analysis of these UAV-captured images presents formidable challenges for object detection systems. Targets often appear at drastically varying scales, with small objects constituting a minuscule fraction of the total pixels. The scenes are frequently cluttered with complex backgrounds, and objects are prone to occlusion and dense overlap. These factors collectively lead to significant rates of missed detections and false alarms, severely limiting the reliability of UAV-based applications. Enhancing detection accuracy under such demanding conditions is therefore a critical research focus.

To address these inherent difficulties in UAV imagery, this work presents STF-YOLOv8n, a novel object detection algorithm built upon the efficient YOLOv8n architecture. The core innovation lies in the integration of a multi-scale attention mechanism designed to strengthen feature representation for small and obscured targets. The proposed method systematically tackles information loss during downsampling, enhances focus on discriminative features amidst clutter, and refines bounding box localization. Specifically, we introduce a Space-to-Depth Convolution (SPDConv) module to preserve fine-grained spatial details. We design a Top-K Sparse Attention mechanism with integrated Spatial Attention (TKESA) to dynamically prioritize relevant features across scales and positions. Furthermore, we adopt a dynamic bounding box regression loss, Inner-FocIoU, to improve the stability and precision of localization, particularly for small targets. Comprehensive evaluations on benchmark UAV datasets demonstrate that STF-YOLOv8n achieves superior detection accuracy compared to baseline and state-of-the-art methods while maintaining competitive inference speed, showcasing its strong potential for real-world UAV deployment.

1. Methodology: The STF-YOLOv8n Architecture

The proposed STF-YOLOv8n framework enhances the standard YOLOv8n pipeline through targeted modifications in its backbone feature extractor, feature fusion neck, and optimization objective. The overarching goal is to empower the model to better handle the unique challenges posed by UAV drone imagery: multi-scale small objects, complex backgrounds, and frequent occlusions.

1.1 Enhancing Feature Extraction with SPDConv

Traditional convolution and pooling operations in CNN backbones often cause irreversible loss of the fine-grained information crucial for detecting small targets in UAV footage. To mitigate this, we replace standard strided convolutions with a Space-to-Depth Convolution (SPDConv) module at strategic points in the backbone. The SPDConv operates in two stages. First, the Space-to-Depth (SPD) layer performs non-overlapping downsampling by reorganizing spatial blocks into channel dimensions. For a scale factor of 2, an input feature map \(X\) with dimensions \((S, S, C)\) is split into four sub-features of size \((S/2, S/2, C)\), which are then concatenated along the channel axis to produce an intermediate feature \(X'\) of size \((S/2, S/2, 4C)\). This process preserves all original information from the spatial domain. Subsequently, a non-strided convolution layer (stride=1) with \(D\) filters transforms \(X'\) into the final output \(X''\) of size \((S/2, S/2, D)\), where \(D < 4C\). Using a non-strided convolution avoids the asymmetric sampling and information loss associated with strided operations, allowing more nuanced feature learning. The integration of SPDConv helps the backbone retain critical details of small targets that would otherwise be diluted during downsampling.
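The SPD rearrangement itself can be sketched in a few lines of NumPy; the function name and the channels-last layout here are illustrative choices, and the non-strided convolution that follows the rearrangement is omitted:

```python
import numpy as np

def space_to_depth(x, scale=2):
    """Rearrange non-overlapping scale x scale spatial blocks into channels.

    x: array of shape (S, S, C); returns (S//scale, S//scale, C*scale**2).
    No information is discarded: every input element appears in the output.
    """
    S, _, C = x.shape
    assert S % scale == 0, "spatial size must be divisible by the scale factor"
    # Slice the four (for scale=2) interleaved sub-grids and stack on channels.
    subs = [x[i::scale, j::scale, :] for i in range(scale) for j in range(scale)]
    return np.concatenate(subs, axis=-1)

x = np.arange(4 * 4 * 3, dtype=np.float32).reshape(4, 4, 3)
x_spd = space_to_depth(x)  # shape (2, 2, 12): all 48 values preserved
```

Because the output is a pure permutation of the input, the subsequent stride-1 convolution sees the full spatial signal, unlike a strided convolution that samples only a subset of positions.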

1.2 Refining Feature Fusion with TKESA Attention

The feature pyramid network (neck) is responsible for fusing multi-scale features. To make this fusion more effective and focused, we introduce the TKESA (Top-K Sparse Attention with Spatial Attention) module. TKESA combines channel-wise sparse attention with spatial relation modeling to suppress irrelevant background noise and emphasize crucial target features across different resolutions, a vital capability for analyzing complex UAV drone scenes.

The TKESA module first generates Query (Q), Key (K), and Value (V) projections from the input feature map \(F_{input}\). To incorporate richer local context, these projections are further refined using depthwise separable convolutions, resulting in \(Q_1, K_1, V_1\). The core innovation is the application of a Top-K selection operator on the attention scores. Instead of using all pairwise similarities between Q and K, which can be noisy, TKESA retains only the largest \(K\) scores for aggregation. This forces the model to focus on the most salient feature correspondences. We employ multiple sparse attention heads with different sparsity ratios (e.g., Top-K(C/2), Top-K(2C/3), etc.) to capture a diverse set of feature relationships. The outputs from these sparse heads are adaptively weighted by learnable coefficients \(\lambda_i\) and summed.

$$Attn_i = \text{Softmax}(\text{Top-K}_i(Q_1 \cdot K_1^T)) \cdot V_1 \quad \text{for } i \in \{1,2,3,4\}$$
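A minimal NumPy sketch of a single sparse head illustrates the Top-K operator above; the depthwise refinement of \(Q, K, V\), the multi-head structure, and the learnable \(\lambda_i\) weighting are omitted, and the flattened-token layout is an assumption for clarity:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def topk_sparse_attention(Q, K, V, k):
    """Keep only the k largest scores per query before softmax aggregation.

    Q, K, V: (N, C) arrays of flattened tokens. Non-selected scores are
    masked to -inf so they receive exactly zero weight after the softmax.
    """
    scores = Q @ K.T                                    # (N, N) pairwise similarities
    thresh = np.sort(scores, axis=-1)[:, -k][:, None]   # k-th largest score per row
    masked = np.where(scores >= thresh, scores, -np.inf)
    return softmax(masked, axis=-1) @ V
```

Setting `k = N` recovers ordinary dense attention, which is a convenient sanity check; smaller `k` zeroes out the weakest correspondences, the noise-suppression effect the module relies on.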

In parallel, a Position Attention Module (PAM) captures long-range spatial dependencies. It computes a spatial attention map by correlating features across all positions, allowing the model to understand contextual relationships crucial for resolving occlusions common in UAV drone imagery. The final output of the TKESA module is a weighted fusion of the sparse attention outputs and the spatially-enhanced features, followed by a pointwise convolution to restore channel dimensions.

$$\text{Output} = \text{Conv}_{1\times1}\left( \sum_{i=1}^{4} \lambda_i \cdot Attn_i + \text{PAM}(F_{input}) \right) + F_{input}$$

This design enables TKESA to dynamically focus on the most informative regions and channels while maintaining an awareness of the global scene structure, significantly boosting feature discriminability.
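The position-attention branch can be reduced to a short sketch as well; this is a stripped-down version that omits the PAM's query/key/value projection convolutions and learnable residual scale, keeping only the all-pairs spatial affinity and aggregation:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def position_attention(feat):
    """Correlate every spatial position with every other (simplified PAM).

    feat: (C, H, W) feature map; returns the same shape with a residual add.
    """
    C, H, W = feat.shape
    flat = feat.reshape(C, H * W)               # (C, HW) column per position
    affinity = softmax(flat.T @ flat, axis=-1)  # (HW, HW) position-to-position weights
    out = flat @ affinity.T                     # each position aggregates all others
    return feat + out.reshape(C, H, W)          # residual connection
```

The (HW, HW) affinity map is what lets a partially occluded object borrow evidence from distant, visible context, at a quadratic cost in the number of positions.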

1.3 Optimizing Localization with Inner-FocIoU Loss

Accurate bounding box regression is paramount, especially for small UAV drone targets where slight misalignment can cause IoU to drop sharply. The standard CIoU loss considers overlap, center distance, and aspect ratio but can be unstable for small objects and imbalanced samples. Our proposed Inner-FocIoU loss combines the advantages of InnerIoU and FocalerIoU.

InnerIoU introduces auxiliary bounding boxes obtained by scaling the ground-truth and predicted boxes about their centers by a factor \(ratio\). The IoU is then calculated from the intersection of these inner regions (\(inter\)) and the correspondingly scaled union area (\(union\)).

$$inter = (\min(b_r^{gt*}, b_r) - \max(b_l^{gt*}, b_l)) \times (\min(b_b^{gt*}, b_b) - \max(b_t^{gt*}, b_t))$$

$$union = (w^{gt} \cdot h^{gt}) \cdot ratio^2 + (w \cdot h) \cdot ratio^2 – inter$$

$$\text{IoU}_{\text{inner}} = \frac{inter}{union}$$

By adjusting the \(ratio\), the loss can adapt its focus: a smaller ratio tightens the regression focus for high-IoU cases, while a larger ratio provides a more forgiving target for low-IoU, difficult samples often encountered with distant UAV drone targets.
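A plain-Python sketch of the Inner-IoU computation, assuming center-size box coordinates \((c_x, c_y, w, h)\) and clamping negative overlaps to zero (details the equations above leave implicit):

```python
def inner_iou(box_gt, box_pred, ratio=0.75):
    """Inner-IoU between two boxes given as (cx, cy, w, h).

    Both boxes are shrunk (ratio < 1) or enlarged (ratio > 1) about their
    centers before the standard overlap/union computation.
    """
    def edges(box):
        cx, cy, w, h = box
        return (cx - w * ratio / 2, cx + w * ratio / 2,
                cy - h * ratio / 2, cy + h * ratio / 2)

    l1, r1, t1, b1 = edges(box_gt)
    l2, r2, t2, b2 = edges(box_pred)
    inter_w = max(0.0, min(r1, r2) - max(l1, l2))   # clamp: boxes may not overlap
    inter_h = max(0.0, min(b1, b2) - max(t1, t2))
    inter = inter_w * inter_h
    union = (box_gt[2] * box_gt[3]) * ratio ** 2 \
          + (box_pred[2] * box_pred[3]) * ratio ** 2 - inter
    return inter / union
```

Note that two identical boxes still yield an Inner-IoU of 1 for any \(ratio\), since both the intersection and union scale by the same \(ratio^2\) factor.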

FocalerIoU addresses sample imbalance by remapping the IoU value to a focused range \([d, u]\), directing gradient effort towards medium-difficulty regression samples.

$$\text{IoU}_{\text{focaler}} =
\begin{cases}
0, & \text{IoU} < d \\
\frac{\text{IoU} - d}{u - d}, & d \leq \text{IoU} \leq u \\
1, & \text{IoU} > u
\end{cases}$$

$$\mathcal{L}_{\text{FocalerIoU}} = 1 - \text{IoU}_{\text{focaler}}$$

The Inner-FocIoU loss integrates these concepts, using the inner-IoU calculation within the focal framework: \(\mathcal{L}_{\text{Inner-FocIoU}} = 1 - \text{IoU}_{\text{focaler}}(\text{IoU}_{\text{inner}})\). This hybrid loss provides more precise localization signals for small targets and balanced learning across easy and hard examples, leading to more stable and accurate bounding box predictions for UAV detection tasks.
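Given an already-computed \(\text{IoU}_{\text{inner}}\) value, the focal remapping and final loss reduce to a few lines; the default \(d\) and \(u\) values below are illustrative hyperparameters, not the paper's settings:

```python
def focaler_remap(iou, d=0.0, u=0.95):
    """Piecewise-linear remapping of an IoU value onto [0, 1] over [d, u]."""
    if iou < d:
        return 0.0
    if iou > u:
        return 1.0
    return (iou - d) / (u - d)

def inner_fociou_loss(iou_inner, d=0.0, u=0.95):
    # Apply the focal remap to the inner-IoU value, then convert to a loss.
    return 1.0 - focaler_remap(iou_inner, d, u)
```

The clipping at \(d\) and \(u\) flattens the gradient for the easiest and hardest samples, concentrating the learning signal on the medium-difficulty regressions in between.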

2. Experimental Results and Analysis

We evaluated the proposed STF-YOLOv8n model on the challenging VisDrone2019-DET dataset, a large-scale benchmark collected by UAVs under various scenarios. The dataset contains 10,209 images annotated with multiple categories. For our evaluation, we focused on three representative categories: People (small, dense targets), Car (medium-scale vehicles), and Bcar (large vehicles like buses and trucks), which collectively represent the core challenges of UAV-based monitoring. Models were trained for 200 epochs with an image size of 640×640 pixels. Standard metrics include Precision (P), Recall (R), mean Average Precision at IoU=0.5 (mAP50), and mAP over IoU thresholds from 0.5 to 0.95 (mAP50-95). We also report parameters, GFLOPs, and inference speed (FPS).

2.1 Ablation Study on Core Components

Ablation experiments were conducted to validate the contribution of each proposed component. The baseline is the standard YOLOv8n model.

| SPDConv | TKESA | Inner-FocIoU | mAP50 (%) | mAP50-95 (%) | Params (M) | GFLOPs |
|:-------:|:-----:|:------------:|-----------|--------------|------------|--------|
|         |       |              | 46.7        | 26.8        | 3.01 | 8.2  |
| ✓       |       |              | 48.1 (+1.4) | 27.7 (+0.9) | 3.25 | 10.7 |
|         | ✓     |              | 50.7 (+4.0) | 29.8 (+3.0) | 2.91 | 12.1 |
|         |       | ✓            | 47.9 (+1.2) | 27.6 (+0.8) | 3.01 | 8.2  |
| ✓       | ✓     |              | 51.4 (+4.7) | 30.0 (+3.2) | 2.86 | 14.2 |
| ✓       | ✓     | ✓            | 52.5 (+5.8) | 30.5 (+3.7) | 2.86 | 14.2 |

The results clearly demonstrate the effectiveness of each module. SPDConv alone improves mAP50 by 1.4%, validating its role in preserving fine details for small UAV drone targets. The TKESA module delivers the most significant gain, boosting mAP50 by 4.0%, which underscores the importance of intelligent feature fusion and attention in complex aerial scenes. The Inner-FocIoU loss provides a consistent, parameter-free improvement of 1.2% in mAP50. Notably, when combined, TKESA and SPDConv achieve a synergistic effect, and the full STF-YOLOv8n model (integrating all three) achieves the best performance with a 5.8% mAP50 increase over the baseline, while even reducing the total parameter count.

2.2 Comparison with State-of-the-Art Methods

We compared STF-YOLOv8n against several popular object detectors on the VisDrone2019 test set.

| Model | mAP50 (%) | mAP50-95 (%) | Params (M) | GFLOPs | FPS |
|-------|-----------|--------------|------------|--------|-----|
| Faster R-CNN | 33.7 | 14.6 | 193.85 | 171.5 | - |
| RetinaNet | 36.3 | 15.8 | 37.72 | 91.8 | - |
| YOLOv5n | 46.2 | 26.5 | 2.50 | 7.1 | 129 |
| YOLOv8n (Baseline) | 46.7 | 26.8 | 3.01 | 8.2 | 185 |
| YOLOv9s | 48.4 | 27.5 | 7.19 | 26.6 | 86 |
| YOLOv10n | 46.0 | 26.8 | 2.32 | 6.7 | 180 |
| YOLOv11n | 46.9 | 26.9 | 2.59 | 6.2 | 108 |
| STF-YOLOv8n (Ours) | 52.5 | 30.5 | 2.86 | 14.2 | 95 |

STF-YOLOv8n outperforms all compared models in detection accuracy (mAP50 and mAP50-95) by a considerable margin. It surpasses the popular YOLOv8n baseline by 5.8% in mAP50 and even exceeds the more computationally heavy YOLOv9s. While its GFLOPs are higher than the most lightweight variants (YOLOv5n, YOLOv10n), its parameter count remains very low, and its inference speed of 95 FPS is more than sufficient for real-time processing on UAV drone platforms. This demonstrates an excellent trade-off between accuracy and efficiency for aerial detection tasks.

2.3 Generalization to Challenging Conditions

To assess robustness, we tested STF-YOLOv8n on the HazyDet dataset, which contains images captured under adverse weather conditions like haze and low contrast, simulating real-world UAV drone operational challenges.

| Model | mAP50 (%) | mAP50-95 (%) | FPS |
|-------|-----------|--------------|-----|
| YOLOv8n | 63.6 | 44.0 | 146 |
| STF-YOLOv8n | 66.5 (+2.9) | 45.1 (+1.1) | 103 |

The results confirm the strong generalization ability of our method. STF-YOLOv8n maintains a significant accuracy advantage over the baseline under degraded visibility, showing that its enhanced feature extraction and attention mechanisms remain effective beyond clear-weather scenarios. Its sustained frame rate of 103 FPS further indicates practical utility for UAVs operating in diverse environments.

3. Conclusion

This paper presented STF-YOLOv8n, an advanced object detection algorithm specifically designed for the arduous task of analyzing UAV imagery. By integrating a Space-to-Depth Convolution (SPDConv) module, a novel Top-K Sparse Attention mechanism with Spatial Attention (TKESA), and a dynamic Inner-FocIoU loss function, the model effectively addresses the core challenges of small target detection, complex background clutter, and object occlusion prevalent in aerial views. Extensive experiments on the VisDrone2019-DET benchmark demonstrate that STF-YOLOv8n achieves state-of-the-art detection accuracy, outperforming the YOLOv8n baseline by 5.8% in mAP50 as well as other contemporary models, while retaining real-time inference capabilities suitable for UAV deployment. Furthermore, its validated robustness on the HazyDet dataset underscores its applicability in real-world, non-ideal conditions. Future work will focus on further optimizing the model for extremely tiny distant targets and exploring additional lightweighting techniques to enhance its suitability for deployment on resource-constrained UAV edge devices.
