Target detection technology based on onboard vision is fundamental for computer vision tasks in surveying drones, enabling critical functions like recognition, positioning, and tracking. Despite advancements, surveying UAVs face persistent challenges: small target sizes, extreme scale variations, mutual occlusion in complex scenes, and computational constraints on edge devices. To address these, we propose an enhanced YOLOv8n architecture integrating three novel components for efficient multi-scale object detection.

Algorithmic Improvements
SPDs-Conv Module
Traditional strided convolutions in YOLOv8n discard fine-grained features crucial for small objects in surveying UAV imagery. SPDs-Conv replaces the strided operation with a space-to-depth (SPD) transformation followed by a depthwise separable convolution. For an input feature map $X \in \mathbb{R}^{S \times S \times C_1}$, SPD decomposition with scale factor 2 generates four sub-maps:
$$
\begin{aligned}
f_{0,0} &= X[0:S:2, 0:S:2], \\
f_{1,0} &= X[1:S:2, 0:S:2], \\
f_{0,1} &= X[0:S:2, 1:S:2], \\
f_{1,1} &= X[1:S:2, 1:S:2].
\end{aligned}
$$
Concatenation along depth yields $X' \in \mathbb{R}^{\frac{S}{2} \times \frac{S}{2} \times 4C_1}$. Depthwise separable convolution then reduces dimensions to $X'' \in \mathbb{R}^{\frac{S}{2} \times \frac{S}{2} \times C_2}$, preserving spatial details while minimizing parameters. This enhancement is vital for surveying drones capturing small ground objects.
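The SPD transform and the subsequent depthwise separable convolution can be sketched in PyTorch as follows. This is a minimal illustration of the technique described above, not the authors' exact implementation; the class name, layer sizes, and the BatchNorm/SiLU choices are assumptions.

```python
import torch
import torch.nn as nn

class SPDsConv(nn.Module):
    """Sketch: space-to-depth (scale 2) followed by a depthwise
    separable convolution (layer choices are illustrative)."""
    def __init__(self, c1, c2):
        super().__init__()
        # Depthwise separable conv: per-channel 3x3 conv, then 1x1 pointwise.
        self.dw = nn.Conv2d(4 * c1, 4 * c1, 3, padding=1, groups=4 * c1, bias=False)
        self.pw = nn.Conv2d(4 * c1, c2, 1, bias=False)
        self.bn = nn.BatchNorm2d(c2)
        self.act = nn.SiLU()

    def forward(self, x):
        # SPD: the four strided sub-maps f_{0,0}, f_{1,0}, f_{0,1}, f_{1,1}
        # concatenated along the channel axis -> (B, 4*C1, S/2, S/2).
        x = torch.cat([x[..., 0::2, 0::2],
                       x[..., 1::2, 0::2],
                       x[..., 0::2, 1::2],
                       x[..., 1::2, 1::2]], dim=1)
        return self.act(self.bn(self.pw(self.dw(x))))

x = torch.randn(1, 64, 32, 32)
y = SPDsConv(64, 128)(x)
print(y.shape)  # torch.Size([1, 128, 16, 16])
```

Note that spatial resolution halves without any information being discarded: every input pixel survives in one of the four sub-maps, unlike a stride-2 convolution.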
C2f_SKAttention Module
To handle extreme scale variations in surveying UAV perspectives, we integrate SKAttention into C2f blocks. For feature map $X_i$, dual-branch convolution produces $X^1_i$ (3×3 kernel) and $X^2_i$ (5×5 kernel). Fused features $X'_i = X^1_i + X^2_i$ undergo channel-wise statistics aggregation via global average pooling:
$$
s_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} X'_c(i,j)
$$
Channel-wise attention weights are dynamically computed as:
$$
a_c = \frac{e^{\mathbf{A}_c \mathbf{z}}}{e^{\mathbf{A}_c \mathbf{z}} + e^{\mathbf{B}_c \mathbf{z}}}, \quad b_c = \frac{e^{\mathbf{B}_c \mathbf{z}}}{e^{\mathbf{A}_c \mathbf{z}} + e^{\mathbf{B}_c \mathbf{z}}}
$$
where $\mathbf{z} = \text{FC}(s)$. Output features $V_c = a_c \cdot X^1_c + b_c \cdot X^2_c$ adaptively emphasize optimal receptive fields for multi-scale targets.
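The selective-kernel mechanism above can be sketched compactly in PyTorch. The class below is a simplified illustration under stated assumptions (reduction ratio, ReLU after the FC layer, and plain convolutions for the two branches are our choices, not necessarily the paper's); the two-way softmax over the stacked logits reproduces the $a_c + b_c = 1$ constraint of the equations.

```python
import torch
import torch.nn as nn

class SKAttention(nn.Module):
    """Sketch of selective-kernel attention over a 3x3 and a 5x5 branch."""
    def __init__(self, c, reduction=4):
        super().__init__()
        self.branch3 = nn.Conv2d(c, c, 3, padding=1, bias=False)
        self.branch5 = nn.Conv2d(c, c, 5, padding=2, bias=False)
        d = max(c // reduction, 8)
        self.fc = nn.Linear(c, d)             # z = FC(s)
        self.A = nn.Linear(d, c, bias=False)  # logits for the 3x3 branch
        self.B = nn.Linear(d, c, bias=False)  # logits for the 5x5 branch

    def forward(self, x):
        u1, u2 = self.branch3(x), self.branch5(x)
        s = (u1 + u2).mean(dim=(2, 3))        # global average pooling -> s_c
        z = torch.relu(self.fc(s))
        # Softmax across the two branches yields a_c and b_c with a_c + b_c = 1.
        w = torch.softmax(torch.stack([self.A(z), self.B(z)], dim=1), dim=1)
        a = w[:, 0].unsqueeze(-1).unsqueeze(-1)
        b = w[:, 1].unsqueeze(-1).unsqueeze(-1)
        return a * u1 + b * u2                # V_c = a_c * X^1_c + b_c * X^2_c

x = torch.randn(2, 32, 20, 20)
out = SKAttention(32)(x)
print(out.shape)  # torch.Size([2, 32, 20, 20])
```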
Dynamic Detection Head
For complex backgrounds in surveying drone imagery, we unify detection heads using triplet attention mechanisms:
$$
W(F) = \pi_C(\pi_S(\pi_L(F) \cdot F) \cdot F) \cdot F
$$
- Scale-aware attention $\pi_L$: fuses multi-level features
- Spatial-aware attention $\pi_S$: focuses on discriminative regions via deformable convolution
- Task-aware attention $\pi_C$: adjusts channel importance for classification and regression tasks
This unified head mitigates occlusion effects while enhancing small-object detection for surveying UAVs.
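The sequential composition $W(F)$ can be sketched as three chained gating stages. The code below is a heavily simplified illustration: the real scale-aware branch fuses multiple pyramid levels and the spatial branch uses deformable convolution, but here both are replaced by lightweight stand-ins (a global-context scalar gate and a plain 3×3 spatial mask) purely to show the attention ordering.

```python
import torch
import torch.nn as nn

class TripletAttentionHead(nn.Module):
    """Simplified sketch of W(F) = pi_C(pi_S(pi_L(F) . F) . F) . F."""
    def __init__(self, c):
        super().__init__()
        self.level_fc = nn.Linear(c, 1)               # pi_L stand-in: one gate per map
        self.spatial = nn.Conv2d(c, 1, 3, padding=1)  # pi_S stand-in for deformable conv
        self.task = nn.Linear(c, c)                   # pi_C: per-channel task gating

    def forward(self, f):
        g = f.mean(dim=(2, 3))                                  # (B, C) global context
        f = torch.sigmoid(self.level_fc(g))[:, :, None, None] * f       # scale-aware
        f = torch.sigmoid(self.spatial(f)) * f                          # spatial-aware
        f = torch.sigmoid(self.task(f.mean(dim=(2, 3))))[:, :, None, None] * f  # task-aware
        return f

feat = torch.randn(1, 64, 40, 40)
out = TripletAttentionHead(64)(feat)
print(out.shape)  # torch.Size([1, 64, 40, 40])
```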
Experimental Validation
Ablation Studies
Tests on VisDrone2019 demonstrate individual contributions:
| Components | mAP@0.5 | Params (M) | GFLOPs | FPS |
|---|---|---|---|---|
| Baseline (YOLOv8n) | 34.6% | 3.1 | 8.1 | 205.6 |
| + SPDs-Conv | 41.6% | 2.9 | 46.1 | 100.6 |
| + C2f_SKAttention | 35.5% | 5.3 | 14.8 | 78.1 |
| + Dynamic Head | 39.1% | 4.7 | 14.7 | 58.9 |
| Full Integration | 49.7% | 6.9 | 100.3 | 41.8 |
SPDs-Conv delivers the largest single-module improvement (+7.0 mAP points), reflecting its benefit for small-object detection, while Dynamic Head is most effective in occlusion scenarios (+4.5 points).
Performance Comparison
Benchmarking against state-of-the-art methods for surveying drones:
| Model | mAP@0.5 | Params (M) | FPS |
|---|---|---|---|
| YOLOv5s | 32.3% | 7.2 | 95.2 |
| YOLOv7-tiny | 35.0% | 6.0 | 126.1 |
| UAV-YOLOv8 | 47.0% | 10.3 | 51.0 |
| Drone-YOLO (tiny) | 42.8% | 5.4 | 52.0 |
| Ours | 49.7% | 6.9 | 41.8 |
Our model achieves an mAP@0.5 that is 15.1 percentage points above the YOLOv8n baseline (34.6% to 49.7%) while using fewer parameters than UAV-YOLOv8 (6.9 M vs. 10.3 M), demonstrating a favorable accuracy-efficiency trade-off for surveying UAV deployments.
Edge Deployment
Implemented on Jetson AGX Orin (64GB RAM, 2048 CUDA cores):
| Model Variant | mAP@0.5 | FPS (TensorRT) |
|---|---|---|
| 256-channel | 46.2% | 31.5 |
| 512-channel | 49.3% | 28.7 |
Real-time inference (≥28 FPS) confirms suitability for surveying drone edge applications.
Conclusion
Our enhanced YOLOv8n framework addresses critical challenges in surveying UAV-based object detection: SPDs-Conv preserves small-object features, C2f_SKAttention dynamically adapts to scale variations, and the unified detection head overcomes occlusion complexities. With 49.7% mAP@0.5 on VisDrone2019 and real-time performance on embedded hardware, this solution significantly advances the operational capabilities of surveying drones in diverse scenarios.
