Target detection technology based on onboard vision is fundamental for computer vision tasks in surveying drones, enabling critical functions like recognition, positioning, and tracking. Despite advancements, surveying UAVs face persistent challenges: small target sizes, extreme scale variations, mutual occlusion in complex scenes, and computational constraints on edge devices. To address these, we propose an enhanced YOLOv8n architecture integrating three novel components for efficient multi-scale object detection.

Algorithmic Improvements
SPDs-Conv Module
Traditional strided convolutions in YOLOv8n discard fine-grained features crucial for small objects in surveying UAV imagery. SPDs-Conv replaces the strided operation with a space-to-depth (SPD) transformation followed by a depthwise separable convolution. For an input feature map $X \in \mathbb{R}^{S \times S \times C_1}$, SPD decomposition with scale factor 2 generates four sub-maps:
$$
\begin{aligned}
f_{0,0} &= X[0:S:2, 0:S:2], \\
f_{1,0} &= X[1:S:2, 0:S:2], \\
f_{0,1} &= X[0:S:2, 1:S:2], \\
f_{1,1} &= X[1:S:2, 1:S:2].
\end{aligned}
$$
Concatenation along depth yields $X' \in \mathbb{R}^{\frac{S}{2} \times \frac{S}{2} \times 4C_1}$. Depthwise separable convolution then reduces dimensions to $X'' \in \mathbb{R}^{\frac{S}{2} \times \frac{S}{2} \times C_2}$, preserving spatial details while minimizing parameters. This enhancement is vital for surveying drones capturing small ground objects.
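The SPD transform and the subsequent depthwise separable convolution can be sketched in PyTorch as follows. This is a minimal illustration of the technique described above, not the authors' exact implementation; the class name, layer sizes, and the BatchNorm/SiLU choices are assumptions.

```python
import torch
import torch.nn as nn

class SPDsConv(nn.Module):
    """Sketch: space-to-depth (scale 2) followed by a depthwise
    separable convolution (layer choices are illustrative)."""
    def __init__(self, c1, c2):
        super().__init__()
        # Depthwise separable conv: per-channel 3x3 conv, then 1x1 pointwise.
        self.dw = nn.Conv2d(4 * c1, 4 * c1, 3, padding=1, groups=4 * c1, bias=False)
        self.pw = nn.Conv2d(4 * c1, c2, 1, bias=False)
        self.bn = nn.BatchNorm2d(c2)
        self.act = nn.SiLU()

    def forward(self, x):
        # SPD: the four strided sub-maps f_{0,0}, f_{1,0}, f_{0,1}, f_{1,1}
        # concatenated along the channel axis -> (B, 4*C1, S/2, S/2).
        x = torch.cat([x[..., 0::2, 0::2],
                       x[..., 1::2, 0::2],
                       x[..., 0::2, 1::2],
                       x[..., 1::2, 1::2]], dim=1)
        return self.act(self.bn(self.pw(self.dw(x))))

x = torch.randn(1, 64, 32, 32)
y = SPDsConv(64, 128)(x)
print(y.shape)  # torch.Size([1, 128, 16, 16])
```

Note that spatial resolution halves without any information being discarded: every input pixel survives in one of the four sub-maps, unlike a stride-2 convolution.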
C2f_SKAttention Module
To handle extreme scale variations in surveying UAV perspectives, we integrate SKAttention into C2f blocks. For feature map $X_i$, dual-branch convolution produces $X^1_i$ (3×3 kernel) and $X^2_i$ (5×5 kernel). Fused features $X'_i = X^1_i + X^2_i$ undergo channel-wise statistics aggregation via global average pooling:
$$
s_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} X'_c(i,j)
$$
Channel-wise attention weights are dynamically computed as:
$$
a_c = \frac{e^{\mathbf{A}_c \mathbf{z}}}{e^{\mathbf{A}_c \mathbf{z}} + e^{\mathbf{B}_c \mathbf{z}}}, \quad b_c = \frac{e^{\mathbf{B}_c \mathbf{z}}}{e^{\mathbf{A}_c \mathbf{z}} + e^{\mathbf{B}_c \mathbf{z}}}
$$
where $\mathbf{z} = \text{FC}(s)$. Output features $V_c = a_c \cdot X^1_c + b_c \cdot X^2_c$ adaptively emphasize optimal receptive fields for multi-scale targets.
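The selective-kernel mechanism above can be sketched compactly in PyTorch. The class below is a simplified illustration under stated assumptions (reduction ratio, ReLU after the FC layer, and plain convolutions for the two branches are our choices, not necessarily the paper's); the two-way softmax over the stacked logits reproduces the $a_c + b_c = 1$ constraint of the equations.

```python
import torch
import torch.nn as nn

class SKAttention(nn.Module):
    """Sketch of selective-kernel attention over a 3x3 and a 5x5 branch."""
    def __init__(self, c, reduction=4):
        super().__init__()
        self.branch3 = nn.Conv2d(c, c, 3, padding=1, bias=False)
        self.branch5 = nn.Conv2d(c, c, 5, padding=2, bias=False)
        d = max(c // reduction, 8)
        self.fc = nn.Linear(c, d)             # z = FC(s)
        self.A = nn.Linear(d, c, bias=False)  # logits for the 3x3 branch
        self.B = nn.Linear(d, c, bias=False)  # logits for the 5x5 branch

    def forward(self, x):
        u1, u2 = self.branch3(x), self.branch5(x)
        s = (u1 + u2).mean(dim=(2, 3))        # global average pooling -> s_c
        z = torch.relu(self.fc(s))
        # Softmax across the two branches yields a_c and b_c with a_c + b_c = 1.
        w = torch.softmax(torch.stack([self.A(z), self.B(z)], dim=1), dim=1)
        a = w[:, 0].unsqueeze(-1).unsqueeze(-1)
        b = w[:, 1].unsqueeze(-1).unsqueeze(-1)
        return a * u1 + b * u2                # V_c = a_c * X^1_c + b_c * X^2_c

x = torch.randn(2, 32, 20, 20)
out = SKAttention(32)(x)
print(out.shape)  # torch.Size([2, 32, 20, 20])
```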
Dynamic Detection Head
For complex backgrounds in surveying drone imagery, we unify detection heads using triplet attention mechanisms:
$$
W(F) = \pi_C(\pi_S(\pi_L(F) \cdot F) \cdot F) \cdot F
$$
- Scale-aware attention $\pi_L$: fuses multi-level features
- Spatial-aware attention $\pi_S$: focuses on discriminative regions via deformable convolution
- Task-aware attention $\pi_C$: adjusts channel importance for classification and regression tasks
This unified head mitigates occlusion effects while enhancing small-object detection for surveying UAVs.
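The sequential composition $W(F)$ can be sketched as three chained gating stages. The code below is a heavily simplified illustration: the real scale-aware branch fuses multiple pyramid levels and the spatial branch uses deformable convolution, but here both are replaced by lightweight stand-ins (a global-context scalar gate and a plain 3×3 spatial mask) purely to show the attention ordering.

```python
import torch
import torch.nn as nn

class TripletAttentionHead(nn.Module):
    """Simplified sketch of W(F) = pi_C(pi_S(pi_L(F) . F) . F) . F."""
    def __init__(self, c):
        super().__init__()
        self.level_fc = nn.Linear(c, 1)               # pi_L stand-in: one gate per map
        self.spatial = nn.Conv2d(c, 1, 3, padding=1)  # pi_S stand-in for deformable conv
        self.task = nn.Linear(c, c)                   # pi_C: per-channel task gating

    def forward(self, f):
        g = f.mean(dim=(2, 3))                                  # (B, C) global context
        f = torch.sigmoid(self.level_fc(g))[:, :, None, None] * f       # scale-aware
        f = torch.sigmoid(self.spatial(f)) * f                          # spatial-aware
        f = torch.sigmoid(self.task(f.mean(dim=(2, 3))))[:, :, None, None] * f  # task-aware
        return f

feat = torch.randn(1, 64, 40, 40)
out = TripletAttentionHead(64)(feat)
print(out.shape)  # torch.Size([1, 64, 40, 40])
```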
Experimental Validation
Ablation Studies
Tests on VisDrone2019 demonstrate individual contributions:
| Components | mAP@0.5 | Params (M) | GFLOPs | FPS |
|---|---|---|---|---|
| Baseline (YOLOv8n) | 34.6% | 3.1 | 8.1 | 205.6 |
| + SPDs-Conv | 41.6% | 2.9 | 46.1 | 100.6 |
| + C2f_SKAttention | 35.5% | 5.3 | 14.8 | 78.1 |
| + Dynamic Head | 39.1% | 4.7 | 14.7 | 58.9 |
| Full Integration | 49.7% | 6.9 | 100.3 | 41.8 |
SPDs-Conv delivers the largest single-module improvement (+7.0 mAP points), reflecting its benefit for small-object detection, while Dynamic Head is most effective in occlusion scenarios (+4.5 points).
Performance Comparison
Benchmarking against state-of-the-art methods for surveying drones:
| Model | mAP@0.5 | Params (M) | FPS |
|---|---|---|---|
| YOLOv5s | 32.3% | 7.2 | 95.2 |
| YOLOv7-tiny | 35.0% | 6.0 | 126.1 |
| UAV-YOLOv8 | 47.0% | 10.3 | 51.0 |
| Drone-YOLO (tiny) | 42.8% | 5.4 | 52.0 |
| Ours | 49.7% | 6.9 | 41.8 |
Our model achieves an mAP@0.5 that is 15.1 percentage points above the YOLOv8n baseline (34.6% to 49.7%) while using fewer parameters than UAV-YOLOv8 (6.9 M vs. 10.3 M), demonstrating a favorable accuracy-efficiency trade-off for surveying UAV deployments.
Edge Deployment
Implemented on Jetson AGX Orin (64GB RAM, 2048 CUDA cores):
| Model Variant | mAP@0.5 | FPS (TensorRT) |
|---|---|---|
| 256-channel | 46.2% | 31.5 |
| 512-channel | 49.3% | 28.7 |
Real-time inference (≥28 FPS) confirms suitability for surveying drone edge applications.
Conclusion
Our enhanced YOLOv8n framework addresses critical challenges in surveying UAV-based object detection: SPDs-Conv preserves small-object features, C2f_SKAttention dynamically adapts to scale variations, and the unified detection head overcomes occlusion complexities. With 49.7% mAP@0.5 on VisDrone2019 and real-time performance on embedded hardware, this solution significantly advances the operational capabilities of surveying drones in diverse scenarios.
