1. Introduction
UAV aerial photography presents unique challenges for object detection, including extreme scale variations among targets (e.g., pedestrians vs. vehicles), dense distributions in urban environments, and features obscured by shadows or occlusion. Conventional detectors struggle with these complexities, which leads to many missed detections (false negatives) and limited robustness. To address this, we propose an optimized YOLOv8s framework tailored for UAV scenarios. Our method integrates:

- Deformable Attention (DAttention) to adaptively refine receptive fields for irregularly shaped objects;
- Attention-based Intra-scale Feature Interaction (AIFI) for global-local feature fusion;
- EIoU Loss for precise bounding box regression.

Experiments on VisDrone2019 show relative gains over the baseline YOLOv8s of 2.5% in precision (56.7% to 58.1%), 1.7% in recall (45.8% to 46.6%), and 2.3% in mAP@0.5 (47.8% to 48.9%), with a slightly lower parameter count.
2. Methodology
2.1. Architecture Overview
Our enhanced YOLOv8s comprises:
- Backbone: Extracts multi-scale features.
- Neck: Aggregates features via PANet.
- Head: Generates detection outputs.
Key modifications:
- C2F_DAttention replaces the 9th-layer C2F module.
- AIFI substitutes the SPPF module.
- EIoU Loss optimizes bounding box regression.
2.2. C2F_DAttention Module
UAV targets often exhibit non-rigid geometries. DAttention dynamically adjusts attention regions using offset learning:
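Equations:
$$\Delta P = S \cdot \tanh\big(T(Q)\big), \qquad \bar{X} = \mathrm{Bilinear}(X,\, P + \Delta P)$$
$$Z^{(m)} = \sigma\!\left(\frac{Q^{(m)} {K^{(m)}}^{\top}}{\sqrt{d}}\right) V^{(m)}, \qquad F = \mathrm{Concat}\big(Z^{(1)}, \ldots, Z^{(M)}\big)\, W_F$$
Here $P$ denotes the reference points, $T(\cdot)$ the offset network applied to the queries $Q$, $S$ a scale factor that bounds the offsets, $\sigma$ the softmax, $d$ the per-head channel dimension, and $m = 1, \ldots, M$ the attention-head index; $K^{(m)}$ and $V^{(m)}$ are projected from the sampled features $\bar{X}$.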
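
The offset-and-sample step above (ΔP = S·tanh(T(Q)), followed by bilinear sampling at P + ΔP) can be illustrated with a small, framework-free sketch. This is not the paper's implementation: the function names `bilinear_sample` and `deformed_samples` are hypothetical, the feature map is a single-channel 2D grid, and the raw offsets are hard-coded stand-ins for the output of the offset network T(Q).

```python
import math

def bilinear_sample(feat, y, x):
    """Bilinearly interpolate a 2D grid `feat` (list of rows) at real-valued (y, x)."""
    h, w = len(feat), len(feat[0])
    # Clamp the sampling location to the valid range of the grid.
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    wy, wx = y - y0, x - x0
    top = (1 - wx) * feat[y0][x0] + wx * feat[y0][x1]
    bottom = (1 - wx) * feat[y1][x0] + wx * feat[y1][x1]
    return (1 - wy) * top + wy * bottom

def deformed_samples(feat, ref_points, raw_offsets, scale):
    """Sample `feat` at P + dP, where dP = scale * tanh(raw_offset)."""
    samples = []
    for (py, px), (oy, ox) in zip(ref_points, raw_offsets):
        # tanh keeps each offset within [-scale, scale] of its reference point.
        dy = scale * math.tanh(oy)
        dx = scale * math.tanh(ox)
        samples.append(bilinear_sample(feat, py + dy, px + dx))
    return samples

feat = [[0.0, 1.0, 2.0],
        [3.0, 4.0, 5.0],
        [6.0, 7.0, 8.0]]
ref_points = [(1.0, 1.0), (0.0, 0.0)]
raw_offsets = [(0.0, 0.0), (10.0, 10.0)]  # assumed stand-in for T(Q)
print(deformed_samples(feat, ref_points, raw_offsets, scale=0.5))
```

With a zero raw offset the sample stays on its reference point; with a large raw offset the shift saturates near `scale`, so the second sample lands close to the midpoint of the four top-left cells. In a full module, keys and values would then be projected from these sampled features.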
Table 1: Impact of C2F_DAttention on UAV Detection Performance
Model | mAP (%) | Precision (%) | Params (M) |
---|---|---|---|
YOLOv8s | 47.8 | 56.7 | 1.11 |
+ C2F_DAttention | 48.2 | 58.0 | 1.14 |
2.3. AIFI Module
AIFI enhances contextual modeling by fusing intra-scale features via Transformer encoders:
- Input: Feature map $X \in \mathbb{R}^{H \times W \times C}$.
- Process:
  - 2D positional encoding.
  - Feature serialization.
  - Transformer-based refinement.
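
The first two steps of this process can be sketched without any deep-learning framework. The code below is an illustration under assumptions, not the paper's code: `sincos_2d`, `serialize`, and `add_position` are hypothetical names, the sine-cosine layout (x codes first, then y codes) and the temperature of 10000 are one common convention, and the codes are simply added to the tokens. The Transformer encoder layer that would consume the resulting sequence is omitted.

```python
import math

def sincos_2d(height, width, channels, temperature=10000.0):
    """Return an (height*width) x channels list of 2D sine-cosine position codes."""
    if channels % 4 != 0:
        raise ValueError("channels must be divisible by 4")
    quarter = channels // 4
    omega = [1.0 / (temperature ** (i / quarter)) for i in range(quarter)]
    codes = []
    for y in range(height):        # row-major order, same as serialize() below
        for x in range(width):
            codes.append(
                [math.sin(x * o) for o in omega]
                + [math.cos(x * o) for o in omega]
                + [math.sin(y * o) for o in omega]
                + [math.cos(y * o) for o in omega]
            )
    return codes

def serialize(feature_map):
    """Flatten an H x W x C nested list into a sequence of H*W tokens of length C."""
    return [list(pixel) for row in feature_map for pixel in row]

def add_position(tokens, codes):
    """Add each position code element-wise to its token."""
    return [[t + c for t, c in zip(tok, code)] for tok, code in zip(tokens, codes)]

H, W, C = 2, 3, 4
feature_map = [[[0.0] * C for _ in range(W)] for _ in range(H)]
tokens = add_position(serialize(feature_map), sincos_2d(H, W, C))
print(len(tokens), len(tokens[0]))  # 6 4
print(tokens[0])                    # [0.0, 1.0, 0.0, 1.0]
```

The serialized sequence has H·W tokens of length C, which is the shape a standard Transformer encoder expects; after refinement the sequence would be reshaped back to H × W × C.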
Table 2: AIFI vs. SPPF Efficiency
Module | mAP (%) | Params (M) | GFLOPs |
---|---|---|---|
SPPF | 47.8 | 1.11 | 28.5 |
AIFI | 48.1 | 1.05 | 27.9 |
2.4. EIoU Loss Function
CIoU converges slowly and penalizes only the aspect-ratio mismatch rather than the actual width and height errors, which hinders UAV detection. EIoU decouples the width and height terms:
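Equation:
$$L_{EIoU} = 1 - IoU + \frac{\rho^2(b, b^{gt})}{c_w^2 + c_h^2} + \frac{\rho^2(w, w^{gt})}{c_w^2} + \frac{\rho^2(h, h^{gt})}{c_h^2}$$
where $c_w$ and $c_h$ are the width and height of the minimal enclosing box, $\rho(\cdot,\cdot)$ is the Euclidean distance, $b$ and $b^{gt}$ are the centers of the predicted and ground-truth boxes, and $w$, $h$, $w^{gt}$, $h^{gt}$ are their widths and heights.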
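
To make the three penalty terms concrete, here is a minimal plain-Python sketch of the EIoU loss for a single pair of boxes. The `(x1, y1, x2, y2)` box format, the function name `eiou_loss`, and the `eps` stabilizer are assumptions made for illustration; a training implementation would operate on batched tensors.

```python
def eiou_loss(box, gt, eps=1e-9):
    """EIoU loss for two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Intersection-over-union.
    iw = max(0.0, min(box[2], gt[2]) - max(box[0], gt[0]))
    ih = max(0.0, min(box[3], gt[3]) - max(box[1], gt[1]))
    inter = iw * ih
    w, h = box[2] - box[0], box[3] - box[1]
    wg, hg = gt[2] - gt[0], gt[3] - gt[1]
    union = w * h + wg * hg - inter
    iou = inter / (union + eps)

    # Width and height of the minimal enclosing box (c_w, c_h).
    cw = max(box[2], gt[2]) - min(box[0], gt[0])
    ch = max(box[3], gt[3]) - min(box[1], gt[1])

    # Squared distance between the two box centers.
    dx = (box[0] + box[2]) / 2 - (gt[0] + gt[2]) / 2
    dy = (box[1] + box[3]) / 2 - (gt[1] + gt[3]) / 2

    center_term = (dx * dx + dy * dy) / (cw * cw + ch * ch + eps)
    width_term = (w - wg) ** 2 / (cw * cw + eps)
    height_term = (h - hg) ** 2 / (ch * ch + eps)
    return 1.0 - iou + center_term + width_term + height_term

# Same center and height, but the prediction is twice as wide as the target:
# only the IoU and width terms contribute.
print(eiou_loss((0.0, 0.0, 4.0, 2.0), (1.0, 0.0, 3.0, 2.0)))
```

In the example the centers coincide and the heights match, so the loss is driven by the IoU term (0.5) plus the width penalty (0.25), which shows how width and height errors are penalized directly rather than through an aspect ratio.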
3. Experiments
3.1. Dataset and Setup
- Dataset: VisDrone2019 (6,471 training, 548 validation, 1,610 test images).
- Classes: Pedestrians, cars, buses, bicycles, etc.
- Metrics: Precision (P), Recall (R), mAP@0.5.
- Hardware: NVIDIA A800 GPU, PyTorch 2.1.2.
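
As a reminder of how precision and recall are obtained at an IoU threshold of 0.5, the following sketch matches detections to ground-truth boxes for one image and one class. It is a simplified illustration, not the VisDrone evaluation protocol: the greedy confidence-ordered matching, the function names, and the `(x1, y1, x2, y2)` box format are assumptions, and mAP@0.5 would additionally require averaging precision over recall levels and classes.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def precision_recall(detections, ground_truths, iou_thresh=0.5):
    """detections: list of (score, box); ground_truths: list of boxes."""
    matched = set()
    tp = 0
    # Visit detections from most to least confident.
    for _, box in sorted(detections, key=lambda d: d[0], reverse=True):
        best_iou, best_idx = 0.0, None
        for idx, gt in enumerate(ground_truths):
            if idx in matched:
                continue  # each ground-truth box can be matched only once
            overlap = iou(box, gt)
            if overlap > best_iou:
                best_iou, best_idx = overlap, idx
        if best_idx is not None and best_iou >= iou_thresh:
            matched.add(best_idx)
            tp += 1
    fp = len(detections) - tp
    fn = len(ground_truths) - tp
    precision = tp / (tp + fp) if detections else 0.0
    recall = tp / (tp + fn) if ground_truths else 0.0
    return precision, recall

gts = [(0, 0, 10, 10), (20, 20, 30, 30)]
dets = [(0.9, (1, 1, 10, 10)), (0.8, (50, 50, 60, 60)), (0.7, (0, 0, 9, 9))]
print(precision_recall(dets, gts))
```

Here the first detection matches the first ground-truth box, the second overlaps nothing, and the third is a duplicate of an already matched object, giving one true positive, two false positives, and one missed object.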
3.2. Ablation Study
Table 3: Component-wise Contributions
Configuration | mAP (%) | P (%) | R (%) | Params (M) |
---|---|---|---|---|
Baseline (YOLOv8s) | 47.8 | 56.7 | 45.8 | 1.11 |
+ DAttention | 48.2 | 58.0 | 46.2 | 1.14 |
+ AIFI | 48.1 | 56.9 | 46.5 | 1.05 |
+ EIoU | 48.0 | 57.5 | 45.9 | 1.11 |
Full Model (Ours) | 48.9 | 58.1 | 46.6 | 1.08 |
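
The headline percentages quoted in the Introduction can be checked against the first and last rows of Table 3. The short calculation below, using only values copied from the table, suggests that those figures are relative improvements rather than percentage-point differences; the function name `gains` is just a label for this illustration.

```python
baseline = {"mAP": 47.8, "P": 56.7, "R": 45.8}    # Baseline (YOLOv8s) row
full_model = {"mAP": 48.9, "P": 58.1, "R": 46.6}  # Full Model (Ours) row

def gains(base, new):
    """Return {metric: (absolute gain in points, relative gain in percent)}."""
    out = {}
    for key in base:
        absolute = new[key] - base[key]
        relative = 100.0 * absolute / base[key]
        out[key] = (round(absolute, 1), round(relative, 1))
    return out

for metric, (abs_pts, rel_pct) in gains(baseline, full_model).items():
    print(f"{metric}: +{abs_pts} points, +{rel_pct}% relative")
# mAP: +1.1 points, +2.3% relative
# P: +1.4 points, +2.5% relative
# R: +0.8 points, +1.7% relative
```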
3.3. Comparative Analysis
Table 4: Benchmarking on VisDrone2019
Model | mAP (%) | P (%) | R (%) | Params (M) |
---|---|---|---|---|
YOLOv5n | 41.5 | 51.5 | 39.8 | 0.25 |
YOLOv8s | 47.8 | 56.7 | 45.8 | 1.11 |
YOLOv10s | 47.4 | 55.0 | 45.6 | 0.80 |
YOLOv12s | 48.1 | 58.0 | 45.6 | 0.92 |
Ours | 48.9 | 58.1 | 46.6 | 1.08 |
3.4. Class-wise Performance
Table 5: Per-Class mAP Improvements
Class | YOLOv8s mAP (%) | Ours mAP (%) |
---|---|---|
Pedestrian | 55.5 | 56.7 |
Person | 43.2 | 44.5 |
Tricycle | 35.1 | 38.3 |
Awning-tricycle | 18.7 | 21.2 |
Mean | 47.8 | 48.9 |
4. Conclusion
Our enhanced YOLOv8s addresses critical challenges in UAV aerial photography:
- C2F_DAttention adapts to geometric deformations in crowded scenes.
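- AIFI captures global context; replacing SPPF with it cuts parameters by 5.4% on its own (Table 2), and the full model has 2.7% fewer parameters than the baseline (Table 3).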
- EIoU Loss accelerates convergence and improves box accuracy.

The model excels in low-light conditions, on occluded targets, and in multi-scale scenarios. Future work will optimize real-time inference for edge devices deployed on UAVs.