Robust Object Detection for UAV Aerial Photography via Enhanced YOLOv8s Architecture

1. Introduction

UAV aerial photography presents unique challenges for object detection, including extreme scale variations among targets (e.g., pedestrians vs. vehicles), dense distributions in urban environments, and obscured features due to shadows or occlusion. Conventional detectors struggle with these complexities, resulting in high false negatives and limited robustness. To address this, we propose an optimized YOLOv8s framework tailored for UAV scenarios. Our method integrates:

Deformable Attention (DAttention) to adaptively refine receptive fields for irregularly shaped objects;
Attention-based Intra-scale Feature Interaction (AIFI) for global-local feature fusion;
EIoU Loss for precise bounding box regression.

Experiments on VisDrone2019 demonstrate significant improvements: +2.5% precision, +1.7% recall, and +2.3% mAP over baseline YOLOv8s, with reduced computational costs.

2. Methodology

2.1. Architecture Overview

Our enhanced YOLOv8s comprises:

Backbone: Extracts multi-scale features.
Neck: Aggregates features via PANet.
Head: Generates detection outputs.

Key modifications:

C2F_DAttention replaces the 9th-layer C2F module.
AIFI substitutes the SPPF module.
EIoU Loss optimizes bounding box regression.

2.2. C2F_DAttention Module

UAV targets often exhibit non-rigid geometries. DAttention dynamically adjusts attention regions using offset learning:

Equations:ΔP=S⋅tanh(T(Q)),Xˉ=Bilinear(X,P+ΔP)ΔP=S⋅tanh(T(Q)),Xˉ=Bilinear(X,P+ΔP)Z(m)=σ(Q(m)K(m)d)V(m),F=Concat(Z(1),…,Z(m))WFZ(m)=σ(dQ(m)K(m))V(m),F=Concat(Z(1),…,Z(m))WF

Table 1: Impact of C2F_DAttention on UAV Detection Performance

Model	mAP (%)	Precision (%)	Params (M)
YOLOv8s	47.8	56.7	1.11
+ C2F_DAttention	48.2	58.0	1.14

2.3. AIFI Module

AIFI enhances contextual modeling by fusing intra-scale features via Transformer encoders:

Input: Feature map X∈RH×W×CX∈RH×W×C.
Process:
1. 2D positional encoding.
2. Feature serialization.
3. Transformer-based refinement.

Table 2: AIFI vs. SPPF Efficiency

Module	mAP (%)	Params (M)	GFLOPs
SPPF	47.8	1.11	28.5
AIFI	48.1	1.05	27.9

2.4. EIoU Loss Function

CIoU’s slow convergence and insensitivity to aspect ratios hinder UAV detection. EIoU decouples width/height optimization:

Equation:LEIoU=1−IoU+ρ2(b,bgt)cw2+ch2+ρ2(w,wgt)cw2+ρ2(h,hgt)ch2LEIoU=1−IoU+cw2+ch2ρ2(b,bgt)+cw2ρ2(w,wgt)+ch2ρ2(h,hgt)

where cw,chcw,ch are width/height of the minimal enclosing box.

3. Experiments

3.1. Dataset and Setup

Dataset: VisDrone2019 (6,471 training, 548 validation, 1,610 test images).
Classes: Pedestrians, cars, buses, bicycles, etc.
Metrics: Precision (P), Recall (R), mAP@0.5.
Hardware: NVIDIA A800 GPU, PyTorch 2.1.2.

3.2. Ablation Study

*Table 3: Component-wise Contributions*

Configuration	mAP (%)	P (%)	R (%)	Params (M)
Baseline (YOLOv8s)	47.8	56.7	45.8	1.11
+ DAttention	48.2	58.0	46.2	1.14
+ AIFI	48.1	56.9	46.5	1.05
+ EIoU	48.0	57.5	45.9	1.11
Full Model (Ours)	48.9	58.1	46.6	1.08

3.3. Comparative Analysis

Table 4: Benchmarking on VisDrone2019

Model	mAP (%)	P (%)	R (%)	Params (M)
YOLOv5n	41.5	51.5	39.8	0.25
YOLOv8s	47.8	56.7	45.8	1.11
YOLOv10s	47.4	55.0	45.6	0.80
YOLOv12s	48.1	58.0	45.6	0.92
Ours	48.9	58.1	46.6	1.08

3.4. Class-wise Performance

*Table 5: Per-Class mAP Improvements*

Class	YOLOv8s mAP (%)	Ours mAP (%)
Pedestrian	55.5	56.7
Person	43.2	44.5
Tricycle	35.1	38.3
Awning-tricycle	18.7	21.2
Mean	47.8	48.9

4. Conclusion

Our enhanced YOLOv8s addresses critical challenges in UAV aerial photography:

C2F_DAttention adapts to geometric deformations in crowded scenes.
AIFI captures global context while reducing parameters by 2.7%.
EIoU Loss accelerates convergence and improves box accuracy.

The model excels in low-light conditions, occluded targets, and multi-scale scenarios. Future work will optimize real-time inference for edge devices deployed on UAVs.