Robust Object Detection for UAV Aerial Photography via Enhanced YOLOv8s Architecture

1. Introduction

UAV aerial photography presents unique challenges for object detection, including extreme scale variations among targets (e.g., pedestrians vs. vehicles), dense distributions in urban environments, and obscured features due to shadows or occlusion. Conventional detectors struggle with these complexities, resulting in high false negatives and limited robustness. To address this, we propose an optimized YOLOv8s framework tailored for UAV scenarios. Our method integrates:

  • Deformable Attention (DAttention) to adaptively refine receptive fields for irregularly shaped objects;
  • Attention-based Intra-scale Feature Interaction (AIFI) for global-local feature fusion;
  • EIoU Loss for precise bounding box regression.

Experiments on VisDrone2019 demonstrate significant improvements: +2.5% precision+1.7% recall, and +2.3% mAP over baseline YOLOv8s, with reduced computational costs.


2. Methodology

2.1. Architecture Overview

Our enhanced YOLOv8s comprises:

  • Backbone: Extracts multi-scale features.
  • Neck: Aggregates features via PANet.
  • Head: Generates detection outputs.

Key modifications:

  1. C2F_DAttention replaces the 9th-layer C2F module.
  2. AIFI substitutes the SPPF module.
  3. EIoU Loss optimizes bounding box regression.

2.2. C2F_DAttention Module

UAV targets often exhibit non-rigid geometries. DAttention dynamically adjusts attention regions using offset learning:

Equations:ΔP=S⋅tanh(T(Q)),Xˉ=Bilinear(X,P+ΔP)ΔP=S⋅tanh(T(Q)),Xˉ=Bilinear(X,PP)Z(m)=σ(Q(m)K(m)d)V(m),F=Concat(Z(1),…,Z(m))WFZ(m)=σ(dQ(m)K(m)​)V(m),F=Concat(Z(1),…,Z(m))WF

Table 1: Impact of C2F_DAttention on UAV Detection Performance

ModelmAP (%)Precision (%)Params (M)
YOLOv8s47.856.71.11
+ C2F_DAttention48.258.01.14

2.3. AIFI Module

AIFI enhances contextual modeling by fusing intra-scale features via Transformer encoders:

  • Input: Feature map X∈RH×W×CX∈RH×W×C.
  • Process:
    1. 2D positional encoding.
    2. Feature serialization.
    3. Transformer-based refinement.

Table 2: AIFI vs. SPPF Efficiency

ModulemAP (%)Params (M)GFLOPs
SPPF47.81.1128.5
AIFI48.11.0527.9

2.4. EIoU Loss Function

CIoU’s slow convergence and insensitivity to aspect ratios hinder UAV detection. EIoU decouples width/height optimization:

Equation:LEIoU=1−IoU+ρ2(b,bgt)cw2+ch2+ρ2(w,wgt)cw2+ρ2(h,hgt)ch2LEIoU​=1−IoU+cw2​+ch2​ρ2(b,bgt)​+cw2​ρ2(w,wgt)​+ch2​ρ2(h,hgt)​

where cw,chcw​,ch​ are width/height of the minimal enclosing box.


3. Experiments

3.1. Dataset and Setup

  • Dataset: VisDrone2019 (6,471 training, 548 validation, 1,610 test images).
  • Classes: Pedestrians, cars, buses, bicycles, etc.
  • Metrics: Precision (P), Recall (R), mAP@0.5.
  • Hardware: NVIDIA A800 GPU, PyTorch 2.1.2.

3.2. Ablation Study

*Table 3: Component-wise Contributions*

ConfigurationmAP (%)P (%)R (%)Params (M)
Baseline (YOLOv8s)47.856.745.81.11
+ DAttention48.258.046.21.14
+ AIFI48.156.946.51.05
+ EIoU48.057.545.91.11
Full Model (Ours)48.958.146.61.08

3.3. Comparative Analysis

Table 4: Benchmarking on VisDrone2019

ModelmAP (%)P (%)R (%)Params (M)
YOLOv5n41.551.539.80.25
YOLOv8s47.856.745.81.11
YOLOv10s47.455.045.60.80
YOLOv12s48.158.045.60.92
Ours48.958.146.61.08

3.4. Class-wise Performance

*Table 5: Per-Class mAP Improvements*

ClassYOLOv8s mAP (%)Ours mAP (%)
Pedestrian55.556.7
Person43.244.5
Tricycle35.138.3
Awning-tricycle18.721.2
Mean47.848.9

4. Conclusion

Our enhanced YOLOv8s addresses critical challenges in UAV aerial photography:

  1. C2F_DAttention adapts to geometric deformations in crowded scenes.
  2. AIFI captures global context while reducing parameters by 2.7%.
  3. EIoU Loss accelerates convergence and improves box accuracy.

The model excels in low-light conditions, occluded targets, and multi-scale scenarios. Future work will optimize real-time inference for edge devices deployed on UAVs.

Scroll to Top