With the rapid advancement of technology, unmanned drones have become widely used in fields such as military reconnaissance, traffic monitoring, agricultural surveys, and disaster rescue thanks to their high flexibility, efficiency, and low cost. However, drone-based detection of small targets still faces numerous challenges, including insufficiently lightweight models and suboptimal detection performance. Current object detection algorithms are primarily based on deep learning and can be divided into two-stage and one-stage approaches. Two-stage methods, such as Faster R-CNN, first generate candidate regions and then classify and regress them; one-stage methods, such as the YOLO series, perform regression and classification directly on the entire image without generating candidate regions. YOLOv8 is a state-of-the-art detector that has substantially improved detection efficiency and accuracy through various enhancements, yet it still has limitations in small target detection. Downsampling during feature extraction discards shallow features and, with them, critical details of small targets. Its multi-scale fusion capability is limited by weak cross-level feature interaction, so information about small targets in scenarios such as dense crowds is not integrated effectively, leading to missed detections and false alarms. In addition, the detection head's feature allocation strategy for small targets needs improvement: small targets are easily interfered with by background noise, and in complex backgrounds their features can be confused with the background, increasing detection errors.
To address these issues, we propose an improved small target detection method for unmanned drone aerial images. Our approach systematically enhances the detector along three dimensions: feature fusion, detection head design, and the loss function. For feature fusion, insufficient integration of deep semantics and shallow details is a major cause of small target information loss. Our main contributions are as follows: (1) We introduce two feature fusion modules in the neck network, Expansive Gated Attentive Fusion (EGAF) and Convergent Gated Aggregation Fusion (CGAF). These modules strengthen cross-level feature interaction through a bidirectional mutual gating mechanism, enabling the model to extract rich and diverse feature information from small targets without relying on image enhancement or other preprocessing techniques. (2) To address the lack of multi-scale perception in traditional detection heads, we design a Scale-aware Multigranular Fusion Head (SMF-Head). It employs adaptive receptive fields and cross-scale fusion to improve performance while preserving feature extraction efficiency and avoiding complex non-standard operators. (3) Traditional IoU's insensitivity to small target localization is a key cause of low bounding box regression accuracy. We therefore propose a Digging Gaussian-Weighted IoU (DGW-IoU) loss, which increases the sensitivity of bounding box regression to the central regions of small targets through pixel-level Gaussian weighting and a digging mechanism. The overall architecture of our improved algorithm is illustrated below.

The improved algorithm integrates EGAF and CGAF modules in the neck network to enhance feature interaction, adds a P2 output layer to supplement high-resolution details, replaces the original detection head with SMF-Head for multi-scale fusion, and uses DGW-IoU loss for better bounding box regression. In the following sections, we detail each component and present experimental results to validate our approach.
1. Introduction
Unmanned drones have revolutionized various applications by providing aerial perspectives for tasks like surveillance, mapping, and inspection. However, detecting small targets from unmanned drone aerial images remains a challenging problem due to factors such as low pixel occupancy, significant background interference, occlusion, and varying lighting conditions. Small targets, such as pedestrians, vehicles, or bicycles in drone imagery, often occupy only a few pixels, making it difficult for models to extract discriminative features. Additionally, the complex backgrounds in aerial scenes can introduce noise that hampers detection accuracy.
Deep learning-based object detectors, particularly the YOLO series, have shown remarkable success in general object detection. YOLOv8 incorporates advances such as an anchor-free decoupled head and the C2f module to improve efficiency. Yet it still struggles with small targets because of inherent limitations in feature preservation and fusion. For example, repeated downsampling in the backbone can erase fine-grained details crucial for small targets. Moreover, the feature pyramid network (FPN) used in YOLOv8 may not integrate multi-scale features effectively, leading to semantic misalignment and poor detection performance for small objects.
To overcome these challenges, we propose a comprehensive improvement to YOLOv8 tailored for unmanned drone aerial small target detection. Our method focuses on enhancing feature fusion through novel modules, redesigning the detection head for better scale awareness, and introducing a specialized loss function for precise bounding box regression. By doing so, we aim to achieve higher detection accuracy while keeping the model lightweight, which is essential for deployment on the resource-constrained edge devices commonly used with unmanned drones.
2. Related Work
Small target detection in unmanned drone imagery has attracted significant research attention. Various methods have been proposed to improve detection performance, including feature enhancement, multi-scale fusion, and loss function optimization. For instance, some approaches incorporate attention mechanisms to highlight small target regions, while others use feature pyramid networks with improved connections to better aggregate features across scales. In terms of loss functions, variants like DIoU, CIoU, and SIoU have been developed to address issues in bounding box regression. However, many of these methods either increase computational complexity or fail to specifically address the unique challenges of small targets in drone images, such as extreme scale variations and background clutter.
Recent improvements to YOLOv8 for drone detection include YOLOv8-CEBI, which introduces multi-scale attention but increases memory access costs, and YOLO-MFL, which stacks multiple compute-intensive components, leading to high computational overhead. Other works like YOLO-DDE use deformable convolutions to enhance feature fusion but may complicate hardware deployment. In contrast, our approach aims to balance accuracy and efficiency by designing lightweight modules that enhance feature interaction without relying on heavy operations. Additionally, we propose a loss function that explicitly considers the characteristics of small targets, such as their central regions being more informative than edges.
3. Methodology
Our improved algorithm builds upon YOLOv8 by modifying the neck network, detection head, and loss function. We first describe the overall architecture, then detail each component.
3.1 Overall Architecture
The baseline YOLOv8 model consists of a backbone for feature extraction, a neck for feature fusion, and a head for detection. We retain the backbone but enhance the neck with EGAF and CGAF modules, add a P2 output layer for higher resolution features, replace the detection head with SMF-Head, and use DGW-IoU loss for training. This structure allows for better preservation and integration of small target features across scales.
3.2 EGAF and CGAF Modules
To address the semantic misalignment and background noise in feature fusion, we propose EGAF and CGAF. Unlike existing methods such as PANet or BiFPN that use simple weighted sums, our modules employ a mutual attention mechanism to selectively enhance useful details and suppress noise. Both modules are designed with aligned interactions, receiving three-scale features \(F_1, F_2, F_3\) together with residual features \(F_{far}\) and \(F_{farthest}\).
CGAF is deployed in shallow layers to maximize spatial details for small targets. It aligns features via convolution and aggregates them as follows:
$$ F_{agg}^{CGAF} = CBS(Cat(F_1^{CGAF}, Up_1(F_2^{CGAF}), Up_2(F_3^{CGAF}))) $$
where \(Up\) denotes upsampling, \(Cat\) denotes channel concatenation, and \(CBS\) represents Conv-BN-SiLU. Then, a Mutual Attention Light (MAL) module is used to incorporate residual features:
$$ F_{out}^{CGAF} = MAL(F_{agg}^{CGAF}, F_{farthest}^{CGAF}, F_{far}^{CGAF}) + \lambda \cdot CBS(F_1^{CGAF}) $$
where \(\lambda\) is a learnable parameter initialized to 0.
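To make the aggregation step concrete, the following is a minimal PyTorch sketch of computing \(F_{agg}^{CGAF}\) and the \(\lambda\)-scaled residual; the channel counts, 1×1 kernel, and nearest-neighbour upsampling mode are assumptions, and the aggregated feature would then pass through the MAL module described below.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CBS(nn.Module):
    """Conv-BN-SiLU block, as used in the formulas above."""
    def __init__(self, c_in, c_out, k=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))


class CGAFAggregate(nn.Module):
    """Aggregation step of CGAF: upsample F2 and F3 to F1's resolution,
    concatenate along channels, then fuse with a CBS block."""
    def __init__(self, c1, c2, c3, c_out):
        super().__init__()
        self.fuse = CBS(c1 + c2 + c3, c_out, k=1)
        self.res = CBS(c1, c_out, k=1)
        # Learnable residual weight lambda, initialised to 0 as in the text.
        self.lam = nn.Parameter(torch.zeros(1))

    def forward(self, f1, f2, f3):
        f2_up = F.interpolate(f2, size=f1.shape[-2:], mode="nearest")
        f3_up = F.interpolate(f3, size=f1.shape[-2:], mode="nearest")
        f_agg = self.fuse(torch.cat([f1, f2_up, f3_up], dim=1))
        # f_agg is subsequently processed by MAL together with the residual
        # features; the CGAF output then adds self.lam * self.res(f1).
        return f_agg, self.lam * self.res(f1)
```

EGAF's aggregation is analogous, except that \(F_3\) is first compressed by the sum of average and max pooling before concatenation and the \(\lambda\)-scaled residual comes from \(F_2\).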
EGAF is deployed in deeper layers to balance high-level semantics and detail preservation. It aggregates features using both average and max pooling for context:
$$ F_3^{EGAF'} = avgpool(F_3^{EGAF}) + maxpool(F_3^{EGAF}) $$
$$ F_{agg}^{EGAF} = CBS(Cat(F_1^{EGAF}, F_2^{EGAF}, F_3^{EGAF'})) $$
$$ F_{out}^{EGAF} = MAL(F_{agg}^{EGAF}, F_{farthest}^{EGAF}, F_{far}^{EGAF}) + \lambda \cdot CBS(F_2^{EGAF}) $$
The MAL module consists of channel and spatial gates. The channel gate generates attention weights by using features from the other branch, enabling bidirectional screening:
$$ G_{ch}(F) = \sigma(Conv(ReLU(Conv(F)))) $$
where \(\sigma\) is the Sigmoid function. The spatial gate generates a shared attention map to suppress background noise:
$$ G_{sp}(F') = \sigma(Conv(ReLU(Conv(avgpool(F'))))) $$
The final output of MAL is computed as:
$$ MAL(\cdot) = lp \cdot CBS(Cat(F''_{agg}, F''_{far})) + F_{farthest} $$
where \(lp\) is a learnable parameter. Through EGAF and CGAF, our model effectively integrates small-scale details with contextual information.
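The sketch below shows one way the MAL gates could be realised in PyTorch, following the equations above. The reduction ratio, kernel sizes, the use of separate (non-shared) channel gates for the two directions, and the reading of the spatial-gate pooling as a channel-wise mean are assumptions; all three inputs are assumed to have been projected to the same shape.

```python
import torch
import torch.nn as nn


def cbs(c_in, c_out, k=1):
    """Conv-BN-SiLU helper."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, padding=k // 2, bias=False),
        nn.BatchNorm2d(c_out),
        nn.SiLU(),
    )


class MAL(nn.Module):
    """Mutual Attention Light sketch: bidirectional channel gates, a shared
    spatial gate, and a learnable scale lp on the fused branch, with
    F_farthest as the residual path."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        def channel_gate():
            # G_ch(F) = sigma(Conv(ReLU(Conv(F)))) with 1x1 convolutions.
            return nn.Sequential(
                nn.Conv2d(channels, channels // reduction, 1),
                nn.ReLU(inplace=True),
                nn.Conv2d(channels // reduction, channels, 1),
                nn.Sigmoid(),
            )
        self.gate_from_far = channel_gate()   # re-weights F_agg using F_far
        self.gate_from_agg = channel_gate()   # re-weights F_far using F_agg
        self.spatial_gate = nn.Sequential(    # shared single-channel map
            nn.Conv2d(1, 8, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(8, 1, 3, padding=1),
            nn.Sigmoid(),
        )
        self.fuse = cbs(2 * channels, channels, k=1)
        self.lp = nn.Parameter(torch.zeros(1))  # learnable scale

    def forward(self, f_agg, f_farthest, f_far):
        # Bidirectional screening: each branch is gated by the other one.
        f_agg_g = f_agg * self.gate_from_far(f_far)
        f_far_g = f_far * self.gate_from_agg(f_agg)
        # Shared spatial map, read here as a channel-wise average (assumption).
        sp = self.spatial_gate(torch.mean(f_agg_g + f_far_g, dim=1, keepdim=True))
        f_agg_g, f_far_g = f_agg_g * sp, f_far_g * sp
        # MAL(.) = lp * CBS(Cat(F''_agg, F''_far)) + F_farthest
        return self.lp * self.fuse(torch.cat([f_agg_g, f_far_g], dim=1)) + f_farthest
```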
3.3 SMF-Head
The original YOLOv8 detection head uses simple convolutional stacks, which lack multi-scale feature capture and attention mechanisms. This can lead to feature confusion and increased detection errors in high-density small target scenarios. We propose SMF-Head to address these issues. It consists of two main components: Multi-Granularity Micro-Fusion (MGMF) and Scale-aware Token Mixer (STM).
MGMF employs a serial residual structure with Adaptive Receptive Field (ARF) and Dual-Pooling Squeeze Excitation (DPSE) submodules. ARF captures local and broader context without increasing network depth:
$$ F_{ARF} = BN(Conv2d(\|_{r=1}^3 DW_r(X))) $$
where \(\|_{r=1}^{3}\) denotes channel concatenation of the three depthwise convolution outputs \(DW_r\), and \(r\) is the dilation rate. DPSE enhances channel sensitivity using both average and max pooling:
$$ z = \frac{avgpool(X) + maxpool(X)}{2} $$
$$ F_{DPSE} = Sigmoid(Conv2d(SiLU(Conv2d(z)))) + X $$
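The ARF and DPSE submodules can be sketched in PyTorch as follows; the 3×3 depthwise kernels and the squeeze ratio are assumptions, and the additive residual in DPSE follows the formula above as written.

```python
import torch
import torch.nn as nn


class ARF(nn.Module):
    """Adaptive Receptive Field sketch: three parallel depthwise convolutions
    with dilation rates r = 1, 2, 3, channel-concatenated, then fused by a
    1x1 convolution and BatchNorm."""
    def __init__(self, channels):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, padding=r, dilation=r,
                      groups=channels, bias=False)          # depthwise DW_r
            for r in (1, 2, 3)
        ])
        self.fuse = nn.Conv2d(3 * channels, channels, 1, bias=False)
        self.bn = nn.BatchNorm2d(channels)

    def forward(self, x):
        y = torch.cat([b(x) for b in self.branches], dim=1)
        return self.bn(self.fuse(y))


class DPSE(nn.Module):
    """Dual-Pooling Squeeze Excitation sketch:
    z = (avgpool(X) + maxpool(X)) / 2, then Conv-SiLU-Conv-Sigmoid, added to X
    (taken literally from the formula above)."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.avg = nn.AdaptiveAvgPool2d(1)
        self.max = nn.AdaptiveMaxPool2d(1)
        self.fc = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),
            nn.SiLU(),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        z = (self.avg(x) + self.max(x)) / 2
        return self.fc(z) + x
```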
STM uses a serial residual structure with Cross Scale Fusion (CSF) for dynamic interaction. CSF simulates self-attention efficiently using convolutions:
$$ Q, K, V = Conv2d_{q,k,v}(X) $$
$$ X_{CSF} = Conv2d(Conv2d(V \odot \sigma(\sum_c (Q_c \odot K)))) $$
where \(\odot\) denotes element-wise multiplication, and \(\sum_c\) denotes summation along channels. The final output of STM is:
$$ F_{main} = Conv2d(CBS(X_{CSF} + X)) $$
$$ F_{gate} = Sigmoid(Conv2d(avgpool(X))) $$
$$ X_{out} = X + F_{main} \odot F_{gate} $$
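A minimal sketch of CSF and the surrounding STM residual and gating structure is given below, assuming 1×1 projections for Q, K, V and reading \(\sum_c\) as a channel-dimension reduction to a single-channel attention map; kernel sizes are assumptions.

```python
import torch
import torch.nn as nn


class CSF(nn.Module):
    """Cross Scale Fusion sketch: a convolutional stand-in for self-attention.
    Q, K, V come from 1x1 convolutions; the attention map is
    sigma(sum_c(Q * K)), applied to V, followed by two convolutions."""
    def __init__(self, channels):
        super().__init__()
        self.q = nn.Conv2d(channels, channels, 1)
        self.k = nn.Conv2d(channels, channels, 1)
        self.v = nn.Conv2d(channels, channels, 1)
        self.proj = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.Conv2d(channels, channels, 1),
        )

    def forward(self, x):
        q, k, v = self.q(x), self.k(x), self.v(x)
        attn = torch.sigmoid((q * k).sum(dim=1, keepdim=True))   # (B,1,H,W)
        return self.proj(v * attn)


class STM(nn.Module):
    """Scale-aware Token Mixer sketch: CSF in a serial residual structure,
    modulated by a pooled sigmoid gate, as in the equations above."""
    def __init__(self, channels):
        super().__init__()
        self.csf = CSF(channels)
        self.main = nn.Sequential(                     # Conv2d(CBS(.))
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.SiLU(),
            nn.Conv2d(channels, channels, 1),
        )
        self.gate = nn.Sequential(                     # Sigmoid(Conv2d(avgpool(X)))
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        f_main = self.main(self.csf(x) + x)
        return x + f_main * self.gate(x)
```

In this reading, collapsing \(Q \odot K\) to a single channel keeps the cost linear in the number of pixels rather than quadratic as in full self-attention.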
SMF-Head thus enables multi-scale feature fusion and semi-decoupled classification and regression, improving detection accuracy for small targets from unmanned drones.
3.4 DGW-IoU Loss
Traditional IoU-based losses assume all pixels in a bounding box are equally important, which is not suitable for small targets. For small targets, edge pixels are more susceptible to background noise, and center deviations can cause large IoU variations. We propose DGW-IoU to address these issues. It introduces pixel-level Gaussian weighting and a digging mechanism to focus on central regions.
First, a 2D Gaussian kernel is defined centered at the ground truth box center \(c = (c_x, c_y)\):
$$ G(x, y; c, \sigma) = \exp\left(-\frac{(x-c_x)^2 + (y-c_y)^2}{2\sigma^2}\right) $$
The Gaussian-weighted area for a rectangle \(B\) is computed via the error function:
$$ \mathcal{I}(B; c, \sigma) = \frac{\pi\sigma^2}{2} \Delta erf_x(B; \sigma) \Delta erf_y(B; \sigma) $$
where \(\Delta erf_x(B; \sigma) = erf\left(\frac{x_2 - c_x}{\sqrt{2}\sigma}\right) - erf\left(\frac{x_1 - c_x}{\sqrt{2}\sigma}\right)\). To enhance tolerance to center deviations, a digging coefficient \(\kappa \in (0,1)\) is introduced:
$$ \mathcal{I}_{exc}(B; c, \sigma, \kappa) = \mathcal{I}(B; c, \sigma) – \mathcal{I}(B; c, \kappa\sigma) $$
The kernel width is set adaptively based on ground truth box dimensions:
$$ \sigma_{gt} = \eta_{gt} \sqrt{w_{gt} h_{gt}} $$
where \(\eta_{gt}\) is a scale-adaptive coefficient set to 0.5. Then, the weighted areas for prediction box \(B_{pred}\), ground truth box \(B_{gt}\), and their intersection \(B_{int}\) are computed:
$$ S_{pred} = \mathcal{I}(B_{pred}; c, \sigma_{gt}) $$
$$ S_{gt} = \mathcal{I}_{exc}(B_{gt}; c, \sigma_{gt}, \kappa) $$
$$ S_{int} = \mathcal{I}(B_{int}; c, \sigma_{gt}) $$
The DGW-IoU is defined as:
$$ IoU_G = \frac{S_{int}}{S_{pred} + S_{gt} – S_{int} + \varepsilon} $$
$$ \mathcal{L}_{DGW-IoU} = -\log(IoU_G) $$
where \(\varepsilon\) is a small constant for numerical stability. This loss function improves bounding box regression for small targets by emphasizing central pixels and reducing the impact of edge noise.
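For illustration, the following is a PyTorch sketch of the loss under the equations above, assuming boxes in (x1, y1, x2, y2) format; \(\eta_{gt}=0.5\) follows the text, while the digging coefficient value used here is only illustrative, since the text constrains \(\kappa \in (0,1)\) without fixing it.

```python
import math
import torch


def gauss_area(box, c, sigma):
    """Gaussian-weighted area of an axis-aligned box (x1, y1, x2, y2) under a
    2D Gaussian centred at c = (cx, cy): (pi * sigma^2 / 2) * d_erf_x * d_erf_y."""
    x1, y1, x2, y2 = box.unbind(-1)
    cx, cy = c.unbind(-1)
    s = sigma * math.sqrt(2.0)
    d_erf_x = torch.erf((x2 - cx) / s) - torch.erf((x1 - cx) / s)
    d_erf_y = torch.erf((y2 - cy) / s) - torch.erf((y1 - cy) / s)
    return (math.pi * sigma ** 2 / 2.0) * d_erf_x * d_erf_y


def dgw_iou_loss(pred, gt, eta=0.5, kappa=0.3, eps=1e-7):
    """DGW-IoU loss for boxes of shape (..., 4); kappa = 0.3 is illustrative."""
    # Ground-truth centre and adaptive kernel width sigma_gt = eta * sqrt(w * h).
    cx = (gt[..., 0] + gt[..., 2]) / 2.0
    cy = (gt[..., 1] + gt[..., 3]) / 2.0
    c = torch.stack([cx, cy], dim=-1)
    w = (gt[..., 2] - gt[..., 0]).clamp(min=eps)
    h = (gt[..., 3] - gt[..., 1]).clamp(min=eps)
    sigma = eta * torch.sqrt(w * h)

    # Intersection rectangle; an empty intersection contributes zero area.
    ix1 = torch.maximum(pred[..., 0], gt[..., 0])
    iy1 = torch.maximum(pred[..., 1], gt[..., 1])
    ix2 = torch.minimum(pred[..., 2], gt[..., 2])
    iy2 = torch.minimum(pred[..., 3], gt[..., 3])
    empty = (ix2 <= ix1) | (iy2 <= iy1)
    inter = torch.stack([ix1, iy1, ix2, iy2], dim=-1)

    # Gaussian-weighted areas; the ground-truth term uses the digging mechanism.
    s_pred = gauss_area(pred, c, sigma)
    s_gt = gauss_area(gt, c, sigma) - gauss_area(gt, c, kappa * sigma)
    s_int = torch.where(empty, torch.zeros_like(sigma), gauss_area(inter, c, sigma))

    iou_g = s_int / (s_pred + s_gt - s_int + eps)
    return -torch.log(iou_g.clamp(min=eps))
```

In a YOLOv8-style trainer, this term would slot in where the baseline computes its CIoU box loss, with the other loss components left unchanged.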
4. Experiments
We evaluate our improved algorithm on the VisDrone2019 dataset, which contains realistic unmanned drone aerial images with small targets like pedestrians, vehicles, and bicycles. The dataset presents challenges such as varying scales, occlusion, and complex backgrounds. We compare our method with baseline YOLOv8 and other state-of-the-art approaches to demonstrate its effectiveness.
4.1 Experimental Setup
All experiments are conducted with input images resized to 640×640 pixels. Training is performed for 300 epochs with a batch size of 16 on two NVIDIA T4 GPUs, using Python 3.12.4, PyTorch 2.5.1, and CUDA 12.4. The evaluation metrics include precision (P), recall (R), mean average precision at an IoU threshold of 0.5 (mAP50), and mAP averaged over IoU thresholds from 0.5 to 0.95 in steps of 0.05 (mAP50:95). We also report the number of parameters (Params) and GFLOPs to assess model complexity.
4.2 Ablation Study
We conduct ablation experiments to analyze the contribution of each component. The results are summarized in Table 1.
| ID | YOLOv8n | P2 | EGAF | CGAF | SMF-Head | DGW-IoU | P (%) | R (%) | mAP50 | mAP50:95 | Param (M) | GFLOPs |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | √ | | | | | | 43.4 | 33.3 | 32.9 | 19.0 | 3.15 | 8.7 |
| 2 | √ | | √ | | | | 44.2 | 34.3 | 34.1 | 20.1 | 3.29 | 10.0 |
| 3 | √ | | | √ | | | 45.3 | 34.0 | 34.4 | 20.4 | 3.41 | 10.0 |
| 4 | √ | | | | √ | | 43.8 | 32.9 | 32.6 | 19.2 | 2.61 | 7.0 |
| 5 | √ | | | | | √ | 44.3 | 32.7 | 32.8 | 19.0 | 3.15 | 8.7 |
| 6 | √ | √ | | | | | 48.4 | 38.4 | 38.4 | 21.9 | 3.35 | 17.2 |
| 7 | √ | √ | √ | | | | 49.3 | 38.7 | 39.6 | 22.3 | 3.53 | 21.5 |
| 8 | √ | √ | √ | √ | | | 51.6 | 39.2 | 41.3 | 24.8 | 1.49 | 23.4 |
| 9 | √ | √ | √ | √ | √ | | 51.4 | 40.5 | 42.2 | 25.6 | 1.19 | 17.9 |
| 10 | √ | √ | √ | √ | √ | √ | 52.7 | 41.2 | 42.7 | 25.7 | 1.19 | 17.9 |
From Table 1, we observe that each component contributes to performance improvement. Adding EGAF (ID 2) increases mAP50 by 1.2 percentage points, while CGAF (ID 3) boosts mAP50 by 1.5 percentage points. SMF-Head (ID 4) reduces parameters and GFLOPs while maintaining accuracy. DGW-IoU (ID 5) improves precision by 0.9 percentage points. When combined with P2 layer and all modules (ID 10), our full model achieves the best results with mAP50 of 42.7% and mAP50:95 of 25.7%, representing significant gains over the baseline while reducing parameters by 59.4%.
4.3 Comparison with State-of-the-Art Methods
We compare our method with recent baseline models and improved algorithms on the VisDrone2019 validation set. The results are shown in Table 2.
| Method | P (%) | R (%) | mAP50 | mAP50:95 | Param (M) | GFLOPs | Size (MB) |
|---|---|---|---|---|---|---|---|
| YOLOv8n | 44.9 | 33.6 | 32.9 | 19.0 | 3.1 | 8.7 | 6.4 |
| YOLOv5s | 50.4 | 38.3 | 37.8 | 22.4 | 9.1 | 24.0 | 14.4 |
| YOLO11s | 48.8 | 38.0 | 37.9 | 22.9 | 9.4 | 21.6 | 18.2 |
| YOLOv10s | 50.1 | 39.5 | 38.4 | 23.1 | 8.0 | 24.8 | 16.1 |
| YOLOv8s | 49.2 | 37.8 | 38.7 | 23.0 | 11.1 | 28.5 | 21.4 |
| TA-YOLO-n | 50.2 | 38.9 | 40.1 | 24.1 | 3.8 | 14.1 | – |
| YOLO-DDEs | – | – | 40.3 | 24.3 | 10.5 | 31.8 | – |
| YOLOv8s-CEBI | 51.3 | 39.2 | 40.6 | 24.4 | 5.6 | 20.9 | – |
| FCDM-YOLOv8n | 52.6 | 38.9 | 41.1 | 24.3 | 1.8 | 23.7 | 3.9 |
| Ours-n | 52.7 | 41.2 | 42.7 | 25.7 | 1.2 | 17.9 | 2.9 |
| DD-YOLO | 54.3 | 42.3 | 43.9 | 26.7 | 9.6 | 30.4 | 18.7 |
| YOLO-RC | 55.0 | 42.9 | 44.7 | 27.7 | 9.4 | 27.8 | – |
| YOLO-MFL | 53.8 | 43.7 | 45.3 | 27.2 | 12.0 | 64.1 | – |
| TA-YOLO-s | – | – | 45.4 | 27.7 | 13.9 | 43.3 | – |
| HM-YOLOs | 56.5 | 43.4 | 46.2 | 28.4 | 8.3 | 35.0 | 17.5 |
| DI-YOLO | – | – | 47.1 | 29.0 | 26.1 | 96.3 | – |
| FCDM-YOLOv8s | 57.2 | 45.0 | 47.6 | 28.6 | – | – | – |
| BF-YOLO | – | – | 48.1 | – | 3.7 | 57.1 | – |
| Ours-s | 57.4 | 46.5 | 49.1 | 30.1 | 3.4 | 49.5 | 7.0 |
Ours-n achieves an mAP50 of 42.7% and an mAP50:95 of 25.7%, outperforming YOLOv8n by 9.8 and 6.7 percentage points respectively, while reducing parameters by 59.4% and model size by 55%. It also surpasses larger models such as YOLOv8s-CEBI and YOLO-DDEs. Ours-s achieves an mAP50 of 49.1% and an mAP50:95 of 30.1%, exceeding YOLOv8s by 10.4 and 7.1 percentage points, and outperforms other advanced methods such as YOLO-MFL, DI-YOLO, and BF-YOLO in accuracy at lower computational cost.
4.4 Loss Function Comparison
We compare DGW-IoU with other popular loss functions using our improved framework. The results are presented in Table 3.
| Method | P (%) | R (%) | mAP50 | mAP50:95 |
|---|---|---|---|---|
| CIoU | 51.4 | 40.5 | 42.2 | 25.6 |
| EIoU | 51.8 | 39.8 | 41.7 | 25.3 |
| SIoU | 51.7 | 40.1 | 42.2 | 25.7 |
| NWD | 52.5 | 40.9 | 42.5 | 25.5 |
| Focaler IoU | 52.5 | 40.0 | 42.4 | 25.0 |
| DGW-IoU (ours) | 52.7 | 41.2 | 42.7 | 25.7 |
DGW-IoU achieves the highest mAP50 and recall among all loss functions, demonstrating its effectiveness for small target detection. Compared to NWD, which also uses Gaussian distributions, DGW-IoU improves mAP50 by 0.2 percentage points and recall by 0.3 percentage points, indicating better handling of small target characteristics.
4.5 Inference Speed Analysis
We measure inference speed on a single NVIDIA T4 GPU. The results are shown in Table 4.
| Method | Pre (ms) | Infer (ms) | Post (ms) | Total (ms) |
|---|---|---|---|---|
| YOLOv8n | 1.7 | 6.2 | 1.8 | 9.7 |
| Ours-n | 1.7 | 22.2 | 1.3 | 25.2 |
While our model's inference time is higher due to the added modules, a total latency of 25.2 ms (roughly 40 FPS on the T4) remains feasible for real-time applications on unmanned drones, especially given the accuracy improvements and reduced parameter count.
4.6 Visualization Analysis
We visualize detection results with a confidence threshold of 0.25. Comparisons between YOLOv8n and our improved algorithm show that our method better detects small targets in challenging scenarios such as strong sunlight, shadow occlusion, dense crowds, and low-light conditions. For example, in sunny scenes our algorithm captures more deformed pedestrian targets and reduces misclassifications such as windows being detected as cars or people as motors. In shadowed environments it detects more motor targets missed by the baseline. In dense, overlapping scenes it lowers the missed detection rate, and in night scenes it exhibits superior low-light detection performance. These visual results confirm the robustness of our approach for unmanned drone aerial images.
5. Conclusion
In this work, we address the challenges of small target detection in unmanned drone aerial images by proposing an improved YOLOv8-based algorithm. We identify limitations in shallow information preservation and multi-scale fusion in the baseline model and introduce systematic enhancements. Specifically, we develop EGAF and CGAF modules for better cross-level feature interaction, add a P2 output layer for high-resolution details, design SMF-Head for multi-scale fusion, and propose the DGW-IoU loss for improved bounding box regression. Experiments on the VisDrone2019 dataset demonstrate that our method significantly improves detection accuracy: mAP50 increases by 9.8 percentage points and mAP50:95 by 6.7 percentage points compared with YOLOv8n, while parameters are reduced by 59.4%. These results validate the effectiveness of our approach for small target detection and its suitability for edge deployment on unmanned drones. Future work will focus on further reducing computational overhead and improving inference speed without compromising accuracy.
