Enhanced YOLOv8 for Small Object Detection in Drone Imagery

Efficient and accurate object detection in aerial imagery is a core requirement for drone applications such as aerial surveillance and autonomous navigation. We address a persistent challenge in this setting: detecting small objects in drone-captured images. General-purpose deep learning detectors often struggle with the characteristics of drone footage, namely low pixel occupancy per object, heavy background interference, and high target density. In this study, we present a systematic enhancement of the YOLOv8 baseline that targets feature fusion, detection head design, and loss function optimization, with the aim of improving both accuracy and efficiency in real-world deployments.

The core of our work lies in the architectural improvements to the neck network and detection head of YOLOv8. We introduce two novel feature fusion modules: the Expansive Gated Attentive Fusion (EGAF) and the Convergent Gated Aggregation Fusion (CGAF). These modules are strategically placed within the network to enhance cross-layer interaction, ensuring that shallow, high-resolution features containing critical details for small objects are effectively combined with deeper, semantically rich features. Unlike traditional methods that rely on simple weighted summations, our approach employs a bidirectional mutual gating mechanism to selectively amplify informative features while suppressing background noise. This is particularly beneficial for drone technology, where images often contain cluttered backgrounds and occlusions.

To further augment the model’s capability for small object detection, we introduce a new P2 output layer. This layer adds a high-resolution pathway to the feature pyramid, allowing the network to capture finer spatial details that are essential for identifying objects that occupy only a few pixels in the image. The integration of the P2 layer, combined with our EGAF and CGAF modules, creates a robust multi-scale feature representation that is highly sensitive to small-scale targets.
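To make the benefit of the P2 pathway concrete, the sketch below computes the feature-map grid at each pyramid level and how many cells a small object spans. It assumes the usual YOLOv8 strides (8, 16, 32 for P3 to P5) with P2 at stride 4, and the 640×640 input used in the experiments. The 12-pixel object and the helper names `grid_size` and `cells_covered` are illustrative only.

```python
# Assumed strides: P2 = 4 (the added layer), P3..P5 = 8, 16, 32 (standard YOLOv8).
STRIDES = {"P2": 4, "P3": 8, "P4": 16, "P5": 32}


def grid_size(image_size: int, stride: int) -> int:
    """Side length of the feature map for a square input."""
    return image_size // stride


def cells_covered(object_size_px: float, stride: int) -> float:
    """Side length of an object measured in feature-map cells."""
    return object_size_px / stride


if __name__ == "__main__":
    for name, stride in STRIDES.items():
        g = grid_size(640, stride)
        span = cells_covered(12, stride)
        print(f"{name}: {g}x{g} grid, a 12 px object spans {span:.2f} cells per side")
```

Under these assumptions, a 12-pixel object covers three cells per side on P2 but less than half a cell on P5, which is why the high-resolution pathway helps with targets that occupy only a few pixels.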

The detection head is another critical component that we have redesigned. Our Scale-aware Multigranular Fusion Head (SMF-Head) replaces the conventional decoupled head found in YOLOv8. The SMF-Head is designed to perform multi-granularity feature fusion, enabling the model to adaptively focus on different scales of objects. It incorporates a series of lightweight attention mechanisms and convolutional operations that do not introduce significant computational overhead. This is a crucial aspect of drone technology, where onboard processing power is often limited. By maintaining a balance between performance and efficiency, our SMF-Head ensures that the model can be deployed on edge devices without compromising on detection accuracy.

Finally, we address the limitations of traditional Intersection over Union (IoU) loss functions for small object regression. We propose the Digging Gaussian-Weighted IoU (DGW-IoU) loss, which introduces a pixel-level Gaussian weighting scheme centered on the ground-truth bounding box. This weight map prioritizes the core region of the object, reducing the influence of noisy pixels at the boundary. Furthermore, a unique “digging” mechanism is applied, creating a crater-like weight distribution that enhances tolerance for minor centroid shifts in tiny objects. This innovation is a direct response to the specific challenges of drone technology, where small objects often appear as blobs with ambiguous edges.

Unmanned Aerial Vehicle (UAV) platforms are central to drone technology. Integrating detection algorithms like ours into such platforms can enhance their autonomous capabilities, enabling missions such as traffic monitoring, agricultural surveying, and disaster response with greater precision.

Our methodology is validated through extensive experiments on the VisDrone2019 dataset, a benchmark specifically designed for drone technology applications. This dataset includes challenging scenarios such as varying lighting conditions, severe occlusions, and high-density object clusters. The experimental results are presented in the following sections, demonstrating the efficacy of our approach.

Methodology

We base our improvements on the YOLOv8n framework. The overall architecture consists of a backbone for feature extraction, a neck for feature fusion, and a head for prediction. Our modifications are concentrated in the neck and head, which are described in the subsections below. The backbone remains largely unchanged, leveraging its efficient feature extraction capabilities. The key enhancements are as follows:

1. EGAF and CGAF Feature Fusion Modules

To address the semantic gap between high-level and low-level features, we propose two distinct fusion modules: EGAF for deep layers and CGAF for shallow layers. Both modules take three input feature maps from different scales and a residual feature from the farthest scale. The core of these modules is the Mutual Attention Light (MAL) block, which employs channel and spatial gating mechanisms. The operations for CGAF are defined as:

$$F_{agg}^{CGAF} = CBS(Cat(F_1^{CGAF}, Up_1(F_2^{CGAF}), Up_2(F_3^{CGAF})))$$

$$F_{out}^{CGAF} = MAL(F_{agg}^{CGAF}, F_{farthest}^{CGAF}, F_{far}^{CGAF}) + \lambda \cdot CBS(F_1^{CGAF})$$

Where CBS denotes a convolution-batch normalization-SiLU block, Cat is concatenation, and Up is upsampling. The parameter λ is learnable and initialized to 0. The EGAF module uses a similar structure but focuses on the medium-resolution feature and incorporates both average and max pooling for context aggregation. The MAL block itself uses channel and spatial gates to selectively fuse information from the aggregated and local residual features, enhancing the representation of small objects. The formulas for channel gate and spatial gate are:

$$G_{ch}(F) = \sigma(Conv(ReLU(Conv(F))))$$

$$G_{sp}(F') = \sigma(Conv(ReLU(Conv(avgpool(F')))))$$

Where σ is the sigmoid function. This design ensures that for drone technology, the model can effectively prioritize relevant details while ignoring irrelevant background clutter.
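As a minimal illustration of the channel gate formula, the sketch below evaluates it at a single pixel in plain Python. It assumes both convolutions are 1×1 (so each reduces to a matrix-vector product over channels), uses a hypothetical 4-to-2-to-4 bottleneck with hand-picked weights, and assumes the gate is applied by element-wise multiplication. None of these details are specified in the text above.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def conv1x1(vec, weights):
    """A 1x1 convolution at one pixel is a matrix-vector product over channels."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def channel_gate(vec, w_reduce, w_expand):
    """G_ch(F) = sigmoid(Conv(ReLU(Conv(F)))), evaluated at a single pixel."""
    hidden = [max(0.0, h) for h in conv1x1(vec, w_reduce)]  # ReLU
    return [sigmoid(z) for z in conv1x1(hidden, w_expand)]


def apply_gate(vec, gate):
    """Assumed usage: scale each channel by its gate value in (0, 1)."""
    return [g * v for g, v in zip(gate, vec)]


if __name__ == "__main__":
    feature = [1.0, -2.0, 0.5, 3.0]                      # hypothetical 4-channel pixel
    w_reduce = [[0.5, 0.0, 0.0, 0.5], [0.0, 0.5, 0.5, 0.0]]  # 4 -> 2 channels
    w_expand = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]  # 2 -> 4 channels
    gate = channel_gate(feature, w_reduce, w_expand)
    print("gate:", [round(g, 3) for g in gate])
    print("gated feature:", [round(v, 3) for v in apply_gate(feature, gate)])
```

Because the sigmoid bounds every gate value between 0 and 1, channels with a low gate are attenuated rather than removed outright, which is the behavior one would expect from a mechanism meant to suppress clutter while keeping informative detail.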

2. Scale-aware Multigranular Fusion Head (SMF-Head)

The SMF-Head is designed to replace the standard YOLOv8 detection head. It consists of a Multi-Granularity Micro-Fusion (MGMF) block and a Scale-aware Token Mixer (STM) block. The MGMF block uses a combination of an Adaptive Receptive Field (ARF) module and a Dual-Pooling Squeeze Excitation (DPSE) module to enhance single-layer features. The ARF employs multiple dilated convolutions to capture multi-scale context, while the DPSE uses a combination of average and max pooling to recalibrate channel-wise features. The STM block, inspired by MetaFormer, performs efficient cross-scale fusion without the computational cost of traditional self-attention. The CSF (Cross-Scale Fusion) operation within STM is defined as:

$$X_{CSF} = Conv2d(Conv2d(V \odot \sigma(\sum_c(Q_c \odot K))))$$

This operation reduces complexity to a linear level, making it suitable for edge deployment in drone technology. The final output of the SMF-Head is shared between classification and regression branches, enabling high-quality predictions for small targets.
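To see why linear complexity matters on high-resolution feature maps, the rough count below compares the number of pairwise token interactions in standard self-attention (quadratic in the token count N) with a per-token gating scheme (linear in N). The grid sizes assume a 640×640 input with strides 4 to 32. The counts ignore channel width and constant factors, so they are order-of-magnitude illustrations rather than measured FLOPs.

```python
def token_count(image_size: int, stride: int) -> int:
    """Number of spatial positions (tokens) on a square feature map."""
    side = image_size // stride
    return side * side


def pairwise_interactions(n: int) -> int:
    """Self-attention compares every token with every other token: O(N^2)."""
    return n * n


def per_token_interactions(n: int) -> int:
    """A gating scheme touches each token a constant number of times: O(N)."""
    return n


if __name__ == "__main__":
    for stride in (4, 8, 16, 32):
        n = token_count(640, stride)
        ratio = pairwise_interactions(n) // per_token_interactions(n)
        print(f"stride {stride:>2}: N = {n:>6}, quadratic/linear ratio = {ratio}")
```

At stride 4 the map has 25,600 positions, so a quadratic scheme would need roughly 25,600 times more interactions than a linear one, which is the regime where the linear formulation pays off most.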

3. DGW-IoU Loss Function

To improve the regression accuracy for small objects, we propose DGW-IoU. This loss function uses a 2D Gaussian kernel centered on the ground-truth box to weight pixels within the bounding box. The weighted area of a rectangle B is computed as:

$$\mathcal{I}(B; c, \sigma) = \iint_{(x,y) \in B} \exp\bigg(-\frac{(x-c_x)^2 + (y-c_y)^2}{2\sigma^2}\bigg) \, dx \, dy$$

This integral has a closed-form solution in terms of the error function, so it is cheap to compute. The kernel width adapts to the object scale: σ_gt = η_gt √(w_gt h_gt), where w_gt and h_gt are the width and height of the ground-truth box and η_gt is a scale factor. Furthermore, a “digging” mechanism is applied to the ground-truth area to create a crater-like weight distribution, defined as:

$$\mathcal{I}_{exc}(B; c, \sigma, \kappa) = \mathcal{I}(B; c, \sigma) - \mathcal{I}(B; c, \kappa\sigma)$$

Where κ is the digging coefficient (set to 0.12 for small objects). This increases the model’s tolerance to minor centroid shifts, which is a common issue in drone technology. The final DGW-IoU loss is:

$$\mathcal{L}_{\text{DGW-IoU}} = -\log\bigg(\frac{S_{int}}{S_{pred} + S_{gt} - S_{int} + \varepsilon}\bigg)$$

This log-based formulation further amplifies the loss for low-overlap scenarios, focusing the learning process on small, difficult objects.
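The pieces above can be sketched in a few lines of plain Python. The Gaussian integral separates in x and y, and each one-dimensional factor equals σ·√(π/2)·[erf((b−c)/(σ√2)) − erf((a−c)/(σ√2))]. Several details are assumptions made for illustration and are not taken from the text: the value `eta=0.5`, boxes in `(x1, y1, x2, y2)` format, applying the same dug weight to all three areas (which keeps the ratio in [0, 1]), and clamping the ratio before the logarithm so that disjoint boxes do not produce log(0).

```python
import math


def gauss_1d(a: float, b: float, c: float, sigma: float) -> float:
    """Closed-form integral of exp(-(x - c)^2 / (2 sigma^2)) over [a, b]."""
    s = sigma * math.sqrt(2.0)
    return sigma * math.sqrt(math.pi / 2.0) * (math.erf((b - c) / s) - math.erf((a - c) / s))


def gauss_area(box, center, sigma: float) -> float:
    """Gaussian-weighted area I(B; c, sigma) of an axis-aligned box."""
    x1, y1, x2, y2 = box
    if x2 <= x1 or y2 <= y1:  # empty box, e.g. no intersection
        return 0.0
    return gauss_1d(x1, x2, center[0], sigma) * gauss_1d(y1, y2, center[1], sigma)


def dug_area(box, center, sigma: float, kappa: float = 0.12) -> float:
    """I_exc = I(B; c, sigma) - I(B; c, kappa * sigma): the crater-shaped weight."""
    return gauss_area(box, center, sigma) - gauss_area(box, center, kappa * sigma)


def dgw_iou_loss(pred, gt, eta: float = 0.5, kappa: float = 0.12, eps: float = 1e-7) -> float:
    """Sketch of the loss; eta and the treatment of S_pred / S_int are assumptions."""
    gx1, gy1, gx2, gy2 = gt
    center = ((gx1 + gx2) / 2.0, (gy1 + gy2) / 2.0)
    sigma = eta * math.sqrt((gx2 - gx1) * (gy2 - gy1))  # sigma_gt = eta * sqrt(w * h)
    inter = (max(pred[0], gx1), max(pred[1], gy1), min(pred[2], gx2), min(pred[3], gy2))
    s_int = dug_area(inter, center, sigma, kappa)
    s_pred = dug_area(pred, center, sigma, kappa)
    s_gt = dug_area(gt, center, sigma, kappa)
    ratio = s_int / (s_pred + s_gt - s_int + eps)
    return -math.log(max(ratio, eps))  # clamp is a guard added here, not in the formula


if __name__ == "__main__":
    gt_box = (0.0, 0.0, 8.0, 8.0)
    for shift in (0.0, 1.0, 4.0, 20.0):
        pred_box = (shift, 0.0, 8.0 + shift, 8.0)
        print(f"shift {shift:>4}: loss = {dgw_iou_loss(pred_box, gt_box):.4f}")
```

In this sketch the loss is close to zero for a perfect match and grows as the predicted box slides away from the ground truth, saturating at −log(ε) once the boxes no longer overlap.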

Experiments

We conducted all experiments on the VisDrone2019 dataset, a benchmark of drone-captured imagery. Images were resized to 640×640 pixels. Training ran for 300 epochs on two NVIDIA T4 GPUs with PyTorch 2.5.1 and CUDA 12.4. The performance metrics are precision (P), recall (R), mean Average Precision at an IoU threshold of 0.5 (mAP50), and mean Average Precision averaged over IoU thresholds from 0.5 to 0.95 (mAP50:95).

Ablation Studies

To evaluate the contribution of each proposed component, we performed a series of ablation experiments. The results are summarized in the table below. The baseline (ID 1) is YOLOv8n. IDs 2 to 6 each add a single component to the baseline, and IDs 7 to 10 stack the components cumulatively on top of the P2 layer. The final model (ID 10) includes all improvements and demonstrates a significant performance leap.

ID	YOLOv8n	P2	EGAF	CGAF	SMF-Head	DGW-IoU	P (%)	R (%)	mAP50	mAP50:95	Params (M)	GFLOPs
1	✓	–	–	–	–	–	43.4	33.3	32.9	19.0	3.15	8.7
2	✓	–	✓	–	–	–	44.2	34.3	34.1	20.1	3.29	10.0
3	✓	–	–	✓	–	–	45.3	34.0	34.4	20.4	3.41	10.0
4	✓	–	–	–	✓	–	43.8	32.9	32.6	19.2	2.61	7.0
5	✓	–	–	–	–	✓	44.3	32.7	32.8	19.0	3.15	8.7
6	✓	✓	–	–	–	–	48.4	38.4	38.4	21.9	3.35	17.2
7	✓	✓	✓	–	–	–	49.3	38.7	39.6	22.3	3.53	21.5
8	✓	✓	✓	✓	–	–	51.6	39.2	41.3	24.8	1.49	23.4
9	✓	✓	✓	✓	✓	–	51.4	40.5	42.2	25.6	1.19	17.9
10	✓	✓	✓	✓	✓	✓	52.7	41.2	42.7	25.7	1.19	17.9

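The results show that the components work best in combination. Added to the baseline on their own, the SMF-Head and DGW-IoU leave mAP50 essentially unchanged (IDs 4 and 5), but on top of P2, EGAF, and CGAF they bring further gains (IDs 9 and 10). The final model (Ours-n) improves mAP50 by 9.8 percentage points over the baseline (from 32.9 to 42.7), while also reducing the parameter count by 59.4%. This demonstrates the effectiveness of our design for efficient drone technology applications.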
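The 9.8 figure is an absolute difference between two mAP50 scores. The short sketch below recomputes it from the table values for IDs 1 and 10 and also shows the corresponding relative gain; the helper names are illustrative.

```python
def absolute_gain(new: float, old: float) -> float:
    """Difference in percentage points."""
    return new - old


def relative_gain_pct(new: float, old: float) -> float:
    """Gain expressed as a percentage of the old value."""
    return (new - old) / old * 100.0


if __name__ == "__main__":
    baseline_map50, final_map50 = 32.9, 42.7  # IDs 1 and 10 in the ablation table
    print(round(absolute_gain(final_map50, baseline_map50), 1))      # 9.8 points
    print(round(relative_gain_pct(final_map50, baseline_map50), 1))  # 29.8 percent
```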

Comparison with State-of-the-Art

We compared our method with various baseline and advanced methods. The table below presents the results for both our lightweight version (Ours-n) and a slightly larger variant (Ours-s). Each of our models achieves the highest accuracy in its size class with the lowest parameter count, although Ours-s has a higher GFLOPs figure than most s-scale competitors.

Method	P (%)	R (%)	mAP50	mAP50:95	Params (M)	GFLOPs
YOLOv8n	44.9	33.6	32.9	19.0	3.1	8.7
YOLOv5s	50.4	38.3	37.8	22.4	9.1	24.0
YOLO11s	48.8	38.0	37.9	22.9	9.4	21.6
YOLOv8s	49.2	37.8	38.7	23.0	11.1	28.5
TA-YOLO-n	50.2	38.9	40.1	24.1	3.8	14.1
YOLOv8s-CEBI	51.3	39.2	40.6	24.4	5.6	20.9
FCDM-YOLOv8n	52.6	38.9	41.1	24.3	1.8	23.7
Ours-n	52.7	41.2	42.7	25.7	1.2	17.9
YOLO-MFL	53.8	43.7	45.3	27.2	12.0	64.1
TA-YOLO-s	–	–	45.4	27.7	13.9	43.3
HM-YOLOs	56.5	43.4	46.2	28.4	8.3	35.0
FCDM-YOLOv8s	57.2	45.0	47.6	28.6	–	–
Ours-s	57.4	46.5	49.1	30.1	3.4	49.5

Ours-n surpasses larger models such as YOLOv8s and YOLOv8s-CEBI in every accuracy metric, and it does so with far fewer parameters and fewer GFLOPs. Ours-s achieves the best accuracy in the table with only 3.4M parameters. This efficiency makes our approach well suited to real-time applications in drone technology.

Loss Function Comparison

We further evaluate our DGW-IoU loss against other popular loss functions. The results are summarized below.

Method	P (%)	R (%)	mAP50	mAP50:95
CIoU	51.4	40.5	42.2	25.6
EIoU	51.8	39.8	41.7	25.3
SIoU	51.7	40.1	42.2	25.7
NWD	52.5	40.9	42.5	25.5
Focaler IoU	52.5	40.0	42.4	25.0
DGW-IoU (Ours)	52.7	41.2	42.7	25.7

DGW-IoU achieves the highest precision, recall, and mAP50 among the compared losses, and ties SIoU for the best mAP50:95 (25.7). Its design mitigates the adverse effects of background noise and centroid shift, which are prevalent in drone technology applications.

Inference Speed Analysis

We also tested the inference speed on a single T4 GPU. Our model is slower than the extremely lightweight YOLOv8n, with the inference stage rising from 6.2 ms to 22.2 ms, but it remains fast enough for real-time use given the accuracy gains it provides.

Method	Pre (ms)	Infer (ms)	Post (ms)	Total (ms)
YOLOv8n	1.7	6.2	1.8	9.7
Ours-n	1.7	22.2	1.3	25.2

The total inference time of 25.2 ms per image (equivalent to approximately 40 FPS) is adequate for many real-time drone technology tasks, such as surveillance and traffic monitoring.
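Throughput follows directly from total latency. The sketch below converts the per-image totals in the table to frames per second and reports the slowdown relative to YOLOv8n. It assumes frames are processed one at a time, with no batching or pipelining.

```python
def fps(total_ms: float) -> float:
    """Frames per second for a given per-image latency in milliseconds."""
    return 1000.0 / total_ms


if __name__ == "__main__":
    yolov8n_ms, ours_n_ms = 9.7, 25.2  # "Total (ms)" column of the table
    print(round(fps(ours_n_ms), 1))             # 39.7 FPS
    print(round(fps(yolov8n_ms), 1))            # 103.1 FPS
    print(round(ours_n_ms / yolov8n_ms, 2))     # 2.6x slower end to end
```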

Conclusion

In this paper, we have presented a comprehensive enhancement of YOLOv8 for small object detection in drone imagery. By introducing a P2 output layer, EGAF and CGAF modules for improved feature fusion, an SMF-Head for multi-scale perception, and a DGW-IoU loss for robust regression, we have significantly improved detection accuracy. Ours-n improves mAP50 by 9.8 percentage points while reducing model parameters by 59.4% compared to the YOLOv8n baseline. These results validate the effectiveness of our approach for challenging aerial imagery scenarios. Future work will focus on faster inference without compromising the accuracy required for advanced drone technology applications.