Improved EAL-YOLOv8 for Small Target Detection in UAV Scenarios

In recent years, the rapid development of UAV technology has driven the widespread application of small target detection in fields such as intelligent transportation, disaster assessment, and surveillance. However, detecting small objects in UAV aerial images remains challenging because they occupy few pixels, appear against complex backgrounds, and carry little feature information. To address these issues, we propose an enhanced YOLOv8 algorithm named EAL-YOLOv8, which integrates an Efficient Local and Spatial Attention (ELSA) module, the Scale Sequence Feature Fusion (SSFF) and Triple Feature Encoder (TFE) modules from ASF-YOLO, and an additional small target detection layer. Experiments on the VisDrone2019 dataset show that our method improves mAP50 by 5.9 percentage points while reducing the parameter count by 11.4% compared to the baseline YOLOv8s. This work contributes to UAV perception systems by enabling more accurate and efficient small object detection.

1. Introduction

With the increasing deployment of UAV platforms in civil and military applications, the demand for real-time and precise object detection from aerial perspectives has grown significantly. Small targets, such as pedestrians, vehicles, and bicycles, often occupy only a few pixels in high-resolution images, which makes them prone to missed detections and poor localization. Traditional two-stage detectors like R-CNN and Fast R-CNN offer high accuracy but suffer from slow inference, while one-stage detectors like YOLO and SSD are faster but struggle with small objects. Among them, YOLOv8 has become a popular baseline due to its balance of speed and accuracy. However, its deep feature maps lose the fine-grained information that small targets depend on.

To overcome these limitations, we propose EAL-YOLOv8, which introduces three key improvements: (1) an ELSA attention mechanism that combines efficient local attention and spatial attention to precisely locate small objects without dimensionality reduction; (2) the integration of SSFF and TFE modules from ASF-YOLO to enhance multi-scale feature fusion and capture local details; (3) an additional 160×160 detection head to preserve shallow features and improve sensitivity to tiny objects. Our contributions are as follows:

  • We propose ELSA, a novel attention block that boosts spatial feature focusing with minimal parameter overhead.
  • We adapt SSFF and TFE into the YOLOv8 neck to strengthen cross-scale information integration.
  • We append a dedicated small target detection layer to significantly reduce missed detections.
  • We achieve the highest mAP50 among the compared methods on VisDrone2019, a widely used benchmark for UAV object detection.

2. Related Work

Small target detection has been extensively studied. Earlier works like SSD and YOLOv3 introduced multi-scale predictions, but their performance on tiny objects remains limited. Recent improvements include attention mechanisms (e.g., CBAM, CA, ELA) and feature pyramid enhancements (e.g., BiFPN, ASF-YOLO). For UAV scenarios, datasets such as VisDrone and UAVDT have driven specialized algorithms. However, many existing methods increase computational cost or parameter count. EAL-YOLOv8 aims for a better trade-off by relying on lightweight modules.

3. Proposed Method

3.1 Overview of EAL-YOLOv8

The overall architecture of EAL-YOLOv8 is illustrated in Figure 1. It consists of a backbone (Conv + C2f + SPPF), a neck (including SSFF and TFE), and a head with four detection scales (160×160, 80×80, 40×40, 20×20). The ELSA module is inserted after the last C2f block in the backbone to refine spatial attention before feeding into SPPF.

Figure 1: Overall architecture of EAL-YOLOv8 (not shown).

3.2 ELSA Attention Module

The ELSA module combines Efficient Local Attention (ELA) and Spatial Attention. ELA captures horizontal and vertical positional information using strip pooling and 1D convolutions, avoiding dimensionality reduction. The spatial attention branch learns to focus on informative regions via a separate pathway. The output is given by:

$$
y_h = \sigma(G_n(F_h(z_h)))
$$
$$
y_w = \sigma(G_n(F_w(z_w)))
$$
$$
Y = x_c \times y_h \times y_w
$$

where \(x_c\) is the input feature map of channel \(c\), \(z_h\) and \(z_w\) are its strip-pooled descriptors along the horizontal and vertical directions, \(y_h\) and \(y_w\) are the corresponding attention maps, \(\sigma\) is a nonlinear activation, \(G_n\) is group normalization, and \(F_h, F_w\) are 1D convolutions. The spatial attention branch uses average pooling and max pooling followed by a convolution to produce a spatial weight \(S\). The final ELSA output is:

$$
\text{ELSA}(x) = Y \times S(x)
$$

This design ensures precise localization of small objects while maintaining low parameter count.
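
The equations above can be traced in a small pure-Python sketch. It is not the authors' implementation: it works on nested lists shaped `[C][H][W]`, replaces group normalization with a per-channel standardization, assumes the spatial branch pools across channels (CBAM-style), and shrinks the spatial convolution to two scalar weights. The names `ela_channel`, `spatial_weight`, `elsa`, and the kernel values are illustrative assumptions.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def conv1d_same(seq, kernel):
    """1D convolution with zero padding; output length equals input length."""
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(seq) + [0.0] * pad
    return [sum(k * padded[i + j] for j, k in enumerate(kernel))
            for i in range(len(seq))]

def normalize(seq, eps=1e-5):
    """Stand-in for G_n (assumption): standardize one channel's 1D descriptor."""
    mean = sum(seq) / len(seq)
    var = sum((v - mean) ** 2 for v in seq) / len(seq)
    return [(v - mean) / math.sqrt(var + eps) for v in seq]

def ela_channel(x_c, kernel):
    """Y = x_c * y_h * y_w for a single channel x_c of shape [H][W]."""
    H, W = len(x_c), len(x_c[0])
    z_h = [sum(row) / W for row in x_c]                              # strip pooling along width
    z_w = [sum(x_c[i][j] for i in range(H)) / H for j in range(W)]   # strip pooling along height
    y_h = [sigmoid(v) for v in normalize(conv1d_same(z_h, kernel))]
    y_w = [sigmoid(v) for v in normalize(conv1d_same(z_w, kernel))]
    return [[x_c[i][j] * y_h[i] * y_w[j] for j in range(W)] for i in range(H)]

def spatial_weight(x, w_avg=1.0, w_max=1.0):
    """S(x): channel-wise average and max maps, mixed by two scalars, then sigmoid."""
    C, H, W = len(x), len(x[0]), len(x[0][0])
    return [[sigmoid(w_avg * sum(x[c][i][j] for c in range(C)) / C
                     + w_max * max(x[c][i][j] for c in range(C)))
             for j in range(W)] for i in range(H)]

def elsa(x, kernel=(0.25, 0.5, 0.25)):
    """ELSA(x) = Y * S(x), applied channel by channel."""
    S = spatial_weight(x)
    out = []
    for x_c in x:
        Y = ela_channel(x_c, kernel)
        out.append([[Y[i][j] * S[i][j] for j in range(len(Y[0]))]
                    for i in range(len(Y))])
    return out

# Toy input with 2 channels, 3 rows, 4 columns.
x = [[[float(c + i * j) for j in range(4)] for i in range(3)] for c in range(2)]
out = elsa(x)
print(len(out), len(out[0]), len(out[0][0]))  # 2 3 4
```

Because every gate is a sigmoid, the output keeps the input shape and each value is scaled by a factor between 0 and 1, which is the re-weighting behaviour the module relies on.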

3.3 SSFF and TFE Modules

The Scale Sequence Feature Fusion (SSFF) module aggregates multi-scale features from different backbone layers. It first applies 1×1 convolutions to unify channels to 256, then uses nearest-neighbor interpolation to resize feature maps to the same spatial size. The feature maps are stacked along a new dimension and processed by a 3D convolution, batch normalization, and SiLU activation:

$$
F_{\text{SSFF}} = \text{Conv3D}(\text{Concat}( \text{resize}(F_1), \text{resize}(F_2), \text{resize}(F_3) ))
$$
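
The resize-and-stack step can be illustrated with a minimal sketch, assuming single-channel maps (so the 1×1 channel unification is skipped) and replacing the learned 3D convolution with a fixed kernel of shape (3, 1, 1) that mixes the three scales at each pixel. Batch normalization is omitted. The function names and the `scale_kernel` weights are hypothetical.

```python
import math

def resize_nearest(fmap, out_h, out_w):
    """Nearest-neighbor resize of a single-channel map [H][W]."""
    in_h, in_w = len(fmap), len(fmap[0])
    return [[fmap[i * in_h // out_h][j * in_w // out_w] for j in range(out_w)]
            for i in range(out_h)]

def silu(v):
    return v / (1.0 + math.exp(-v))

def ssff(f1, f2, f3, scale_kernel=(0.5, 0.3, 0.2)):
    """Resize f2 and f3 to f1's size, stack along a new scale axis, fuse across it."""
    h, w = len(f1), len(f1[0])
    stack = [f1, resize_nearest(f2, h, w), resize_nearest(f3, h, w)]  # shape [3][H][W]
    # Stand-in for Conv3D + SiLU (assumption): weighted sum over the scale axis.
    fused = [[silu(sum(k * stack[s][i][j] for s, k in enumerate(scale_kernel)))
              for j in range(w)] for i in range(h)]
    return stack, fused

f1 = [[1.0] * 4 for _ in range(4)]   # finest map, 4x4
f2 = [[1.0, 2.0], [3.0, 4.0]]        # 2x2
f3 = [[5.0]]                         # coarsest map, 1x1
stack, fused = ssff(f1, f2, f3)
print(stack[1])  # the 2x2 map after resizing to 4x4
```

Stacking along a separate axis, rather than concatenating channels, is what lets a 3D kernel treat scale as a sequence dimension.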

The Triple Feature Encoder (TFE) module preserves local details of small targets. For large feature maps, it applies convolution to reduce channels to 1, followed by max pooling and average pooling. For small feature maps, it uses deconvolution or nearest-neighbor upsampling. All processed maps are then concatenated along the channel dimension:

$$
\text{TFE}(F_{\text{large}}, F_{\text{medium}}, F_{\text{small}}) = \text{Conv}(\text{Concat}(P_{\text{large}}, U_{\text{medium}}, U_{\text{small}}))
$$

where \(P\) denotes pooling operations and \(U\) denotes upsampling. Together, SSFF and TFE enhance the network’s ability to fuse multi-scale information, which is critical for detecting objects of varying sizes in UAV imagery.
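
As a rough illustration of the TFE data flow, the sketch below aligns three single-channel maps to the medium resolution and stacks them as channels. It assumes the max-pooled and average-pooled results are summed, that the medium map is passed through unchanged, and that the small map uses nearest-neighbor upsampling; the channel-reducing and final convolutions are left out. All function names are hypothetical.

```python
def pool2x2(fmap):
    """Downsample [H][W] by 2: sum of 2x2 max pooling and 2x2 average pooling."""
    out = []
    for i in range(0, len(fmap), 2):
        row = []
        for j in range(0, len(fmap[0]), 2):
            window = [fmap[i][j], fmap[i][j + 1], fmap[i + 1][j], fmap[i + 1][j + 1]]
            row.append(max(window) + sum(window) / 4.0)
        out.append(row)
    return out

def upsample2x(fmap):
    """Nearest-neighbor upsampling by a factor of 2."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out

def tfe(large, medium, small):
    """Align all maps to the medium resolution and concatenate along channels."""
    return [pool2x2(large), medium, upsample2x(small)]  # shape [3][H][W]

large = [[1.0, 2.0, 3.0, 4.0],
         [5.0, 6.0, 7.0, 8.0],
         [9.0, 10.0, 11.0, 12.0],
         [13.0, 14.0, 15.0, 16.0]]
medium = [[0.5, 0.5], [0.5, 0.5]]
small = [[7.0]]
encoded = tfe(large, medium, small)
print(encoded[0])  # [[9.5, 13.5], [25.5, 29.5]]
```

Max pooling keeps the strongest response in each window while average pooling keeps its overall level, so combining them is one way to downsample without discarding the local detail that small targets depend on.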

3.4 Small Target Detection Layer

Standard YOLOv8 outputs three scales: 80×80, 40×40, and 20×20. For targets only a few pixels wide, even the finest of these, the 80×80 map, is too coarse, and the 20×20 map loses them almost entirely. We therefore add a detection head at 160×160 resolution that exploits a shallow feature map from the backbone. This head is generated by fusing the early-stage feature (after the first few convolutions) with upsampled deep features via a lightweight C2f block. The new head improves recall for tiny objects (e.g., pedestrians and bicycles) without substantially increasing parameters.
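
The effect of the extra head is easy to see from the grid arithmetic. With a 640×640 input, the four grid sizes correspond to strides of 4, 8, 16, and 32 (derived here as 640 divided by the grid size). The 10-pixel object below is a hypothetical example, not a dataset statistic.

```python
def grid_sizes(input_size, strides):
    """Map each stride to the side length of its detection grid."""
    return {s: input_size // s for s in strides}

def cells_spanned(object_px, stride):
    """How many grid cells an object of the given width covers at this stride."""
    return object_px / stride

grids = grid_sizes(640, (4, 8, 16, 32))
print(grids)  # {4: 160, 8: 80, 16: 40, 32: 20}

# A hypothetical 10-pixel-wide pedestrian at each scale.
for stride in grids:
    print(stride, cells_spanned(10, stride))
```

At stride 32 the object covers less than a third of a single cell, while at stride 4 it spans 2.5 cells, which is why the 160×160 head is better placed to localize it.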

4. Experiments

4.1 Experimental Setup

We evaluate our method on the VisDrone2019 dataset, a widely used benchmark for UAV object detection. The dataset contains 6,471 training images and 548 validation images across 10 classes: pedestrian, people, bicycle, car, van, truck, tricycle, awning-tricycle, bus, and motor. All experiments are conducted on an NVIDIA GeForce RTX 3090 with PyTorch 1.13 and ultralytics 8.0.157. The input size is 640×640, the initial learning rate is 0.01, and Mosaic augmentation is disabled in the last 10 epochs.
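
For readers reproducing the baseline, the stated settings map onto ultralytics training arguments roughly as follows. This is a sketch under assumptions: the dataset config name `VisDrone.yaml`, the model config `yolov8s.yaml`, and the wrapper function `train` are not given in the text, and settings the paper does not state (epochs, batch size, optimizer) are left at library defaults.

```python
# Only the three values stated in the setup are fixed here.
TRAIN_ARGS = {
    "data": "VisDrone.yaml",  # assumed dataset config name
    "imgsz": 640,             # 640x640 input
    "lr0": 0.01,              # initial learning rate
    "close_mosaic": 10,       # disable Mosaic for the last 10 epochs
}

def train(model_cfg="yolov8s.yaml"):
    """Launch training; requires the ultralytics package and the dataset."""
    from ultralytics import YOLO  # imported lazily so the settings can be inspected without it
    return YOLO(model_cfg).train(**TRAIN_ARGS)

print(TRAIN_ARGS)
```

Training EAL-YOLOv8 itself would additionally require a model config describing the modified backbone, neck, and four-scale head, which is not reproduced here.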

4.2 Evaluation Metrics

We use precision (P), recall (R), mAP50 (mean Average Precision at IoU=0.5), and mAP50:95. The definitions are:

$$
P = \frac{TP}{TP+FP}, \quad R = \frac{TP}{TP+FN}
$$
$$
AP = \int_0^1 p(r) dr, \quad mAP = \frac{1}{C}\sum_{i=1}^{C} AP_i
$$

where \(C\) is the number of classes.
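
A minimal sketch of these definitions is given below. The integral is approximated with VOC-style all-point interpolation, which is an assumption: the text only gives the integral form, and evaluation toolkits differ in how they discretize it. The counts and curve points in the example are made up for illustration.

```python
def precision_recall(tp, fp, fn):
    """P = TP / (TP + FP), R = TP / (TP + FN), guarding against empty denominators."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return p, r

def average_precision(recalls, precisions):
    """Area under the precision-recall curve (recalls must be sorted ascending)."""
    r = [0.0] + list(recalls) + [1.0]
    p = [0.0] + list(precisions) + [0.0]
    # Precision envelope: make precision non-increasing as recall grows.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    return sum((r[i + 1] - r[i]) * p[i + 1] for i in range(len(r) - 1))

def mean_average_precision(ap_per_class):
    """mAP = (1 / C) * sum of the per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)

p, r = precision_recall(tp=8, fp=2, fn=12)
ap = average_precision([0.2, 0.4, 0.6], [1.0, 0.5, 0.6])
print(p, r, round(ap, 2))  # 0.8 0.4 0.44
```

mAP50 applies this procedure with detections matched at IoU ≥ 0.5, and mAP50:95 averages the result over IoU thresholds from 0.5 to 0.95.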

4.3 Ablation Study

We conduct ablation experiments to verify each component. Table 1 summarizes the results. Adding only ELSA improves mAP50 by 1.1 percentage points at the cost of roughly 0.8M additional parameters. Adding SSFF and TFE on top of ELSA yields a further 1.2 points (2.3 points over the baseline). The full EAL-YOLOv8 improves mAP50 by 5.9 points over the baseline while reducing parameters by 11.4%.

Table 1: Ablation study on VisDrone2019 val set.

| Model | P (%) | R (%) | mAP50 (%) | mAP50:95 (%) | Params |
|---|---|---|---|---|---|
| YOLOv8s (baseline) | 49.5 | 36.9 | 37.8 | 22.5 | 11,129,454 |
| +ELSA | 50.4 | 38.2 | 38.9 | 23.4 | 11,918,033 |
| +ELSA+ASF (SSFF+TFE) | 50.3 | 39.3 | 40.1 | 24.0 | 12,091,985 |
| EAL-YOLOv8 (full) | 53.0 | 41.9 | 43.7 | 26.3 | 9,858,331 |
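
The headline figures follow directly from the Table 1 entries. The short check below recomputes the mAP50 gain (in percentage points) and the relative parameter change for each configuration; the dictionary layout and function name are illustrative.

```python
# (mAP50 in %, parameter count) copied from Table 1.
TABLE_1 = {
    "YOLOv8s (baseline)": (37.8, 11_129_454),
    "+ELSA": (38.9, 11_918_033),
    "+ELSA+ASF (SSFF+TFE)": (40.1, 12_091_985),
    "EAL-YOLOv8 (full)": (43.7, 9_858_331),
}

def compare_to_baseline(table, baseline="YOLOv8s (baseline)"):
    """Return {model: (mAP50 gain in points, parameter change in %)}."""
    base_map, base_params = table[baseline]
    return {
        name: (round(m - base_map, 1),
               round(100.0 * (n - base_params) / base_params, 1))
        for name, (m, n) in table.items() if name != baseline
    }

deltas = compare_to_baseline(TABLE_1)
for name, (gain, param_change) in deltas.items():
    print(f"{name}: {gain:+.1f} points mAP50, {param_change:+.1f}% params")
```

The full model is the only configuration that gains accuracy while ending up smaller than the baseline: +5.9 points mAP50 with 11.4% fewer parameters.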

4.4 Comparison with State-of-the-Art

We compare EAL-YOLOv8 with several recent small-target detectors on VisDrone2019. Results are shown in Table 2. Our method achieves the best balance between accuracy and model size.

Table 2: Comparison with state-of-the-art methods on VisDrone2019.

| Method | mAP50 (%) | Params (M) |
|---|---|---|
| YOLOv5s | 34.2 | 7.3 |
| YOLOv7-tiny | 35.6 | 12.0 |
| SSG-YOLOv7 | 36.9 | 11.2 |
| MFF-YOLOv7 | 38.4 | 13.5 |
| YOLOv8s (baseline) | 37.8 | 11.1 |
| EAL-YOLOv8 (ours) | 43.7 | 9.9 |

4.5 Per-Class Performance

Table 3 presents the AP per class. Our method shows gains of more than 5 points in challenging categories such as “pedestrian”, “bicycle”, and “motor”, which are typical small targets in UAV datasets.

Table 3: Per-class AP50 (%) on VisDrone2019 val set.

| Class | YOLOv8s | EAL-YOLOv8 | Improvement |
|---|---|---|---|
| pedestrian | 32.1 | 38.5 | +6.4 |
| people | 24.3 | 29.1 | +4.8 |
| bicycle | 15.6 | 20.9 | +5.3 |
| car | 55.2 | 59.8 | +4.6 |
| van | 42.7 | 47.3 | +4.6 |
| truck | 36.4 | 40.2 | +3.8 |
| tricycle | 28.9 | 34.1 | +5.2 |
| awning-tricycle | 20.1 | 24.8 | +4.7 |
| bus | 52.3 | 57.6 | +5.3 |
| motor | 29.8 | 35.4 | +5.6 |

4.6 Loss and Convergence

Figure 2 (not shown) illustrates the training loss curves. EAL-YOLOv8 converges faster and achieves a lower final loss than YOLOv8s, indicating better feature learning. The loss function used is CIoU combined with classification loss.
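
For reference, the CIoU term mentioned above can be written out as a short function. This follows the standard CIoU definition (IoU minus a normalized center-distance penalty minus an aspect-ratio penalty) rather than any code from the paper; the `(x1, y1, x2, y2)` box format and the function name are assumptions.

```python
import math

def ciou_loss(box_a, box_b, eps=1e-9):
    """Return 1 - CIoU for two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Intersection over union.
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    wa, ha = ax2 - ax1, ay2 - ay1
    wb, hb = bx2 - bx1, by2 - by1
    iou = inter / (wa * ha + wb * hb - inter + eps)
    # Squared diagonal of the smallest box enclosing both.
    cw = max(ax2, bx2) - min(ax1, bx1)
    ch = max(ay2, by2) - min(ay1, by1)
    c2 = cw ** 2 + ch ** 2 + eps
    # Squared distance between the two box centers.
    rho2 = ((ax1 + ax2 - bx1 - bx2) ** 2 + (ay1 + ay2 - by1 - by2) ** 2) / 4.0
    # Aspect-ratio consistency term and its trade-off weight.
    v = (4.0 / math.pi ** 2) * (math.atan(wb / (hb + eps)) - math.atan(wa / (ha + eps))) ** 2
    alpha = v / (1.0 - iou + v + eps)
    return 1.0 - (iou - rho2 / c2 - alpha * v)

same = ciou_loss((0.0, 0.0, 10.0, 10.0), (0.0, 0.0, 10.0, 10.0))
apart = ciou_loss((0.0, 0.0, 10.0, 10.0), (20.0, 20.0, 30.0, 30.0))
print(same < 1e-6, apart > 1.0)  # True True
```

Unlike plain IoU, the center-distance term still produces a useful gradient when the predicted and ground-truth boxes do not overlap, which matters for small targets where a shift of a few pixels removes all overlap.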

5. Conclusion

In this paper, we presented EAL-YOLOv8, an improved small target detection algorithm tailored for UAV applications. By incorporating the ELSA attention module, the SSFF and TFE modules, and an additional small target detection layer, our method enhances detection accuracy while reducing model parameters. Experiments on the VisDrone2019 benchmark demonstrate a 5.9 percentage point mAP50 improvement over YOLOv8s with an 11.4% reduction in parameters. Future work will explore deploying EAL-YOLOv8 on embedded UAV platforms and extending it to multi-task learning for real-time scene understanding.

Acknowledgment: This work was supported by the Sichuan Provincial Science and Technology Department (Grant No. 2022JDR0043) and the Key Laboratory of Numerical Simulation of Sichuan Province (Grant No. KLNS-2023SZFZ002).
