With the rapid evolution of drone technology and breakthroughs in computer vision, small target detection has become a critical research focus. In practical applications such as traffic monitoring, disaster assessment, and intelligent surveillance, images captured by drone platforms often contain tiny objects that occupy only a few pixels. This causes three problems: feature sparsity, frequent missed detections, and low detection accuracy. To address these issues, we propose an improved algorithm named EAL-YOLOv8, which integrates three key components: an ELSA attention module, the scale-sequence feature fusion (SSFF) and triple feature encoder (TFE) modules from ASF-YOLO, and a dedicated small target detection layer. Experiments on the VisDrone2019 dataset (a benchmark of aerial imagery captured by drones in China) show that our method increases the mean average precision at IoU = 0.5 (mAP50) by 5.9 percentage points while reducing the parameter count by 11.4% compared to the baseline YOLOv8s. This work provides an efficient solution for small object detection in complex drone scenarios.
Introduction
Small target detection in drone scenes plays a vital role in many real-world tasks, including pedestrian detection, vehicle tracking, and infrastructure inspection. However, small objects cover only a tiny fraction of the image's pixels, which makes them difficult to distinguish from complex backgrounds. Deep learning detectors, both two-stage (e.g., Faster R-CNN) and one-stage (e.g., the YOLO series), often extract too few features for small targets. Among them, YOLOv8 has gained popularity due to its balance between speed and accuracy, yet its performance on small objects remains limited. To overcome these limitations, we propose an enhanced version named EAL-YOLOv8, specifically tailored for drone imagery.
Proposed Method
1. ELSA Attention Module
We design the Enhanced Local and Spatial Attention (ELSA) module, which combines the strengths of Efficient Local Attention (ELA) and spatial attention. ELA uses strip pooling and 1D convolutions to capture positional information without dimensionality reduction, while spatial attention focuses on discriminative regions. The ELSA output is computed as follows:
$$ y_h = \sigma(G_n(F_h(z_h))) $$
$$ y_w = \sigma(G_n(F_w(z_w))) $$
$$ Y = x_c \times y_h \times y_w $$
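where \( x_c \) is the input feature map of channel \( c \), \( z_h \) and \( z_w \) are its strip-pooled features along the horizontal and vertical directions, \( y_h \) and \( y_w \) are the corresponding horizontal and vertical attention maps, \( \sigma \) is a nonlinear activation, \( G_n \) denotes group normalization, and \( F_h, F_w \) represent 1D convolutions. By integrating both mechanisms, ELSA localizes small targets more precisely at a modest cost: as the table below shows, it adds about 0.8M parameters to the baseline for a 1.1-point gain in mAP50.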
| Attention Module | Extra Parameters | mAP50 Gain (percentage points) |
|---|---|---|
| None (Baseline YOLOv8s) | 0 | 0 |
| SE | +0.5M | +0.8 |
| CBAM | +0.9M | +1.2 |
| ELSA (ours) | +0.8M | +1.1 |
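The directional attention equations above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it covers only the strip-pooling branch (the spatial-attention branch of ELSA is not specified by the equations), it assumes that \( \sigma \) is a sigmoid, and it uses a single fixed, symmetric 1D kernel shared by all channels in place of learned convolution weights. The function names and the `num_groups` value are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv1d_same(z, kernel):
    """Per-channel 1D convolution with zero padding; z has shape (C, L)."""
    return np.stack([np.convolve(row, kernel, mode="same") for row in z])

def group_norm(z, num_groups, eps=1e-5):
    """Group normalization over a (C, L) array, without learned affine terms."""
    c, length = z.shape
    g = z.reshape(num_groups, (c // num_groups) * length)
    g = (g - g.mean(axis=1, keepdims=True)) / np.sqrt(g.var(axis=1, keepdims=True) + eps)
    return g.reshape(c, length)

def ela_attention(x, kernel, num_groups=2):
    """Compute Y = x_c * y_h * y_w for a feature map x of shape (C, H, W)."""
    z_h = x.mean(axis=2)  # strip pooling across the width  -> (C, H)
    z_w = x.mean(axis=1)  # strip pooling across the height -> (C, W)
    y_h = sigmoid(group_norm(conv1d_same(z_h, kernel), num_groups))
    y_w = sigmoid(group_norm(conv1d_same(z_w, kernel), num_groups))
    # Broadcast the two 1D attention maps back over the full feature map.
    return x * y_h[:, :, None] * y_w[:, None, :]

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8, 6))      # assumed toy input: C=4, H=8, W=6
kernel = np.array([0.25, 0.5, 0.25])    # assumed fixed kernel, not learned
y = ela_attention(x, kernel)
```

Because both attention maps lie in (0, 1), the output has the same shape as the input and never exceeds it in magnitude; the pooled channel dimension is never reduced, which is the property the text attributes to ELA.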
2. SSFF and TFE Modules from ASF-YOLO
To enhance multi-scale feature fusion, we incorporate the Scale-Sequence Feature Fusion (SSFF) module and the Triple Feature Encoder (TFE) module originally proposed in ASF-YOLO. The SSFF module aggregates features from different scales using a 2D Gaussian filter and 3D convolution. The Gaussian smoothing operation convolves the input map \( f(w,h) \) with a 2D Gaussian kernel \( G_\sigma \) of standard deviation \( \sigma \), where \( * \) denotes 2D convolution:
$$ F_\sigma(w,h) = G_\sigma(w,h) * f(w,h) $$
$$ G_\sigma(w,h) = \frac{1}{2\pi\sigma^2} e^{-\frac{w^2+h^2}{2\sigma^2}} $$
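As a rough illustration of the smoothing step, the sketch below samples the Gaussian on a small grid and applies it to a toy map. The kernel size, the value of \( \sigma \), the renormalisation of the discrete kernel, and the edge padding are all assumptions made for the example; the text does not specify them.

```python
import numpy as np

def gaussian_kernel(size, sigma):
    """Sample G_sigma(w, h) on an odd size x size grid centred at the origin."""
    half = size // 2
    coords = np.arange(-half, half + 1)
    w, h = np.meshgrid(coords, coords)
    g = np.exp(-(w**2 + h**2) / (2.0 * sigma**2)) / (2.0 * np.pi * sigma**2)
    return g / g.sum()  # renormalise so the discrete kernel sums to 1

def smooth(f, kernel):
    """Filter a 2D map f with a symmetric kernel, edge-padded to keep its size."""
    k = kernel.shape[0]
    pad = k // 2
    padded = np.pad(f, pad, mode="edge")
    out = np.zeros(f.shape, dtype=float)
    for i in range(f.shape[0]):
        for j in range(f.shape[1]):
            out[i, j] = np.sum(padded[i:i + k, j:j + k] * kernel)
    return out

f = np.zeros((7, 7))
f[3, 3] = 1.0                      # unit impulse in the centre of a toy map
kernel = gaussian_kernel(size=5, sigma=1.0)
smoothed = smooth(f, kernel)       # the impulse response reproduces the kernel
```

Smoothing a unit impulse returns the kernel itself, which is a quick way to check that the filter preserves total intensity while spreading it over neighbouring positions.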
The TFE module further refines the fused features by applying up-sampling and down-sampling to align spatial resolutions, followed by channel-wise concatenation. This combination allows the network to preserve both global context and local details, which is crucial for detecting tiny objects in drone images.
| Configuration | mAP50 (%) | mAP50:95 (%) |
|---|---|---|
| YOLOv8s (baseline) | 37.8 | 22.5 |
| + SSFF only | 39.2 | 23.7 |
| + TFE only | 39.0 | 23.5 |
| + SSFF + TFE | 40.1 | 24.0 |
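The alignment that TFE performs (resampling three scales to a common resolution, then concatenating along the channel axis) can be sketched as follows. This is a simplified stand-in rather than the ASF-YOLO module: it assumes 2×2 average pooling for down-sampling and nearest-neighbour repetition for up-sampling, omits all convolutions, and uses hypothetical function names and channel counts.

```python
import numpy as np

def downsample2x(x):
    """Halve H and W of a (C, H, W) map with 2x2 average pooling (H, W even)."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))

def upsample2x(x):
    """Double H and W of a (C, H, W) map by nearest-neighbour repetition."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def align_and_concat(large, medium, small):
    """Bring three scales to the medium resolution and concatenate channels."""
    return np.concatenate([downsample2x(large), medium, upsample2x(small)], axis=0)

large = np.ones((8, 16, 16))    # assumed toy shapes: (channels, height, width)
medium = np.ones((16, 8, 8))
small = np.ones((32, 4, 4))
fused = align_and_concat(large, medium, small)
```

The fused map keeps the medium resolution and carries the channels of all three inputs, so detail from the large map and context from the small map sit side by side at every spatial position.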
3. Small Target Detection Layer
For a 640×640 input, standard YOLOv8 outputs feature maps at scales 80×80, 40×40, and 20×20. For small targets (e.g., pedestrians or bicycles in drone footage), the coarse 20×20 layer has insufficient resolution. We add a detection head at the 160×160 scale, which is derived from shallower layers. This head merges features from the 160×160, 80×80, and 40×40 levels, significantly improving sensitivity to small objects. The architecture modification is summarized below; the final model predicts at 160×160, 80×80, and 40×40:
| Model | Output Scales | Small Target Detection Head? | Parameters |
|---|---|---|---|
| YOLOv8s | 80×80, 40×40, 20×20 | No | 11,129,454 |
| EAL-YOLOv8 | 160×160, 80×80, 40×40 | Yes | 9,858,331 |
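The output scales in the table follow directly from the input size and the stride of each head. The short sketch below assumes a 640×640 input and strides of 8, 16, and 32 for the baseline heads, with stride 4 for the 160×160 head; the 16-pixel object width is an arbitrary example.

```python
def grid_sizes(input_size, strides):
    """Return the side length of each square output grid for the given strides."""
    return [input_size // s for s in strides]

def cells_spanned(object_px, stride):
    """Approximate number of grid cells covered by an object of the given width."""
    return object_px / stride

baseline_grids = grid_sizes(640, [8, 16, 32])   # [80, 40, 20]
eal_grids = grid_sizes(640, [4, 8, 16])         # [160, 80, 40]

# A 16-pixel-wide object covers half a cell at stride 32 but four cells at stride 4.
coarse = cells_spanned(16, 32)
fine = cells_spanned(16, 4)
```

This makes the motivation concrete: on the coarsest grid a small object occupies less than one cell, whereas the stride-4 grid resolves it across several cells.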
Experiments
Dataset and Evaluation Metrics
We evaluate our method on the VisDrone2019 dataset, which is a benchmark for drone-based object detection. It contains 6,471 training images and 548 validation images, with 10 categories (pedestrian, people, bicycle, car, van, truck, tricycle, awning-tricycle, bus, motor). Images are high-resolution (2000×1500) and include many small objects (an average of 53 instances per image, most smaller than 32 pixels). We measure performance using precision (P), recall (R), mean average precision at IoU = 0.5 (mAP50), and mAP50:95. With TP, FP, and FN denoting the numbers of true positives, false positives, and false negatives, \( p(r) \) the precision at recall \( r \), and \( N \) the number of categories, the metrics are defined as:
$$ P = \frac{TP}{TP+FP} $$
$$ R = \frac{TP}{TP+FN} $$
$$ AP = \int_0^1 p(r) dr $$
$$ mAP = \frac{1}{N}\sum_{i=1}^N AP_i $$
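The four formulas can be turned into a small pure-Python sketch. How the integral of \( p(r) \) is approximated is an assumption here: the example uses the monotone precision envelope with rectangle summation, and evaluation toolkits differ in their interpolation scheme. The counts and curve points are made-up illustrative values, not results from this work.

```python
def precision(tp, fp):
    """P = TP / (TP + FP); returns 0.0 when there are no detections."""
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn):
    """R = TP / (TP + FN); returns 0.0 when there are no ground-truth objects."""
    return tp / (tp + fn) if tp + fn else 0.0

def average_precision(recalls, precisions):
    """Approximate the area under p(r); recalls must be sorted ascending."""
    r = [0.0] + list(recalls) + [1.0]
    p = [0.0] + list(precisions) + [0.0]
    # Replace each precision by the maximum precision at any higher recall.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Sum the rectangles between consecutive recall values.
    return sum((r[i + 1] - r[i]) * p[i + 1] for i in range(len(r) - 1))

def mean_average_precision(aps):
    """mAP is the mean of the per-category AP values."""
    return sum(aps) / len(aps)

p_example = precision(tp=8, fp=2)     # 8 correct detections out of 10
r_example = recall(tp=8, fn=12)       # 8 of 20 ground-truth objects found
ap_example = average_precision([0.2, 0.4, 0.6], [1.0, 0.8, 0.6])
map_example = mean_average_precision([ap_example, 0.52])
```

For mAP50, a detection counts as a true positive when its IoU with a ground-truth box is at least 0.5; mAP50:95 averages the same quantity over IoU thresholds from 0.5 to 0.95.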
Ablation Study
We conduct a step-by-step ablation to verify each component. The results are listed in the following table. All experiments are performed on an NVIDIA GeForce RTX 3090 with Python 3.8 and CUDA 11.3, using an initial learning rate of 0.01.
| Model | P (%) | R (%) | mAP50 (%) | mAP50:95 (%) | Parameters |
|---|---|---|---|---|---|
| YOLOv8s | 49.5 | 36.9 | 37.8 | 22.5 | 11,129,454 |
| + ELSA | 50.4 | 38.2 | 38.9 | 23.4 | 11,918,033 |
| + ELSA + SSFF + TFE | 50.3 | 39.3 | 40.1 | 24.0 | 12,091,985 |
| EAL-YOLOv8 (all) | 53.0 | 41.9 | 43.7 | 26.3 | 9,858,331 |
The final model achieves a 5.9 percentage point improvement in mAP50 (from 37.8% to 43.7%) and a reduction of 11.4% in parameters. The loss curves also show faster convergence and lower final loss compared to the baseline.
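The two headline figures can be reproduced from the first and last rows of the ablation table. The snippet below is only an arithmetic check; the dictionary layout is an illustrative choice.

```python
baseline = {"map50": 37.8, "params": 11_129_454}   # YOLOv8s row
final = {"map50": 43.7, "params": 9_858_331}       # EAL-YOLOv8 (all) row

# Absolute gain in mAP50, in percentage points.
map50_gain_pp = round(final["map50"] - baseline["map50"], 1)

# Relative reduction in parameter count, in percent.
param_reduction_pct = round(
    100 * (baseline["params"] - final["params"]) / baseline["params"], 1
)
```

The mAP50 figure is an absolute difference between two percentages, so it is stated in percentage points, while the parameter figure is a relative change with respect to the baseline.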

Qualitative Analysis
Visual inspection of the detection results confirms that EAL-YOLOv8 reduces missed detections for small objects such as bicycles, motorcycles, and pedestrians in crowded drone scenes. The additional 160×160 detection head is particularly effective at capturing tiny targets that the baseline overlooked.
Conclusion
In this work, we present EAL-YOLOv8, an improved detector for small targets in drone imagery. By incorporating the ELSA attention module, the SSFF and TFE modules, and a small target detection layer, we significantly boost detection accuracy while decreasing model complexity. Experiments on the VisDrone2019 dataset demonstrate a 5.9 percentage point improvement in mAP50 and an 11.4% parameter reduction over YOLOv8s. Future work will focus on making the model more robust under varying weather and lighting conditions, and on extending it to real-time drone video streams. The proposed algorithm provides a solid foundation for advancing small object detection in autonomous aerial systems.
