Enhancing China UAV Drone Vision: A YOLO-GELAN Framework for Aerial Object Detection

The proliferation of UAV technology has transformed data acquisition across numerous sectors, including precision agriculture, urban planning, environmental monitoring, and disaster response. Aerial imagery captured by UAV platforms offers broad spatial coverage and fine detail. However, this advantage introduces significant challenges for automated analysis, particularly in object detection. Targets in UAV imagery often exhibit extreme scale variation, appear as small, densely packed objects, and are set against complex, cluttered backgrounds. Together, these four factors (scale variance, small target size, dense distribution, and background noise) lead to high rates of missed detections and false positives, which limits the practical utility of automated systems in UAV applications.

Deep learning-based object detectors dominate the field and fall into two main categories: two-stage and one-stage architectures. Two-stage methods, such as Faster R-CNN, first generate region proposals and then classify them, offering high accuracy at the cost of computational speed. One-stage detectors, such as the YOLO (You Only Look Once) series and SSD, perform classification and bounding box regression in a single network pass, providing a favorable balance between speed and accuracy, which is crucial for real-time processing in UAV systems. Among these, the YOLO family has been widely adopted for embedded and edge applications. While YOLOv8 is a robust baseline with an improved architecture over its predecessors, its performance on challenging UAV datasets remains suboptimal, especially for small objects, whose feature information is easily lost as it passes through successive downsampling convolutions.

Recent research has focused on adapting these models for aerial perspectives. Efforts include designing specialized modules for feature fusion, integrating attention mechanisms to focus on salient regions, employing lightweight backbones for efficiency, and developing training strategies tailored for aerial data. For instance, some works have introduced modified feature pyramid networks or attention modules into YOLO frameworks to better handle multi-scale objects common in China UAV drone footage. Others have explored anchor-free designs or advanced loss functions to improve localization accuracy. Despite these advancements, a holistic solution that concurrently addresses parameter efficiency, enhanced small-object feature extraction, and robustness to low-quality training examples is still sought for deployment on resource-constrained China UAV drone hardware.

This paper proposes YOLO-GELAN, an enhanced object detection framework based on YOLOv8s, specifically designed to tackle the challenges presented by UAV aerial imagery. Our contributions are threefold: (1) We incorporate the RepNCSPELAN module from the Generalized Efficient Layer Aggregation Network (GELAN), replacing the standard C2f modules to strengthen feature extraction while significantly reducing the parameter count. (2) We structurally modify the neck and head of the network by introducing an additional detection layer dedicated to small targets and removing the layer responsible for very large targets, thereby increasing the spatial resolution available for small object analysis. (3) We replace the default localization loss with a Wise-IoU variant, which dynamically adjusts the gradient gain based on anchor quality, mitigating the negative impact of poor-quality examples and leading to more stable training and improved generalization. Comprehensive experiments on the demanding VisDrone2019 dataset demonstrate that YOLO-GELAN achieves higher detection accuracy, particularly for small objects, than the baseline YOLOv8s and other contemporary models, while maintaining a parameter footprint suitable for UAV applications.

Methodology

The proposed YOLO-GELAN framework is built upon the YOLOv8s architecture, chosen for its effective trade-off between accuracy and speed. We introduce targeted modifications to its backbone, neck, and loss function to optimize performance for China UAV drone image analysis.

1. Network Architecture Refinement with GELAN

The original YOLOv8s uses C2f modules in its backbone and neck, which are effective for general feature aggregation. To represent the varied and complex patterns in UAV imagery more efficiently, we draw on the Generalized Efficient Layer Aggregation Network (GELAN). GELAN is designed around gradient path planning and combines the strengths of Cross Stage Partial networks (CSPNet) and Efficient Layer Aggregation Networks (ELAN). It optimizes information flow and gradient propagation through cross-layer connections, enabling efficient feature extraction with reduced computational complexity.

We integrate the core building block of GELAN, the RepNCSPELAN module, to replace the standard C2f modules in our model. The RepNCSPELAN module employs a series of convolutional layers and RepNCSP blocks (Cross Stage Partial blocks built on re-parameterizable convolutions) for hierarchical feature fusion. This design promotes the fusion of both local and global contextual information, which is vital for distinguishing small, ambiguous targets from cluttered backgrounds in UAV scenes. Crucially, this architectural change also leads to a more compact model: RepNCSPELAN achieves stronger feature representation with fewer parameters, directly addressing the need for efficient models deployable on UAV platforms. A comparative schematic of the module integration is shown below, highlighting the structural efficiency.

The parameter reduction can be approximated by the difference in the computational graphs. Let the original C2f module’s parameter count be denoted as $P_{C2f}$. The RepNCSPELAN module, through its use of replicated gradient paths and efficient convolution, achieves a parameter count $P_{RepNCSPELAN}$ such that:

$$
\Delta P = P_{C2f} - P_{RepNCSPELAN} > 0
$$

This reduction $\Delta P$ contributes significantly to the overall model lightweighting, a key concern for China UAV drone deployment.
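To give $P_{C2f}$ a concrete value, the following is a minimal counting sketch. It assumes the C2f layout used in the Ultralytics code base: a 1×1 input convolution, n bottlenecks of two 3×3 convolutions each, and a 1×1 output convolution, with every convolution bias-free and followed by batch normalization. The helper names are illustrative. $P_{RepNCSPELAN}$ is not computed here because it depends on internal channel settings that the text does not specify.

```python
def conv_bn_params(c_in: int, c_out: int, k: int) -> int:
    """Learnable parameters of a bias-free k x k convolution followed by BatchNorm."""
    return c_in * c_out * k * k + 2 * c_out  # conv weights + BN scale and shift


def c2f_params(c_in: int, c_out: int, n: int) -> int:
    """Parameter count of a C2f block, assuming a hidden width of c_out // 2."""
    c = c_out // 2
    cv1 = conv_bn_params(c_in, 2 * c, 1)           # 1x1 conv whose output is split in two
    bottlenecks = n * 2 * conv_bn_params(c, c, 3)  # n bottlenecks, two 3x3 convs each
    cv2 = conv_bn_params((2 + n) * c, c_out, 1)    # 1x1 conv over the concatenated branches
    return cv1 + bottlenecks + cv2


if __name__ == "__main__":
    # Hypothetical example: a 64-channel block with a single bottleneck.
    print(c2f_params(64, 64, 1))
```

Under these assumptions a 64-channel C2f block with one bottleneck has 29,056 parameters, most of them in the two 3×3 convolutions, which is where a more economical aggregation block can save the most.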

2. Dedicated Small-Target Detection Layer

A fundamental issue in detecting objects from China UAV drone imagery is the loss of semantic information for small targets. In the default YOLOv8s, an input image of $640 \times 640$ pixels is processed to produce feature maps of sizes $80 \times 80$, $40 \times 40$, and $20 \times 20$. The $20 \times 20$ feature map, responsible for detecting larger objects, provides a very coarse grid for small targets. Each cell in this grid corresponds to a $32 \times 32$ pixel region in the input image, making it highly likely that small targets are obscured by dominant background features within the same cell.

To amplify the network’s sensitivity to small objects, we modify the feature pyramid. We introduce an additional detection branch from an earlier, higher-resolution stage of the backbone. Specifically, we branch from the first C2f module, yielding a feature map of size $160 \times 160 \times 64$. This map is then integrated into the neck via upsampling and concatenation operations. Concurrently, we remove the original detection path connected to the deepest, lowest-resolution feature map ($20 \times 20$). The final detection heads in our YOLO-GELAN now operate on feature maps of sizes $160 \times 160$, $80 \times 80$, and $40 \times 40$.

This modification has a direct impact on the detection granularity. The new $160 \times 160$ detection head divides the input image into a $160 \times 160$ grid, where each cell is only $4 \times 4$ pixels. This fine-grained grid dramatically increases the likelihood that a small target from a China UAV drone image occupies a dedicated cell, minimizing background interference and allowing the network to learn more discriminative features for small objects. The change effectively re-allocates model capacity from detecting very large objects (less frequent in many China UAV drone scenarios) to the more critical and challenging task of small object detection.
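The cell sizes quoted above follow directly from dividing the input size by the grid size. The short sketch below reproduces that arithmetic for the baseline and modified heads; the function names and the 12-pixel example object are illustrative assumptions, not values from the experiments.

```python
def grid_cell_size(input_size: int, grid_size: int) -> int:
    """Side length, in input pixels, covered by one cell of a grid_size x grid_size map."""
    if input_size % grid_size != 0:
        raise ValueError("grid size must divide the input size")
    return input_size // grid_size


def cells_spanned(object_size: float, cell: int) -> float:
    """Number of cells an object of the given side length spans along one axis."""
    return object_size / cell


INPUT_SIZE = 640
BASELINE_GRIDS = [80, 40, 20]    # default YOLOv8s detection heads
MODIFIED_GRIDS = [160, 80, 40]   # heads after adding 160x160 and removing 20x20

if __name__ == "__main__":
    for name, grids in [("baseline", BASELINE_GRIDS), ("modified", MODIFIED_GRIDS)]:
        print(name, [grid_cell_size(INPUT_SIZE, g) for g in grids])
    # Hypothetical 12-pixel-wide target on the finest and coarsest grids.
    print(cells_spanned(12, grid_cell_size(INPUT_SIZE, 160)))
    print(cells_spanned(12, grid_cell_size(INPUT_SIZE, 20)))
```

A 12-pixel target spans three cells on the $160 \times 160$ grid but less than half a cell on the $20 \times 20$ grid, which illustrates why the coarse head contributes little to small-object detection.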

3. Dynamic Localization with Wise-IoU Loss

Bounding box regression is critical for accurate object localization. YOLOv8 uses a combination of Distribution Focal Loss (DFL) and Complete IoU (CIoU) loss. CIoU considers the overlap area, center-point distance, and aspect ratio similarity between predicted and ground-truth boxes. However, it treats all training samples equally, making the model susceptible to low-quality or anomalous examples that are prevalent in real-world China UAV drone datasets (e.g., ambiguous, partially occluded, or poorly annotated objects). These “outlier” samples can generate excessively large gradients during training, destabilizing the optimization process and hurting final performance.

We adopt Wise-IoU version 3 (WIoUv3) as our bounding box regression loss. WIoUv3 introduces a dynamic, non-monotonic gradient gain assignment strategy based on an “outlierness” measure. The core idea is to assign smaller gradient gains to anchor boxes identified as outliers (low-quality matches), preventing them from dominating the gradient updates. The outlierness $\beta$ is defined relative to the moving average of IoU losses.

The WIoUv3 loss is constructed as follows. First, a baseline WIoUv1 loss $L_{WIoUv1}$ is computed, which includes a distance penalty term $R_{WIoU}$:

$$
\begin{aligned}
L_{WIoUv1} &= L_{IoU} \cdot R_{WIoU} \\
R_{WIoU} &= \exp\left(\frac{(x - x_{gt})^2 + (y - y_{gt})^2}{(W_g)^2 + (H_g)^2}\right)
\end{aligned}
$$

where $(x, y)$ and $(x_{gt}, y_{gt})$ are the predicted and ground-truth center coordinates, and $W_g, H_g$ are the dimensions of the smallest enclosing box. Then, the outlierness $\beta$ is calculated:

$$
\beta = \frac{L_{IoU}^*}{\overline{L_{IoU}}} \in [0, +\infty)
$$

Here, $L_{IoU}^*$ is the IoU loss for the current anchor, and $\overline{L_{IoU}}$ is its running mean. A large $\beta$ indicates an outlier. Finally, WIoUv3 applies a modulating factor:

$$
L_{WIoUv3} = r L_{WIoUv1}, \quad \text{where} \quad r = \frac{\beta}{\delta \alpha^{\beta - \delta}}
$$

The hyperparameters $\delta$ and $\alpha$ control the scale and shape of the modulation. The gain $r$ equals 1 at $\beta = \delta$ and decays as $\beta$ grows beyond that point, so high-$\beta$ outliers receive a small gradient gain. It also shrinks as $\beta \to 0$, so anchors that are already well fitted do not dominate the updates. This non-monotonic behavior lets the model focus learning on the ordinary-quality examples common in UAV data, thereby improving overall robustness and accuracy.
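The three steps can be traced for a single predicted box with the minimal sketch below. It uses the gain $r = \beta / (\delta \alpha^{\beta - \delta})$ from the Wise-IoU formulation, plain Python floats instead of tensors, and a running mean passed in as a number. The values $\alpha = 1.9$ and $\delta = 3$ are assumed for illustration, since the text does not state the ones used, and all function names are hypothetical. In a training framework the running mean would be updated with momentum and excluded from back-propagation, which is not shown here.

```python
import math


def iou_xyxy(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def gradient_gain(beta, alpha=1.9, delta=3.0):
    """Non-monotonic gain r = beta / (delta * alpha ** (beta - delta))."""
    return beta / (delta * alpha ** (beta - delta))


def wiou_v3(pred, gt, mean_iou_loss, alpha=1.9, delta=3.0):
    """WIoUv3 loss for one box pair; mean_iou_loss is the running mean of L_IoU (> 0)."""
    l_iou = 1.0 - iou_xyxy(pred, gt)
    # Offset between the predicted and ground-truth box centers.
    dx = (pred[0] + pred[2]) / 2 - (gt[0] + gt[2]) / 2
    dy = (pred[1] + pred[3]) / 2 - (gt[1] + gt[3]) / 2
    # Width and height of the smallest box enclosing both boxes.
    w_g = max(pred[2], gt[2]) - min(pred[0], gt[0])
    h_g = max(pred[3], gt[3]) - min(pred[1], gt[1])
    r_wiou = math.exp((dx * dx + dy * dy) / (w_g ** 2 + h_g ** 2))
    beta = l_iou / mean_iou_loss  # outlierness of this anchor
    return gradient_gain(beta, alpha, delta) * r_wiou * l_iou


if __name__ == "__main__":
    for beta in (0.1, 1.5, 3.0, 10.0):
        print(beta, round(gradient_gain(beta), 3))
    print(wiou_v3((0, 0, 10, 10), (5, 0, 15, 10), mean_iou_loss=0.5))
```

Printing the gain for a few values of $\beta$ shows the intended shape: it is small for very low and very high outlierness and largest for anchors of intermediate quality.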

Experiments and Results

1. Experimental Setup

Dataset: We evaluate our method on the VisDrone2019 dataset, a large-scale benchmark collected from various urban scenarios in China using multiple UAV platforms. It contains 8,629 images split into 6,471 for training, 548 for validation, and 1,610 for testing. The dataset features 10 object categories (e.g., pedestrian, car, van, bus, truck) with a high proportion of small and medium-sized instances, making it representative of the challenges in real-world UAV target detection.

Implementation Details: All experiments were conducted on a system with an NVIDIA GeForce RTX 3090 GPU. We used PyTorch 1.13.1 and the Ultralytics YOLOv8 framework (version 8.1.6). The input image size was set to $640 \times 640$. Models were trained for 200 epochs with a batch size of 16. Other hyperparameters (optimizer, learning rate schedule, etc.) followed the default YOLOv8 training settings to ensure a fair comparison. The performance is measured using standard metrics: Precision (P), Recall (R), and mean Average Precision at an IoU threshold of 0.5 (mAP@50). We also report the number of parameters (Params) and inference speed in Frames Per Second (FPS).
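For reference, the reported metrics follow the usual definitions, sketched here with $TP$, $FP$, and $FN$ denoting true positives, false positives, and false negatives at an IoU threshold of 0.5, $p_c(r)$ the precision-recall curve of class $c$, and $N$ the number of classes (10 for VisDrone2019). These symbols are introduced for illustration and are not used elsewhere in the text.

$$
\begin{aligned}
P &= \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN} \\
AP_c &= \int_0^1 p_c(r)\, dr, \qquad \text{mAP@50} = \frac{1}{N} \sum_{c=1}^{N} AP_c
\end{aligned}
$$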

2. Ablation Study

To validate the contribution of each proposed component, we conduct a stepwise ablation study on the VisDrone2019 validation set. The baseline model (A) is the standard YOLOv8s.

Model	Component	Precision (%)	Recall (%)	mAP@50 (%)	Params (M)	FPS
A	YOLOv8s (Baseline)	49.0	37.4	38.2	11.23	123
B	A + RepNCSPELAN	52.0	40.6	41.4	8.56	92
C	B + Small-Target Layer	56.4	44.0	46.2	7.77	114
D (Ours)	C + Wise-IoUv3 Loss	55.7	45.3	47.0	7.77	119

Analysis: Introducing the RepNCSPELAN module (Model B) reduces parameters by 23.8% (from 11.23M to 8.56M) while improving mAP@50 by 3.2 points, demonstrating its effectiveness in efficient feature learning for UAV images, although inference speed drops from 123 to 92 FPS. Adding the small-target detection layer (Model C) brings a further gain of 4.8 mAP points, confirming its critical role in capturing fine-grained details, together with a further parameter reduction to 7.77M and a recovery in inference speed to 114 FPS. Finally, applying the Wise-IoUv3 loss (Model D, our full YOLO-GELAN) raises recall and mAP@50 to their peak values of 45.3% and 47.0%, respectively, at the cost of a 0.7-point drop in precision relative to Model C. The final model achieves a 30.8% parameter reduction from the baseline and an 8.8-point mAP@50 improvement while maintaining a real-time inference speed of 119 FPS.
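The percentages and point gains quoted in this analysis can be re-derived from the ablation table with a few lines of arithmetic. The sketch below copies the mAP@50 and parameter columns from the table; the dictionary layout and function names are illustrative.

```python
# mAP@50 (%) and parameter count (millions), copied from the ablation table.
ABLATION = {
    "A": {"map50": 38.2, "params_m": 11.23},
    "B": {"map50": 41.4, "params_m": 8.56},
    "C": {"map50": 46.2, "params_m": 7.77},
    "D": {"map50": 47.0, "params_m": 7.77},
}


def param_reduction_pct(base: str, model: str) -> float:
    """Relative parameter reduction of `model` with respect to `base`, in percent."""
    p0 = ABLATION[base]["params_m"]
    p1 = ABLATION[model]["params_m"]
    return round((p0 - p1) / p0 * 100, 1)


def map_gain(base: str, model: str) -> float:
    """Absolute mAP@50 gain of `model` over `base`, in percentage points."""
    return round(ABLATION[model]["map50"] - ABLATION[base]["map50"], 1)


if __name__ == "__main__":
    print("B vs A:", param_reduction_pct("A", "B"), "% fewer params,", map_gain("A", "B"), "points")
    print("C vs B:", map_gain("B", "C"), "points")
    print("D vs A:", param_reduction_pct("A", "D"), "% fewer params,", map_gain("A", "D"), "points")
```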

3. Comparison with State-of-the-Art Methods

We compare YOLO-GELAN against several prominent YOLO-series models and recent specialized works on the VisDrone2019 test set. The results underscore the superiority of our approach for China UAV drone target detection.

Model	Precision (%)	Recall (%)	mAP@50 (%)	Params (M)	FPS
YOLOv5s	42.0	31.9	30.9	7.0	88
YOLOv7-tiny	45.9	34.4	32.7	6.0	151
YOLOv8s (Baseline)	49.0	37.4	38.2	11.23	123
YOLOv8m	52.8	40.3	41.7	25.86	119
GELAN (Original)	55.5	44.7	45.9	31.4	99
YOLOv9-c	57.2	43.5	45.8	51.0	89
Related Work [Based on YOLOv8]	51.6	39.6	40.7	6.5	135
YOLO-GELAN (Ours)	55.7	45.3	47.0	7.77	119

Analysis: Our YOLO-GELAN achieves the highest mAP@50 (47.0%) and recall (45.3%) among all compared models, and its precision (55.7%) is second only to that of YOLOv9-c (57.2%). In mAP@50 it outperforms the much larger YOLOv8m as well as the recently published YOLOv9-c and the original GELAN network. Notably, it surpasses another YOLOv8-based improvement from related work by 6.3 mAP points. This indicates that our specific combination of architectural efficiency (RepNCSPELAN), spatial resolution enhancement for small targets, and robust loss formulation is highly effective for the UAV domain. At 7.77M parameters, the model is roughly four times smaller than GELAN (31.4M) and more than six times smaller than YOLOv9-c (51.0M), making it better suited for deployment, while its speed of 119 FPS remains competitive for real-time UAV processing.

4. Qualitative Results and Analysis

Visual comparisons on the VisDrone2019 test set illustrate the practical advantages of YOLO-GELAN. Across diverse scenarios, including sparse traffic, dense crowds, nighttime operations, and long-range capture, our model demonstrates consistently better detection capability. It identifies small, distant vehicles and pedestrians that the baseline YOLOv8s misses, reduces false positives on background clutter, and shows greater robustness under low-light conditions. These visual results align with the quantitative metrics, confirming that YOLO-GELAN provides more reliable and comprehensive perception for autonomous UAV systems operating in complex environments.

Discussion and Conclusion

This paper presented YOLO-GELAN, a high-performance object detection network optimized for the unique demands of China UAV drone aerial imagery. The integration of the efficient RepNCSPELAN module enhanced feature representation while cutting model parameters. The strategic addition of a high-resolution, small-target detection layer directly addressed the core challenge of information loss for tiny objects in China UAV drone views. The adoption of the Wise-IoUv3 loss function stabilized training and improved generalization by down-weighting the influence of outlier samples. Together, these innovations yielded a model that significantly outperforms the baseline and other state-of-the-art detectors on the challenging VisDrone2019 benchmark, particularly in terms of recall and accuracy for small objects.

The success of YOLO-GELAN underscores the importance of task-specific adaptations for China UAV drone vision systems. Future work will explore further model lightweighting techniques, such as neural architecture search or quantization, to enable deployment on even more constrained China UAV drone edge devices. Investigating adaptive inference strategies that adjust model complexity based on scene difficulty could also optimize the speed-accuracy trade-off dynamically during China UAV drone flight. Extending this framework to other aerial vision tasks, like instance segmentation or object tracking, presents another promising research direction to fully leverage the perceptual capabilities of modern China UAV drone platforms.