In the realm of aerial surveillance and monitoring, unmanned aerial vehicles (UAV drones) have become indispensable due to their flexibility, efficiency, and ability to provide a comprehensive overview of large areas. UAV drone-based applications, such as traffic monitoring and intelligent transportation systems, rely heavily on robust object detection technologies. However, images captured by UAV drones often present significant challenges, particularly in small target detection. Small targets, which span few pixels and occupy only limited image regions, are difficult for conventional convolutional neural networks to detect accurately, especially when lightweight network architectures are employed to meet computational constraints on edge devices. The trade-off between detection accuracy, inference speed, model size, and robustness remains a critical issue in UAV drone vision systems. To address these challenges, I propose a lightweight baseline algorithm named YOLOv8s-LW (You Only Look Once v8-small, Lightweight), which optimizes the YOLOv8s framework through several key modifications tailored for UAV drone small target detection. This algorithm achieves a balance between computational efficiency and detection performance, making it suitable for deployment on resource-constrained UAV drone platforms.
The core contributions of this work include: (1) Replacing the baseline backbone with GD_HGNetV2 (HGNetV2 with Ghost and Depthwise optimizations) to reduce redundancy and enhance end-to-end trainability through parameter sharing and structural pruning. (2) Substituting the standard C2f modules in the neck with C2f_GDC (C2f with Ghost and Dynamic Convolution) and C2f_FB_EMA (C2f with FasterBlock and Efficient Multi-scale Attention) to strengthen multi-scale feature aggregation and generalization via spatial mixing and multi-scale attention weighting. (3) Introducing a novel loss function, FSW-IoU (Focaler–Shape–Wise Intersection over Union), which optimizes small target regression robustness by considering target difficulty, shape constraints, and matching accuracy. (4) Integrating a Dynamic Head (DyHead) to improve small target response and localization consistency in complex backgrounds. Extensive experiments on the VisDrone-2019 dataset demonstrate that YOLOv8s-LW achieves a mAP@0.5 of 38.5%, with a parameter count of 6.4×10⁶, computational cost of 19.4×10⁹ FLOPs, and inference speed of 64 frames per second. Compared to the baseline YOLOv8s, this represents a 43% reduction in parameters, a 33% reduction in computation, and a 0.2-percentage-point improvement in mAP@0.5. Ablation studies confirm that each submodule contributes positively, and their combination yields synergistic effects superior to individual improvements. The proposed algorithm offers a deployable lightweight baseline for UAV drone small target detection, providing engineering support and a transferable structural design paradigm for robust detection in complex scenarios and future multi-task extensions.

The advancement of UAV drone technology has revolutionized aerial imaging, but it also introduces unique difficulties in computer vision tasks. UAV drone-captured images often contain small targets due to high-altitude perspectives, complex backgrounds, occlusions, and dense object distributions. Traditional detection algorithms struggle with these challenges, especially when deployed on edge devices with limited computational power. Lightweight deep learning models are thus essential for real-time UAV drone applications. Among existing approaches, one-stage detectors like the YOLO series are preferred for their speed and simplicity. However, standard YOLOv8s, while efficient, still faces limitations in small target detection due to insufficient feature extraction and fusion mechanisms. Previous works have attempted to enhance YOLO variants through modifications such as multi-scale feature fusion, attention mechanisms, and optimized loss functions, but often at the cost of increased complexity or inadequate lightweighting. My approach systematically addresses these issues by integrating lightweight components, advanced attention, and robust regression techniques, specifically designed for the UAV drone domain.
In this paper, I first review the baseline YOLOv8s architecture and its shortcomings for UAV drone small target detection. Then, I detail the design of YOLOv8s-LW, including the GD_HGNetV2 backbone, the improved C2f modules, the FSW-IoU loss function, and the DyHead integration. The experimental section describes the VisDrone-2019 dataset, evaluation metrics, and implementation settings, followed by ablation studies and comparative analyses with state-of-the-art models. Finally, I conclude with the implications of this work and potential future directions. Throughout, I emphasize applicability to UAV drone systems, keeping the engineering context of each design choice in view.
Methodology: YOLOv8s-LW Algorithm
The YOLOv8s-LW algorithm is built upon the YOLOv8s framework, which consists of a backbone for feature extraction, a neck for multi-scale feature fusion, and a head for detection. My modifications target each of these components to achieve lightweighting and improved performance for UAV drone small target detection. The overall architecture is depicted in Figure 1, but here I describe each component in detail using mathematical formulations and structural insights.
GD_HGNetV2 Backbone Network
The original YOLOv8s backbone employs standard convolutions and pooling operations, which yield high parameter counts and computational complexity. To alleviate this, I propose GD_HGNetV2, a lightweight backbone that integrates Ghost Convolution (Ghost Conv) and HGNetV2 with depthwise separable convolutions (DS Conv). This design reduces redundancy while maintaining feature representation capability, crucial for UAV drone images where small targets require efficient feature extraction.
Ghost Conv operates by generating a set of intrinsic feature maps through conventional convolution, followed by cheap linear operations to produce ghost feature maps. This process significantly cuts down parameters and computations. Formally, given an input feature map $\mathbf{A} \in \mathbb{R}^{C \times H \times W}$, where $C$, $H$, and $W$ are channels, height, and width, respectively, the intrinsic feature maps $\mathbf{Y}'$ are obtained as:
$$\mathbf{Y}' = \mathbf{A} * \mathbf{f}',$$
where $*$ denotes convolution, and $\mathbf{f}'$ is the convolution kernel. Then, for each channel $i$ in $\mathbf{Y}'$, linear transformations $\phi_{i,j}$ (e.g., depthwise convolutions) are applied to generate ghost feature maps $\mathbf{Y}_{i,j}$:
$$\mathbf{Y}_{i,j} = \phi_{i,j}(\mathbf{Y}'_i), \quad \text{for } i=1,\ldots,m, \ j=1,\ldots,s,$$
where $m$ is the number of intrinsic maps, and $s$ is the number of ghost maps per intrinsic map (the identity mapping of $\mathbf{Y}'_i$ is counted among the $\phi_{i,j}$). The final output is the concatenation of all $\mathbf{Y}_{i,j}$, resulting in $n = m \times s$ feature maps. The computational cost ratio between standard convolution and Ghost Conv is approximately $s$, leading to substantial savings.
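To make this saving concrete, the cost ratio can be checked with a few lines of arithmetic. The sketch below is illustrative only: the channel counts, 40×40 output map, $d = 3$ cheap-operation kernel, and $s = 2$ ghost ratio are assumed values, not the exact YOLOv8s-LW configuration.

```python
def standard_conv_flops(c_in, c_out, k, h, w):
    """FLOPs of a standard k x k convolution producing an h x w output map."""
    return c_out * h * w * c_in * k * k

def ghost_conv_flops(c_in, c_out, k, d, s, h, w):
    """Ghost convolution: c_out/s intrinsic maps from a standard conv,
    the remaining maps from cheap d x d depthwise ('linear') operations."""
    m = c_out // s                              # intrinsic maps
    intrinsic = m * h * w * c_in * k * k        # ordinary convolution part
    ghost = (s - 1) * m * h * w * d * d         # cheap linear operations
    return intrinsic + ghost

# Illustrative setting: 256 -> 256 channels, 3x3 kernels, ghost ratio s = 2.
std = standard_conv_flops(256, 256, 3, 40, 40)
ghost = ghost_conv_flops(256, 256, 3, 3, 2, 40, 40)
ratio = std / ghost
# The ratio approaches s (= 2 here) because c_in * k^2 >> d^2.
```

As the comment notes, the ratio is just under $s$, matching the approximation in the text.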
HGNetV2 is a high-performance network that incorporates learnable affine transformations and grouped operations for efficiency. I enhance it by integrating Ghost Conv modules into HG Blocks and using DS Conv layers instead of pooling layers. DS Conv decomposes convolution into depthwise and pointwise steps. For an input with dimensions $D_F \times D_F \times M$, where $M$ is the number of input channels, and a kernel size $D_K$, the depthwise convolution requires $M \times D_K \times D_K$ parameters and the pointwise convolution $M \times N$ parameters for $N$ output channels. The total, $M \times D_K \times D_K + M \times N$, is a fraction $\frac{1}{N} + \frac{1}{D_K^2}$ of the $D_K \times D_K \times M \times N$ parameters of a standard convolution, roughly an eight- to nine-fold reduction for $3 \times 3$ kernels.
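The depthwise-separable parameter count is easy to verify numerically; the channel sizes below are hypothetical:

```python
def standard_conv_params(m, n, dk):
    """Parameters of a standard dk x dk convolution mapping m -> n channels."""
    return dk * dk * m * n

def ds_conv_params(m, n, dk):
    """Depthwise (m * dk * dk) plus pointwise (m * n) parameters."""
    return m * dk * dk + m * n

# Illustrative setting: 128 -> 128 channels, 3x3 kernel.
std = standard_conv_params(128, 128, 3)
ds = ds_conv_params(128, 128, 3)
fraction = ds / std   # algebraically equal to 1/n + 1/dk^2
```

For a $3 \times 3$ kernel and 128 output channels the fraction is about 0.12, i.e., an 8–9× parameter saving.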
The GD_HGNetV2 backbone consists of four stages, each with multiple Ghost_HG Blocks. The stem module, HG Stem, performs initial feature extraction and downsampling. DS Conv layers handle downsampling between stages, enhancing end-to-end trainability and reducing overfitting. Relative to the original YOLOv8s, this backbone substitution reduces the model's parameter count from 11.13×10⁶ to 8.03×10⁶ and FLOPs from 28.5×10⁹ to 22.6×10⁹, while preserving feature quality for UAV drone images.
Improved C2f Modules in the Neck
The neck of YOLOv8s utilizes C2f modules for feature fusion, but they may not adequately handle multi-scale features for small targets. I introduce two variants: C2f_GDC and C2f_FB_EMA, which replace the standard Bottleneck modules in C2f.
C2f_GDC incorporates Ghost Conv and dynamic convolution to reduce computation while maintaining expressiveness. Each Bottleneck is replaced with a Ghost Module, which applies Ghost Conv layers with kernel size 3×3. Dynamic convolution adapts kernel weights based on input, improving efficiency for high-resource scenarios. The structure includes a split operation, multiple Ghost Modules with shortcut connections, and concatenation, followed by a 1×1 convolution for channel adjustment.
C2f_FB_EMA integrates the FasterBlock_EMA module, which combines FasterBlock with Efficient Multi-scale Attention (EMA). FasterBlock uses partial convolution (PConv) for spatial mixing, reducing computational cost. PConv applies a regular convolution to only a subset of the input channels, with computational cost $C_{op} = h \times w \times k^2 \times c_p^2$ and memory access $V_{mem} \approx 2 \times h \times w \times c_p$, where $c_p$ is the number of channels processed, typically much smaller than the total channel count. EMA enhances multi-scale feature representation by grouping channels and applying horizontal and vertical global average pooling to capture contextual information. For a feature map $\mathbf{F} \in \mathbb{R}^{C \times H \times W}$, the pooling outputs are:
$$z_c(H) = \frac{1}{W} \sum_{i=1}^{W} f_c(H, i), \quad z_c(W) = \frac{1}{H} \sum_{j=1}^{H} f_c(j, W),$$
where $f_c$ is the feature at channel $c$. These outputs are then fused via a 1×1 convolution and sigmoid activation to generate attention weights, which reweight the features. This mechanism allows the model to adaptively focus on important regions, beneficial for small targets in UAV drone imagery.
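The directional pooling above can be sketched in a few lines. This is a toy illustration under stated simplifications: it handles a single channel, and the 1×1 convolution that EMA applies before the sigmoid is collapsed to an identity for brevity.

```python
import math

def directional_pool(feature):
    """Horizontal and vertical global average pooling for one channel.

    `feature` is an H x W grid; returns per-row means z(h) (pooled along
    width) and per-column means z(w) (pooled along height)."""
    h, w = len(feature), len(feature[0])
    z_h = [sum(row) / w for row in feature]
    z_w = [sum(feature[i][j] for i in range(h)) / h for j in range(w)]
    return z_h, z_w

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Toy 2x3 channel. In EMA the pooled vectors are fused by a 1x1 conv
# before the sigmoid; that fusion is omitted (identity) in this sketch.
feat = [[1.0, 2.0, 3.0],
        [3.0, 2.0, 1.0]]
z_h, z_w = directional_pool(feat)
weights_h = [sigmoid(z) for z in z_h]
reweighted = [[feat[i][j] * weights_h[i] for j in range(3)] for i in range(2)]
```

The resulting weights rescale each row of the feature map, which is the reweighting step described above.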
In the neck, both C2f_GDC and C2f_FB_EMA are employed to balance parameter efficiency and detection accuracy. C2f_GDC reduces computation, while C2f_FB_EMA improves feature aggregation. Their combination yields optimal results, as shown in ablation studies.
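The PConv cost argument can also be sanity-checked numerically; the map size, channel count, and partial ratio $r = 1/4$ below are illustrative assumptions, not values taken from this model:

```python
def conv_flops(h, w, k, c_in, c_out):
    """FLOPs of a k x k convolution on an h x w map."""
    return h * w * k * k * c_in * c_out

# Partial convolution (PConv) runs a regular k x k conv on only c_p of
# the c channels (identity on the rest), so its cost is h*w*k^2*c_p^2.
h, w, k, c = 40, 40, 3, 256
c_p = c // 4                        # partial ratio r = 1/4 (assumed)
full = conv_flops(h, w, k, c, c)
pconv = conv_flops(h, w, k, c_p, c_p)
ratio = pconv / full                # (c_p / c)^2 = 1/16
```

With a quarter of the channels processed, PConv costs one sixteenth of a full convolution, which is why FasterBlock is cheap at spatial mixing.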
FSW-IoU Loss Function
Loss functions play a critical role in bounding box regression. The default CIoU loss in YOLOv8s may not sufficiently address small targets, which often suffer from shape variations and background clutter. I propose FSW-IoU, a novel loss that integrates ideas from Focaler-IoU, Shape-IoU, and Wise-IoU.
Focaler-IoU adjusts the IoU focus based on sample difficulty. Given the IoU between a predicted box and its ground truth, and lower and upper bounds $d, u \in [0,1]$ with $d < u$, Focaler-IoU is defined as:
$$\text{IoU}_{\text{Focaler}} =
\begin{cases}
0, & \text{IoU} < d \\
\frac{\text{IoU} - d}{u - d}, & d \leq \text{IoU} \leq u \\
1, & \text{IoU} > u
\end{cases},$$
and the loss is $L_{\text{Focaler}} = 1 – \text{IoU}_{\text{Focaler}}$. This emphasizes hard samples, common in UAV drone detection.
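A minimal sketch of the piecewise mapping, with $d = 0$ and $u = 0.95$ as illustrative (not paper-specified) default bounds:

```python
def focaler_iou(iou, d=0.0, u=0.95):
    """Linearly remap IoU onto [0, 1] between the bounds d and u."""
    if iou < d:
        return 0.0
    if iou > u:
        return 1.0
    return (iou - d) / (u - d)

def focaler_loss(iou, d=0.0, u=0.95):
    """Loss is 1 minus the remapped IoU."""
    return 1.0 - focaler_iou(iou, d, u)
```

Samples with IoU above $u$ contribute zero loss, while hard, low-IoU samples retain a large loss, shifting the optimization focus toward them.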
Shape-IoU considers the shape of bounding boxes. For predicted box $B$ and ground truth box $B_{\text{GT}}$, the IoU is $\text{IoU} = \frac{B \cap B_{\text{GT}}}{B \cup B_{\text{GT}}}$. Shape-dependent terms are introduced:
$$w_D = \frac{2 \times (w_{\text{GT}})^S}{(w_{\text{GT}})^S + (h_{\text{GT}})^S}, \quad h_D = \frac{2 \times (h_{\text{GT}})^S}{(w_{\text{GT}})^S + (h_{\text{GT}})^S},$$
where $w_{\text{GT}}$ and $h_{\text{GT}}$ are the width and height of $B_{\text{GT}}$, and $S$ is a scale factor. The shape distance $D_{\text{shape}}$ and shape loss $\Gamma_{\text{shape}}$ are computed as:
$$D_{\text{shape}} = \frac{(x_c – x_{\text{GT}})^2}{c^2} \times w_D + \frac{(y_c – y_{\text{GT}})^2}{c^2} \times h_D,$$
$$\Gamma_{\text{shape}} = \sum_{t \in \{w, h\}} \left(1 – e^{-\zeta_t}\right)^\phi, \quad \zeta_w = \frac{|w – w_{\text{GT}}|}{\max(w, w_{\text{GT}})}, \quad \zeta_h = \frac{|h – h_{\text{GT}}|}{\max(h, h_{\text{GT}})},$$
with $\phi=4$. The Shape-IoU loss is $L_{\text{Shape-IoU}} = 1 – \text{IoU} + 0.5 \times (D_{\text{shape}} + \Gamma_{\text{shape}})$.
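For concreteness, the Shape-IoU loss can be transcribed directly from the formulas above. The sketch assumes axis-aligned boxes in centre-size format and $S = 1$; it is a reading aid, not the production implementation:

```python
import math

def shape_iou_loss(pred, gt, scale=1.0, phi=4):
    """Shape-IoU loss for centre-size boxes (cx, cy, w, h)."""
    (x, y, w, h), (xg, yg, wg, hg) = pred, gt
    # Plain IoU of the two boxes.
    ix = max(0.0, min(x + w / 2, xg + wg / 2) - max(x - w / 2, xg - wg / 2))
    iy = max(0.0, min(y + h / 2, yg + hg / 2) - max(y - h / 2, yg - hg / 2))
    inter = ix * iy
    union = w * h + wg * hg - inter
    iou = inter / union
    # Squared diagonal of the minimal enclosing box (c^2 in the text).
    cw = max(x + w / 2, xg + wg / 2) - min(x - w / 2, xg - wg / 2)
    ch = max(y + h / 2, yg + hg / 2) - min(y - h / 2, yg - hg / 2)
    c2 = cw ** 2 + ch ** 2
    # Shape weights w_D, h_D and shape-weighted centre distance D_shape.
    ww = 2 * wg ** scale / (wg ** scale + hg ** scale)
    hh = 2 * hg ** scale / (wg ** scale + hg ** scale)
    d_shape = ((x - xg) ** 2 / c2) * ww + ((y - yg) ** 2 / c2) * hh
    # Shape discrepancy term Gamma_shape.
    zeta_w = abs(w - wg) / max(w, wg)
    zeta_h = abs(h - hg) / max(h, hg)
    gamma = (1 - math.exp(-zeta_w)) ** phi + (1 - math.exp(-zeta_h)) ** phi
    return 1 - iou + 0.5 * (d_shape + gamma)
```

A perfect match yields zero loss, and any centre offset or shape mismatch increases it, as the regression target requires.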
Wise-IoU incorporates a dynamic focusing mechanism. It defines $L_{\text{IoU}} = 1 - \text{IoU}$ and a distance-based penalty $R_{\text{Wise-IoU}} = \exp\left(\frac{(x - x_{\text{GT}})^2 + (y - y_{\text{GT}})^2}{\left(W_g^2 + H_g^2\right)^{*}}\right)$, where $W_g$ and $H_g$ are the width and height of the minimal enclosing box, and the superscript $*$ indicates that this term is detached from the gradient computation to prevent harmful gradients. The loss is $L_{\text{Wise-IoU}} = L_{\text{IoU}} \times R_{\text{Wise-IoU}}$.
FSW-IoU combines these as:
$$L_{\text{FSW-IoU}} = L_{\text{Wise-IoU}} + \lambda \times (1 – L_{\text{Focaler-Shape-IoU}}),$$
where $L_{\text{Focaler-Shape-IoU}} = (1 – \lambda) \times L_{\text{Shape-IoU}}$ with a tuning parameter $\lambda$. This composite loss enhances robustness for small target regression in UAV drone scenarios by addressing difficulty, shape, and matching accuracy simultaneously.
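Assembling the composite loss is then a one-liner, shown here with precomputed component losses and an illustrative $\lambda = 0.5$ (the paper does not fix a value in this excerpt):

```python
def fsw_iou_loss(l_wise, l_shape, lam=0.5):
    """Combine component losses per the FSW-IoU definition above.

    l_wise and l_shape are precomputed Wise-IoU and Shape-IoU losses;
    lam is the tuning parameter lambda. The intermediate Focaler-Shape
    term is (1 - lam) * l_shape, as stated in the text."""
    l_focaler_shape = (1 - lam) * l_shape
    return l_wise + lam * (1 - l_focaler_shape)
```

In practice the two component losses would be computed per box pair from the formulas in the preceding subsections before being combined.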
Dynamic Head (DyHead) Integration
The detection head in YOLOv8s uses fixed convolutional kernels, which may limit adaptability to small targets. I replace it with DyHead, which employs dynamic attention mechanisms across scale, spatial, and task dimensions. For a feature tensor $\mathbf{F} \in \mathbb{R}^{A \times B \times C}$, DyHead applies three sequential attentions:
$$\mathbf{W}(\mathbf{F}) = \pi_C\Big(\pi_B\big(\pi_A(\mathbf{F}) \cdot \mathbf{F}\big) \cdot \mathbf{F}\Big) \cdot \mathbf{F},$$
where $\pi_A$, $\pi_B$, and $\pi_C$ are attention functions for dimensions $A$ (scale), $B$ (space), and $C$ (task), respectively. Specifically:
$$\pi_A(\mathbf{F}) \cdot \mathbf{F} = \sigma(f(\mathbf{F})) \cdot \mathbf{F},$$
$$\pi_B(\mathbf{F}) \cdot \mathbf{F} = \frac{1}{A} \sum_{a=1}^{A} \sum_{k=1}^{K} \omega_{a,k} \cdot \mathbf{F}(a, p_k + \Delta p_k, \cdot) \cdot \Delta m_k,$$
$$\pi_C(\mathbf{F}) \cdot \mathbf{F} = \max(\alpha_1 \cdot \mathbf{F}_C + \beta_1, \alpha_2 \cdot \mathbf{F}_C + \beta_2),$$
with $\sigma$ as a hard sigmoid, $f$ a linear function, $K$ sparse sampling locations, $\Delta p_k$ and $\Delta m_k$ learned offsets and scalars, and $\alpha_i, \beta_i$ learned parameters. DyHead improves feature expression and task adaptability, crucial for detecting small targets in cluttered UAV drone images.
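As a reading aid, the scale attention $\pi_A$ can be sketched on toy data. Two assumptions are made: the hard sigmoid uses the common $\max(0, \min(1, (x+3)/6))$ form, and the linear function $f$ (a 1×1 convolution in DyHead) is reduced to a single weight and bias.

```python
def hard_sigmoid(x):
    """Piecewise-linear sigmoid approximation (assumed (x+3)/6 form)."""
    return max(0.0, min(1.0, (x + 3.0) / 6.0))

def scale_attention(levels, weight=1.0, bias=0.0):
    """Sketch of pi_A: each pyramid level is rescaled by a hard-sigmoid
    gate computed from its own global average.

    `levels` is a list of flat feature lists (one per scale); the 1x1
    convolution f(.) is reduced to a scalar weight/bias for illustration."""
    out = []
    for feat in levels:
        g = sum(feat) / len(feat)                 # global average over space
        gate = hard_sigmoid(weight * g + bias)    # sigma(f(F))
        out.append([v * gate for v in feat])      # pi_A(F) . F
    return out

levels = [[0.5, 1.5], [3.0, 9.0]]
gated = scale_attention(levels)
```

Levels with strong average response saturate the gate and pass through unchanged, while weaker levels are attenuated, which is the scale-reweighting behaviour described above.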
Experimental Setup
To validate YOLOv8s-LW, I conduct experiments on the VisDrone-2019 dataset, a benchmark for UAV drone object detection. This dataset contains images captured by UAV drones in various complex scenarios, with numerous small targets and challenging conditions like occlusion and dense crowds.
Dataset: VisDrone-2019 includes 6,471 training images and 548 validation images, annotated with 10 object categories (e.g., pedestrian, car, bus). Over 77% of targets are small (e.g., less than 50×50 pixels), and about 50% suffer from significant occlusion, making it ideal for testing UAV drone small target detection algorithms.
Evaluation Metrics: I use mean Average Precision at IoU threshold 0.5 (mAP@0.5), parameter count, computational cost in FLOPs, and inference speed in frames per second (FPS). These metrics assess accuracy, lightweightness, and real-time capability for UAV drone applications.
Implementation Details: Experiments are performed on a machine with 30GB of memory and an NVIDIA GeForce RTX 3080 GPU, using PyTorch 1.11.0, Python 3.8, and CUDA 11.3. The input image size is 640×640. Training follows standard practices with data augmentation, and results are averaged over multiple runs.
Results and Analysis
The results are presented in two parts: ablation studies to evaluate individual components, and comparative experiments against state-of-the-art models.
Ablation Studies
I systematically add modules to the baseline YOLOv8s to assess their impact. Table 1 summarizes the results, where each row corresponds to a configuration with specific modules included (√) or not (⊘).
| Experiment | DyHead | GD_HGNetV2 | C2f_GDC | FSW-IoU | C2f_FB_EMA | mAP@0.5 (%) | Params (10⁶) | FLOPs (10⁹) | FPS |
|---|---|---|---|---|---|---|---|---|---|
| 1 (Baseline) | ⊘ | ⊘ | ⊘ | ⊘ | ⊘ | 38.3 | 11.13 | 28.5 | 83.0 |
| 2 | √ | ⊘ | ⊘ | ⊘ | ⊘ | 39.4 | 10.85 | 28.1 | 64.0 |
| 3 | √ | √ | ⊘ | ⊘ | ⊘ | 38.0 | 8.03 | 22.6 | 64.4 |
| 4 | √ | √ | √ | ⊘ | ⊘ | 37.2 | 6.41 | 18.9 | 66.9 |
| 5 | √ | √ | √ | √ | ⊘ | 37.8 | 6.41 | 18.9 | 66.1 |
| 6 (YOLOv8s-LW) | √ | √ | √ | √ | √ | 38.5 | 6.42 | 19.1 | 64.0 |
From Table 1, DyHead alone boosts mAP@0.5 by 1.1 percentage points (Experiment 2), highlighting its effectiveness for small target detection. Adding GD_HGNetV2 reduces parameters and FLOPs significantly but slightly decreases accuracy (Experiment 3). C2f_GDC trims the model further (Experiment 4), while FSW-IoU recovers some accuracy (Experiment 5). The full YOLOv8s-LW (Experiment 6) achieves the best balance: mAP@0.5 of 38.5% with only 6.4×10⁶ parameters and 19.1×10⁹ FLOPs, a 43% parameter reduction and 33% FLOP reduction from the baseline, plus a 0.2-point accuracy gain. FPS remains above 60, suitable for real-time UAV drone applications.
To illustrate the training dynamics, Figure 2 plots loss curves for bounding box regression, classification, and confidence regression. YOLOv8s-LW shows lower regression loss over epochs, indicating improved localization, especially for small targets. The heatmap visualizations in Figure 3 confirm that the model focuses accurately on small target regions, with predicted bounding boxes aligning well with high-attention areas. These findings validate the design choices for UAV drone scenarios.
Comparative Experiments
I compare YOLOv8s-LW with several state-of-the-art detectors on the VisDrone-2019 dataset. Results are in Table 2, which includes mAP@0.5 for each category, overall mAP@0.5, parameter count, and FLOPs. The models range from lightweight variants like YOLOv5n to larger ones like YOLOv9c.
| Model | Pedestrian | People | Bicycle | Motor | Van | Truck | Tricycle | Awning-Tricycle | Bus | Car | mAP@0.5 (%) | Params (10⁶) | FLOPs (10⁹) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| YOLOv5n | 29.2 | 24.6 | 2.9 | 29.4 | 15.9 | 15.4 | 8.8 | 6.1 | 23.9 | 65.5 | 22.2 | 1.8 | 4.2 |
| YOLOv5s | 37.6 | 29.6 | 7.7 | 35.1 | 30.5 | 22.9 | 15.1 | 10.2 | 35.2 | 71.3 | 29.5 | 7.0 | 15.8 |
| YOLOv6n | 26.0 | 18.7 | 3.0 | 28.6 | 32.8 | 21.5 | 14.3 | 7.3 | 34.4 | 70.8 | 25.7 | 4.6 | 11.3 |
| YOLOx-n | 34.7 | 30.6 | 7.6 | 38.4 | 33.8 | 24.3 | 18.8 | 9.8 | 39.6 | 74.3 | 31.2 | 2.0 | 5.6 |
| YOLOv7t | 37.4 | 33.6 | 7.1 | 41.1 | 35.4 | 25.1 | 18.2 | 8.1 | 45.5 | 74.2 | 32.6 | 6.0 | 13.1 |
| YOLOv8n | 32.9 | 26.2 | 7.2 | 33.4 | 37.5 | 26.9 | 20.1 | 11.3 | 44.8 | 75.0 | 31.5 | 3.0 | 8.1 |
| YOLOv8s | 41.1 | 32.2 | 12.0 | 43.1 | 44.3 | 35.1 | 26.5 | 15.0 | 54.9 | 79.0 | 38.3 | 11.1 | 28.5 |
| YOLOv9c | 50.8 | 40.3 | 23.0 | 53.4 | 52.8 | 49.4 | 40.8 | 21.3 | 66.6 | 83.4 | 48.2 | 50.7 | 236.7 |
| YOLOv10L | 30.5 | 18.9 | 13.2 | 34.5 | 43.7 | 48.5 | 22.5 | 23.7 | 61.3 | 75.5 | 37.2 | 25.7 | 126.4 |
| YOLOv8s-LW | 41.4 | 31.8 | 11.5 | 43.6 | 44.6 | 35.9 | 25.7 | 14.8 | 56.6 | 79.3 | 38.5 | 6.4 | 19.2 |
YOLOv8s-LW achieves a mAP@0.5 of 38.5%, outperforming most lightweight models and matching the baseline YOLOv8s in accuracy while using far fewer resources. For instance, YOLOv9c attains a higher mAP@0.5 (48.2%), but at the cost of 7.9× more parameters and 12.3× more FLOPs, making it less suitable for UAV drone edge deployment. YOLOv8s-LW strikes a strong balance, with 64 FPS ensuring real-time performance for UAV drone operations. Gains over the baseline appear in most categories, including small, dense targets such as pedestrians and motors as well as vans, trucks, and buses, demonstrating its efficacy for UAV drone small target detection.
To further analyze efficiency, Figure 4 compares mAP@0.5, parameters, and FLOPs for key models. YOLOv8s-LW sits in a favorable region, offering high accuracy with low computational demand. This is critical for UAV drone systems, where onboard processing power is limited, and real-time response is essential for applications like surveillance and traffic monitoring.
Discussion and Implications
The success of YOLOv8s-LW stems from its holistic lightweight design tailored for UAV drone challenges. The GD_HGNetV2 backbone reduces computational burden without compromising feature extraction, vital for handling the low-resolution small targets common in UAV drone imagery. The improved C2f modules enhance multi-scale fusion, allowing the network to aggregate features from different levels effectively, which is key for detecting objects of varying sizes in UAV drone views. The FSW-IoU loss addresses regression nuances, improving localization for difficult small targets often encountered in UAV drone datasets. Lastly, DyHead provides dynamic adaptability, boosting detection consistency in complex UAV drone backgrounds.
From an engineering perspective, YOLOv8s-LW is highly deployable on UAV drone platforms. Its moderate parameter count and FLOPs align with typical edge device capabilities, such as NVIDIA Jetson series or similar embedded systems. The inference speed of 64 FPS meets real-time requirements for UAV drone video streaming, enabling immediate analysis and decision-making. Moreover, the modular design allows for easy customization; for example, one could swap GD_HGNetV2 with other lightweight backbones or adjust the attention mechanisms in C2f_FB_EMA for specific UAV drone scenarios.
Limitations include a drop in FPS relative to the baseline (from 83 to 64), which is acceptable given the accuracy and lightweighting gains. Future work could explore further optimizations, such as neural architecture search for optimal module combinations, or adaptation to other UAV drone datasets with different environmental conditions. Additionally, extending YOLOv8s-LW to multi-task learning, such as simultaneous detection and segmentation for UAV drone applications, could enhance its utility.
Conclusion
In this paper, I presented YOLOv8s-LW, a lightweight baseline network for UAV drone small target detection. By integrating GD_HGNetV2, C2f_GDC, C2f_FB_EMA, FSW-IoU, and DyHead, the algorithm achieves a remarkable balance between accuracy and efficiency. On the VisDrone-2019 dataset, it attains a mAP@0.5 of 38.5% with only 6.4×10⁶ parameters and 19.4×10⁹ FLOPs, outperforming the baseline YOLOv8s in lightweightness while slightly improving accuracy. Ablation studies confirm the positive contributions of each component, and comparative experiments show its competitiveness with state-of-the-art models.
YOLOv8s-LW serves as a practical solution for UAV drone edge deployment, offering real-time performance and robust detection in complex aerial scenes. Its design paradigm is transferable to other UAV drone vision tasks, providing a foundation for future research in lightweight deep learning for aerial robotics. As UAV drone technology continues to evolve, algorithms like YOLOv8s-LW will play a crucial role in enabling intelligent, autonomous operations across diverse applications.
