Improved Lightweight Small Target Detection Algorithm for Camera Drone Aerial Images Based on YOLOv8s

With the rapid development of camera drone technology, its applications in military reconnaissance, disaster rescue, agricultural monitoring, and urban planning have expanded significantly. Particularly in target detection, the unique aerial perspective of camera drones enables rapid, large-scale acquisition of high-resolution images. However, challenges such as complex backgrounds, small target sizes, occlusion, variable lighting, and weather conditions complicate detection tasks. Small targets often exhibit incomplete features, leading to missed detections and false positives. Effectively leveraging limited feature information to enhance detection performance remains critical for camera drone operations.

Current target detection algorithms primarily rely on deep learning models, categorized into two-stage (e.g., R-CNN variants) and single-stage approaches (e.g., RetinaNet, SSD, YOLO series). While two-stage algorithms offer higher accuracy, their computational intensity hinders real-time performance in dynamic camera drone environments. Single-stage algorithms like YOLOv8 balance speed and accuracy, making them ideal for deployment on resource-constrained UAV platforms. This work selects YOLOv8s as the baseline model due to its optimal trade-off between real-time inference and precision.

Existing studies have attempted to address camera drone-specific challenges. TPH-YOLOv5 integrated Transformer encoders and CBAM modules to boost performance. UAV-YOLOv8 enhanced accuracy using WIoUv3 and BiFormer attention but increased computational load. Other approaches incorporated lightweight blocks like GhostBlockV2 or attention mechanisms (CoordAtt, shuffle attention), yet struggled with persistent issues like occlusion handling and small-category recognition. Despite progress, limitations remain in computational efficiency, robustness for complex aerial scenes, and balanced precision-speed performance for real-time camera UAV applications.

Methodology: RTA-YOLOv8s Architecture

We propose RTA-YOLOv8s – an enhanced YOLOv8s model incorporating four key innovations to address camera drone-specific challenges: 1) RepVGG re-parameterized backbone for efficient feature extraction, 2) Triplet attention for cross-dimensional feature refinement, 3) P2 small-target detection head for enhanced granularity, and 4) WIoUv3 loss for robust bounding box regression. Figure 1 illustrates the modified architecture.

1. RepVGG Re-parameterized Backbone

Replacing standard convolutions with RepVGG blocks enhances feature extraction efficiency. During training, RepVGG uses a multi-branch topology (a 3×3 convolution, a 1×1 convolution, and an identity mapping), modeled as:

$$ y = f(x) + g(x) + x $$

where \( f(x) \) is the 3×3 branch, \( g(x) \) is the 1×1 branch, and the identity term \( x \) is kept only when the input and output dimensions match (otherwise it is dropped). During inference, the branches are re-parameterized into a single 3×3 convolution, significantly reducing computational overhead while maintaining accuracy, which suits real-time camera drone systems. The transformation is expressed as:

$$ W' = W^{(3)} + \text{pad}\big(W^{(1)}\big) + I $$

where \( W^{(3)} \), \( W^{(1)} \), and \( I \) denote the 3×3 convolution weights, the 1×1 convolution weights zero-padded to 3×3, and the identity branch expressed as an equivalent 3×3 kernel, respectively. This structural re-parameterization accelerates inference by 23% compared to vanilla convolutions.
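
The fusion step can be illustrated with a minimal PyTorch sketch. Here `fuse_repvgg_branches` is a hypothetical helper that assumes batch normalization has already been folded into each branch's weights; it is not the exact routine used in RTA-YOLOv8s.

```python
import torch
import torch.nn.functional as F

def fuse_repvgg_branches(w3x3: torch.Tensor,
                         w1x1: torch.Tensor,
                         channels: int) -> torch.Tensor:
    """Merge a 3x3 branch, a 1x1 branch, and an identity branch into one
    3x3 kernel (BN assumed already folded into the branch weights).

    w3x3: (C, C, 3, 3) weights of the 3x3 branch
    w1x1: (C, C, 1, 1) weights of the 1x1 branch
    """
    # Zero-pad the 1x1 kernel to 3x3 so it can be summed with the 3x3 kernel.
    w1x1_padded = F.pad(w1x1, [1, 1, 1, 1])

    # The identity branch becomes a 3x3 kernel with a 1 at the centre of each
    # channel's own filter and 0 elsewhere.
    w_id = torch.zeros(channels, channels, 3, 3)
    for c in range(channels):
        w_id[c, c, 1, 1] = 1.0

    # Structural re-parameterization: one kernel carries all three branches.
    return w3x3 + w1x1_padded + w_id

# Equivalence check: the fused kernel reproduces the multi-branch output.
x = torch.randn(1, 8, 32, 32)
w3, w1 = torch.randn(8, 8, 3, 3), torch.randn(8, 8, 1, 1)
y_multi = F.conv2d(x, w3, padding=1) + F.conv2d(x, w1) + x
y_fused = F.conv2d(x, fuse_repvgg_branches(w3, w1, 8), padding=1)
print(torch.allclose(y_multi, y_fused, atol=1e-4))  # True
```

The equivalence check at the end confirms that the single fused kernel reproduces the multi-branch output, which is what allows the inference graph to contain only plain 3×3 convolutions.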

2. Triplet Attention Mechanism

Embedded within the backbone, Triplet attention captures cross-dimension dependencies without channel-spatial decoupling. It processes input \( \mathbf{X} \in \mathbb{R}^{C \times H \times W} \) through three parallel branches:

Branch 1 (Spatial Interaction on H-W Dimensions): Applies Z-pool for channel compression:

$$ \text{Z-Pool}(\mathbf{X}) = \big[\text{MaxPool}_{0d}(\mathbf{X}), \text{AvgPool}_{0d}(\mathbf{X})\big] \in \mathbb{R}^{2 \times H \times W} $$

followed by 7×7 convolution, batch normalization, and sigmoid activation to generate weights \( \mathbf{A}_{hw} \).

Branch 2 (C-W Interaction): Permutes input to \( \mathbb{R}^{H \times C \times W} \), processes via Z-pool + 7×7 conv, and permutes back after sigmoid activation (\( \mathbf{A}_{cw} \)).

Branch 3 (C-H Interaction): Permutes to \( \mathbb{R}^{W \times H \times C} \), applies same operations, and permutes back (\( \mathbf{A}_{ch} \)). Final output combines weighted features:

$$ \mathbf{X}_{\text{out}} = \frac{1}{3} \big( \mathbf{A}_{hw} \odot \mathbf{X} + \mathbf{A}_{cw} \odot \mathbf{X} + \mathbf{A}_{ch} \odot \mathbf{X} \big) $$

where \( \odot \) denotes element-wise multiplication. This mechanism amplifies small-target features with minimal FLOPs increase (0.78%), crucial for cluttered camera drone imagery.
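
For concreteness, the three branches can be sketched as a small PyTorch module. Class and layer names are illustrative, and the structure is simplified relative to the published Triplet Attention implementation.

```python
import torch
import torch.nn as nn

class ZPool(nn.Module):
    """Concatenate max- and mean-pooling along dim 1 (the axis being
    compressed in whichever permutation is currently processed)."""
    def forward(self, x):
        return torch.cat([x.max(dim=1, keepdim=True)[0],
                          x.mean(dim=1, keepdim=True)], dim=1)

class AttentionGate(nn.Module):
    """Z-pool -> 7x7 conv -> BN -> sigmoid, producing one attention map."""
    def __init__(self):
        super().__init__()
        self.pool = ZPool()
        self.conv = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3, bias=False),
            nn.BatchNorm2d(1))
    def forward(self, x):
        return torch.sigmoid(self.conv(self.pool(x)))

class TripletAttention(nn.Module):
    def __init__(self):
        super().__init__()
        self.gate_hw = AttentionGate()   # branch 1: H-W interaction
        self.gate_cw = AttentionGate()   # branch 2: C-W interaction
        self.gate_ch = AttentionGate()   # branch 3: C-H interaction
    def forward(self, x):                # x: (B, C, H, W)
        # Branch 1: attend directly over the (H, W) plane.
        a_hw = self.gate_hw(x) * x
        # Branch 2: rotate so C and W interact, then rotate back.
        x_cw = x.permute(0, 2, 1, 3)                 # (B, H, C, W)
        a_cw = (self.gate_cw(x_cw) * x_cw).permute(0, 2, 1, 3)
        # Branch 3: rotate so C and H interact, then rotate back.
        x_ch = x.permute(0, 3, 2, 1)                 # (B, W, H, C)
        a_ch = (self.gate_ch(x_ch) * x_ch).permute(0, 3, 2, 1)
        # Average the three re-weighted tensors.
        return (a_hw + a_cw + a_ch) / 3.0

print(TripletAttention()(torch.randn(2, 64, 40, 40)).shape)  # (2, 64, 40, 40)
```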

3. P2 Small-Target Detection Head

To counter feature loss from successive downsampling, we add a P2 detection head (160×160 resolution) alongside the original P3-P5 heads. The deeper neck feature is upsampled to 160×160 and fused by concatenation with the shallow stride-4 feature from backbone layer 2:

$$ \mathbf{F}_{P2} = \text{Conv}_{3\times3}\big( \text{UpSample}(\mathbf{F}_{P3}) \parallel \mathbf{F}_{\text{backbone}}^{(2)} \big) $$

where \( \parallel \) denotes channel-wise concatenation. This quadruple-scale fusion (P2-P5) improves small-target recall by 11.3% on camera drone datasets.
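
A simplified sketch of this fusion step is given below, assuming the standard YOLOv8 neck layout for a 640×640 input (P3 neck feature at 80×80, stride-4 backbone feature at 160×160); the channel widths are illustrative placeholders.

```python
import torch
import torch.nn as nn

class P2Fusion(nn.Module):
    """Upsample the P3 neck feature to 160x160, concatenate it with the
    stride-4 backbone feature, and fuse with a 3x3 conv before detection."""
    def __init__(self, c_p3=128, c_shallow=64, c_out=64):
        super().__init__()
        self.upsample = nn.Upsample(scale_factor=2, mode='nearest')
        self.fuse = nn.Sequential(
            nn.Conv2d(c_p3 + c_shallow, c_out, 3, padding=1, bias=False),
            nn.BatchNorm2d(c_out),
            nn.SiLU())
    def forward(self, f_p3, f_shallow):
        # f_p3:      (B, c_p3, 80, 80)        neck feature at stride 8
        # f_shallow: (B, c_shallow, 160, 160) backbone feature at stride 4
        x = torch.cat([self.upsample(f_p3), f_shallow], dim=1)
        return self.fuse(x)  # (B, c_out, 160, 160), fed to the P2 detect head

p2 = P2Fusion()(torch.randn(1, 128, 80, 80), torch.randn(1, 64, 160, 160))
print(p2.shape)  # torch.Size([1, 64, 160, 160])
```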

4. WIoUv3 Bounding Box Regression

Replacing CIoU, WIoUv3 employs a dynamic non-monotonic focusing mechanism to handle occlusion and scale variance in drone imagery. Building upon WIoUv1:

$$ \mathcal{L}_{\text{WIoUv1}} = \mathcal{L}_{\text{IoU}} \cdot R_{\text{WIoU}}, \quad R_{\text{WIoU}} = \exp\Bigg(\frac{(x - x_{gt})^2 + (y - y_{gt})^2}{\big(W_g^2 + H_g^2\big)^{*}}\Bigg) $$

where \( (x, y) \) and \( (x_{gt}, y_{gt}) \) are the centers of the predicted and ground-truth boxes, \( W_g \) and \( H_g \) are the width and height of the smallest enclosing box, and the superscript \( * \) indicates the term is detached from the computation graph.

WIoUv3 introduces outlier suppression via:

$$ \mathcal{L}_{\text{WIoUv3}} = r \cdot \mathcal{L}_{\text{WIoUv1}}, \quad r = \frac{\beta}{\delta \alpha^{\beta - \delta}} $$

where \( \beta = \mathcal{L}_{\text{IoU}}^{*} / \overline{\mathcal{L}}_{\text{IoU}} \) is the outlier degree, defined as the detached IoU loss of the current anchor divided by its running mean, and \( \delta \) and \( \alpha \) are hyperparameters (empirically set to 1.5 and 2.5). The gradient gain allocation strategy prioritizes ordinary-quality anchors, reducing harmful gradients from low-quality camera drone samples.
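
The loss can be sketched compactly as follows, assuming \( \overline{\mathcal{L}}_{\text{IoU}} \) is maintained as an exponential moving average across training steps; the epsilon terms and momentum value are illustrative, and this is a simplified standalone version rather than the exact loss code used in RTA-YOLOv8s.

```python
import torch

def wiou_v3_loss(pred, target, iou_mean, alpha=2.5, delta=1.5, momentum=0.01):
    """pred, target: (N, 4) boxes as (x1, y1, x2, y2).
    iou_mean: running mean of L_IoU (scalar), updated and returned.
    Returns (per-box WIoUv3 loss, updated running mean)."""
    # --- IoU loss ---
    lt = torch.max(pred[:, :2], target[:, :2])
    rb = torch.min(pred[:, 2:], target[:, 2:])
    wh = (rb - lt).clamp(min=0)
    inter = wh[:, 0] * wh[:, 1]
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    l_iou = 1.0 - inter / (area_p + area_t - inter + 1e-7)

    # --- WIoUv1: distance-based focusing term R_WIoU ---
    cx_p, cy_p = (pred[:, 0] + pred[:, 2]) / 2, (pred[:, 1] + pred[:, 3]) / 2
    cx_t, cy_t = (target[:, 0] + target[:, 2]) / 2, (target[:, 1] + target[:, 3]) / 2
    ex1 = torch.min(pred[:, 0], target[:, 0])
    ey1 = torch.min(pred[:, 1], target[:, 1])
    ex2 = torch.max(pred[:, 2], target[:, 2])
    ey2 = torch.max(pred[:, 3], target[:, 3])
    # Smallest enclosing box, detached (*) so it contributes no gradients.
    wg2_hg2 = ((ex2 - ex1) ** 2 + (ey2 - ey1) ** 2).detach()
    r_wiou = torch.exp(((cx_p - cx_t) ** 2 + (cy_p - cy_t) ** 2) / (wg2_hg2 + 1e-7))
    l_wiou_v1 = r_wiou * l_iou

    # --- WIoUv3: dynamic non-monotonic focusing via the outlier degree beta ---
    beta = l_iou.detach() / iou_mean              # beta = L_IoU* / mean(L_IoU)
    r = beta / (delta * alpha ** (beta - delta))  # gradient gain
    iou_mean = (1 - momentum) * iou_mean + momentum * l_iou.detach().mean()
    return r * l_wiou_v1, iou_mean
```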

Experimental Analysis

Dataset and Implementation

Experiments used the VisDrone2019 dataset containing 8,629 annotated images across 10 categories (pedestrian, car, bus, etc.). We split data 7:2:1 (train:validation:test). Training occurred on an NVIDIA RTX 3090 with input resolution 640×640, batch size 8, SGD optimizer (lr=0.01, momentum=0.937), and early stopping. Metrics include precision (P), recall (R), mAP@0.5, mAP@0.5:0.95, parameters (Np), FLOPs, and frames per second (FPS).
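
For reference, the reported training configuration maps onto the Ultralytics training API roughly as follows; the dataset YAML path and the epochs/patience values are placeholders rather than values taken from the paper.

```python
from ultralytics import YOLO

# Baseline YOLOv8s trained with the settings reported above.
model = YOLO("yolov8s.pt")  # COCO-pretrained starting point (assumption)
model.train(
    data="VisDrone.yaml",   # placeholder: 10-class VisDrone2019 split 7:2:1
    imgsz=640,              # input resolution 640x640
    batch=8,
    optimizer="SGD",
    lr0=0.01,
    momentum=0.937,
    epochs=300,             # placeholder; early stopping ends training sooner
    patience=50,            # early stopping patience (placeholder)
    device=0,               # single NVIDIA RTX 3090
)
```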

Ablation Study

Table 1 validates each component’s contribution on YOLOv8s baseline. Integrating all innovations (Row 10) achieves optimal results: mAP@0.5=44.9% (+6.1% over baseline), precision=54.5% (+4.9%), recall=43.0% (+4.8%), with 13.9% parameter reduction.

Table 1: Ablation study of RTA-YOLOv8s components on VisDrone2019
| Group | Baseline | RepVGG | Triplet | P2 | WIoUv3 | P(%) | R(%) | mAP50(%) | mAP50:95(%) | Np(M) | FLOPs(G) | FPS |
|-------|----------|--------|---------|----|--------|------|------|----------|-------------|-------|----------|-----|
| 1 | ✓ | | | | | 49.6 | 38.2 | 38.8 | 23.2 | 11.1 | 28.5 | 128.2 |
| 2 | ✓ | ✓ | | | | 50.1 | 38.8 | 44.2 | 23.8 | 11.3 | 28.9 | 131.5 |
| 3 | ✓ | | ✓ | | | 51.3 | 39.3 | 40.6 | 24.3 | 11.2 | 29.6 | 99.1 |
| 4 | ✓ | | | ✓ | | 54.0 | 42.7 | 44.2 | 26.9 | 10.6 | 36.7 | 139.8 |
| 5 | ✓ | | | | ✓ | 50.5 | 39.2 | 40.3 | 24.2 | 11.1 | 28.5 | 123.5 |
| 10 | ✓ | ✓ | ✓ | ✓ | ✓ | 54.5 | 43.0 | 44.9 | 27.1 | 10.9 | 38.2 | 88.5 |

Comparative Evaluation

Table 2 compares RTA-YOLOv8s against state-of-the-art models. Our method achieves the highest mAP@0.5 (44.9%) while maintaining real-time speed (88.5 FPS). Notably, it excels in small categories like bicycle (19.0% AP) and tricycle (31.2% AP), demonstrating efficacy for camera drone scenarios.

Table 2: Performance comparison on VisDrone2019 test set
| Model | mAP50(%) | Pedestrian(%) | People(%) | Bicycle(%) | Car(%) | Van(%) | Truck(%) | Tricycle(%) | Awning-Tricycle(%) | Bus(%) | Motor(%) | FPS |
|-------|----------|---------------|-----------|------------|--------|--------|----------|-------------|--------------------|--------|----------|-----|
| RetinaNet | 31.4 | 28.6 | 20.3 | 9.8 | 73.2 | 33.4 | 31.8 | 15.5 | 14.3 | 58.0 | 25.3 | 23.1 |
| YOLOv5s | 35.0 | 40.0 | 32.1 | 12.6 | 73.9 | 36.8 | 32.9 | 22.0 | 12.8 | 47.5 | 39.2 | 142.0 |
| TPH-YOLOv5 | 36.9 | 29.0 | 16.7 | 15.7 | 68.9 | 49.8 | 45.1 | 27.3 | 24.7 | 61.8 | 30.9 | 56.7 |
| YOLOv8s | 38.8 | 41.6 | 32.2 | 13.5 | 79.3 | 45.0 | 36.6 | 28.3 | 15.9 | 54.2 | 43.4 | 128.2 |
| YOLOv10s | 40.7 | 41.1 | 24.6 | 16.1 | 74.9 | 48.4 | 51.8 | 24.5 | 21.8 | 64.1 | 39.8 | 195.3 |
| RTA-YOLOv8s (Ours) | 44.9 | 52.1 | 42.5 | 19.0 | 84.5 | 48.9 | 39.0 | 31.2 | 19.7 | 58.6 | 53.2 | 88.5 |

Visualization and Qualitative Analysis

Grad-CAM visualizations (Figure 2) confirm RTA-YOLOv8s’ superior focus on small targets. While baseline YOLOv8s exhibits scattered attention on background elements (e.g., buildings, trees), our model concentrates heatmaps precisely on pedestrians and vehicles, even under occlusion. The P2 head and Triplet attention collectively enhance feature localization in low-visibility and dense scenes common to camera UAV operations.

Figure 3 showcases detection results across challenging conditions. RTA-YOLOv8s consistently reduces missed detections at image edges and in dense or low-light scenes, as well as false positives under strong exposure, validating its robustness for real-world camera drone deployments.

System Implementation

We developed an intelligent detection system using PyQt5, integrating RTA-YOLOv8s for real-time analysis of camera drone footage. The framework (Figure 4) supports multiple input sources (local files, live camera feeds) and provides:

  • Adjustable confidence/IoU thresholds
  • Real-time detection visualization with bounding boxes, labels, and confidence scores
  • Per-category object counting and coordinate logging
  • Performance metrics (e.g., inference time)

For a 1280×720 resolution input, the system processes frames in 183 ms (5.46 FPS) on an Intel i5-12400F + RTX 3090, suitable for edge deployment on drone control units. This interface enhances operational efficiency for UAV-based surveillance and monitoring tasks.
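
The per-frame path behind the interface can be sketched with the Ultralytics predict API; the weights file name, video source, and default thresholds are illustrative, and the PyQt5 widget wiring is omitted.

```python
import cv2
from ultralytics import YOLO

model = YOLO("rta_yolov8s.pt")  # placeholder weights file name

def detect_frame(frame, conf_thres=0.25, iou_thres=0.45):
    """Run detection on one BGR frame and return the annotated frame,
    per-category counts, and inference time, mirroring the interface display."""
    result = model.predict(frame, conf=conf_thres, iou=iou_thres, verbose=False)[0]
    counts = {}
    for box in result.boxes:
        label = model.names[int(box.cls)]
        counts[label] = counts.get(label, 0) + 1
    annotated = result.plot()  # boxes, labels, and confidence scores drawn in
    return annotated, counts, result.speed["inference"]  # inference time in ms

cap = cv2.VideoCapture("drone_footage.mp4")  # or a live camera index
ok, frame = cap.read()
if ok:
    annotated, counts, ms = detect_frame(frame)
    print(counts, f"{ms:.1f} ms")
```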

Conclusion and Future Work

RTA-YOLOv8s significantly advances small-target detection for camera drone applications. By integrating RepVGG, Triplet attention, a P2 detection head, and WIoUv3 loss, it achieves 44.9% mAP@0.5 on VisDrone2019 – a 6.1% absolute improvement over YOLOv8s – while reducing parameters by 13.9% and maintaining real-time speed (88.5 FPS). The PyQt5-based system demonstrates practical viability for field operations. Future work will explore model pruning/knowledge distillation for further lightweighting and deployment on embedded camera UAV hardware like Jetson Orin.
