TB-YOLOv8: An Advanced Algorithm for Target Detection in UAV Drone Aerial Imagery

Target detection technology for UAV drone aerial imagery has become a critical tool across numerous fields, from environmental monitoring and disaster assessment to agricultural surveillance and security patrols. The efficiency, flexibility, and relatively low operational cost of UAV drone platforms have driven their widespread adoption. However, the very nature of the imagery captured by these UAV drone systems presents unique and formidable challenges for automated detection algorithms. These challenges primarily stem from the high-altitude perspective, which leads to scenes containing numerous small-scale objects, severe scale variation among objects of the same class, and densely packed targets that frequently occlude one another. Conventional deep learning-based detectors often struggle in such environments, suffering from high rates of missed detections for small objects and false positives in cluttered backgrounds, thereby limiting their practical application in real-time UAV drone operations.

To address these persistent issues, we propose TB-YOLOv8, a novel and enhanced object detection algorithm tailored specifically for the complexities of UAV drone imagery. Our method builds upon the robust YOLOv8 framework, introducing strategic modifications across the feature extraction, multi-scale fusion, and optimization objective components of the network. These modifications are designed to work in concert to improve feature discriminability, enhance fusion efficiency across scales, and refine localization accuracy, particularly for challenging small and dense objects.

The core of our improvement lies in three key innovations. First, we redesign the fundamental feature extraction module by integrating a Triplet Attention mechanism into the C2f structure of the backbone, creating a new C2f_T module. This enhancement allows the network to better capture spatial relationships and channel dependencies in shallow layers, which is crucial for identifying small and partially occluded targets common in UAV drone footage. Second, we replace the original Path Aggregation Feature Pyramid Network (PAFPN) in the neck with a more efficient and effective Bidirectional Feature Pyramid Network (BiFPN). This change facilitates superior multi-scale feature fusion through bidirectional cross-scale connections and learnable feature weighting, enabling the detector to handle the vast scale variations encountered from a UAV drone's viewpoint. Finally, we adopt the Wise-IoU version 3 (WIoUv3) loss function for bounding box regression. This dynamic loss function improves the model's robustness to varying quality of training samples and focuses gradient allocation on common-quality examples, leading to more stable convergence and better generalization performance on complex UAV drone datasets.

Comprehensive experiments conducted on the challenging VisDrone2019 benchmark dataset validate the effectiveness of TB-YOLOv8. Our algorithm demonstrates a significant performance uplift over the baseline YOLOv8s model and other contemporary detectors, achieving a strong balance of high detection accuracy, compact model size, and real-time inference speed—a combination essential for deployment on resource-constrained UAV drone platforms.

1. Introduction and Related Work

The advent of deep learning has revolutionized the field of computer vision, with object detection being one of the most impactful applications. For UAV drone operations, reliable real-time detection is paramount. Detection algorithms are broadly categorized into two-stage and one-stage paradigms. Two-stage detectors, such as Faster R-CNN, first generate region proposals and then classify and refine them. They are known for high accuracy but at the cost of slower inference speed. One-stage detectors, like the YOLO (You Only Look Once) series and SSD (Single Shot MultiBox Detector), perform localization and classification in a single forward pass, offering a superior speed-accuracy trade-off that is often more suitable for real-time UAV drone applications.

Despite their success in general scenarios, standard object detectors face significant hurdles when applied directly to aerial imagery captured by UAV drones. The primary challenges include:

1. Prevalence of Small Objects: Due to the high flying altitude, objects like pedestrians and vehicles occupy very few pixels in the image. Their limited visual features make them easy to miss or confuse with background noise.

2. Extreme Scale Variation: The same category of object (e.g., a car) can appear at vastly different scales depending on its distance from the UAV drone camera, requiring detectors to have excellent multi-scale representation capabilities.

3. High Object Density and Occlusion: Urban scenes, traffic junctions, and crowded areas frequently captured by UAV drones contain densely packed objects that often overlap, leading to confusion and missed detections.

4. Complex Backgrounds: Aerial scenes can include intricate patterns from buildings, roads, vegetation, and shadows, which can generate false positives.

Recent research has attempted to address these issues. Some works have focused on enhancing feature pyramids or designing specialized heads for small objects. Others have integrated attention mechanisms or transformer modules to improve spatial reasoning. For instance, UAV-YOLOv8 employed multi-scale feature fusion techniques, while HSP-YOLOv8 introduced a dedicated small-object detection head and space-to-depth convolutions. Although these methods yield improvements, they often come with increased computational complexity or do not holistically address the intertwined problems of scale, density, and feature discrimination.

Our proposed TB-YOLOv8 algorithm is designed to overcome these limitations through a synergistic set of improvements. By strengthening shallow feature extraction with attention, optimizing multi-scale fusion pathways, and employing a more sophisticated optimization objective, we aim to create a detector that is both highly accurate for UAV drone imagery and efficient enough for practical deployment.

2. Methodology: The TB-YOLOv8 Architecture

The overall architecture of TB-YOLOv8 is depicted in the network structure diagram. It maintains the general pipeline of YOLOv8 but incorporates our novel modules: the C2f_T module in the backbone, the BiFPN in the neck, and the WIoUv3 loss for training. We will now delve into the details of each component.

2.1. The C2f_T Module with Triplet Attention

The backbone network is responsible for extracting hierarchical features from the input image. The original YOLOv8 uses C2f modules, which provide a rich gradient flow. However, to better handle the small and dense targets in UAV drone imagery, we need to enhance the model’s ability to focus on informative spatial regions and channel features. We achieve this by integrating a Triplet Attention (TA) mechanism into the standard Bottleneck block within the C2f module, creating a Bottleneck_T block, and subsequently a C2f_T module.

Triplet Attention is a computationally efficient yet powerful attention mechanism that captures cross-dimensional dependencies. For an input feature map, it computes three parallel attention branches: two branches that model the interaction between the channel dimension and one spatial dimension (height or width, via tensor rotation), and one conventional spatial attention branch over the height–width plane. The process can be summarized as follows. Given an intermediate feature map X, each branch applies a Z-pool operation (concatenated channel-wise max- and average-pooling) to reduce dimensionality, then uses a convolutional layer to generate attention weights. These weights refine the feature map along the corresponding dimensions, and the three branch outputs are averaged.

Formally, the refinement can be expressed as part of the feature transformation within a bottleneck. The output Y of a Bottleneck_T block is a refined version of its input X:
$$ \mathbf{Y} = \text{TA}(\text{Conv}_{3\times3}(\text{Conv}_{1\times1}(\mathbf{X}))) + \mathbf{X} $$
where the TA() operation denotes the application of the Triplet Attention mechanism to the features. By inserting this TA-enhanced bottleneck into the C2f module, the resulting C2f_T module significantly boosts the network’s capacity to discern small and occluded objects in complex UAV drone backgrounds, leading to more discriminative feature maps for subsequent detection stages.
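To make the construction concrete, the following PyTorch sketch shows a minimal Triplet Attention operation and the Bottleneck_T block of the formula above. It is an illustration under simplifying assumptions, not the exact implementation: the class names, the 7×7 gate kernel, and the use of plain convolutions (omitting the BatchNorm/SiLU layers of the real C2f bottleneck) are choices made here for brevity.

```python
import torch
import torch.nn as nn

class ZPool(nn.Module):
    """Concatenate channel-wise max and mean, reducing C channels to 2."""
    def forward(self, x):
        return torch.cat([x.max(dim=1, keepdim=True)[0],
                          x.mean(dim=1, keepdim=True)], dim=1)

class AttentionGate(nn.Module):
    """Z-pool -> 7x7 conv -> sigmoid, producing a single-channel weight map."""
    def __init__(self):
        super().__init__()
        self.pool = ZPool()
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3, bias=False)

    def forward(self, x):
        return x * torch.sigmoid(self.conv(self.pool(x)))

class TripletAttention(nn.Module):
    """Three branches attend over different dimension pairs; outputs are averaged."""
    def __init__(self):
        super().__init__()
        self.cw = AttentionGate()  # channel-width interaction (H rotated into channel slot)
        self.ch = AttentionGate()  # channel-height interaction (W rotated into channel slot)
        self.hw = AttentionGate()  # plain spatial attention over (H, W)

    def forward(self, x):
        # Rotate so the attended dimension pair changes per branch, then rotate back.
        x_cw = self.cw(x.permute(0, 2, 1, 3)).permute(0, 2, 1, 3)  # swap C and H
        x_ch = self.ch(x.permute(0, 3, 2, 1)).permute(0, 3, 2, 1)  # swap C and W
        x_hw = self.hw(x)
        return (x_cw + x_ch + x_hw) / 3.0

class BottleneckT(nn.Module):
    """Bottleneck_T: Y = TA(Conv3x3(Conv1x1(X))) + X, matching the formula above."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 1, bias=False)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.ta = TripletAttention()

    def forward(self, x):
        return self.ta(self.conv2(self.conv1(x))) + x
```

Stacking this Bottleneck_T in place of the standard bottleneck inside C2f yields the C2f_T module; the attention adds only a single 2-to-1-channel convolution per branch, keeping the overhead small.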

2.2. Lightweight Bi-directional Feature Pyramid Network (BiFPN)

Effective fusion of features from different backbone levels is crucial for detecting objects at multiple scales. While the original PAFPN in YOLOv8 is effective, we adopt the more advanced BiFPN for its superior efficiency and fusion capability. BiFPN introduces learnable weights to fuse input features of different resolutions and enables fast, bidirectional (top-down and bottom-up) information flow.

The core idea is to allow features to be combined and refined multiple times across scales. If we denote features at different levels as \( \mathbf{P}_3, \mathbf{P}_4, \mathbf{P}_5, … \) (with increasing stride and correspondingly decreasing spatial resolution), BiFPN creates connections that allow, for example, a middle-level feature to receive context from both a higher-level (more semantic) and a lower-level (more detailed) feature. The fusion is performed with weighted summation, where the weights are learned during training. A simplified fusion step at a node can be represented as:
$$ \mathbf{P}_{i}^{out} = \text{Conv}\left( \frac{w_1 \cdot \mathbf{P}_{i}^{in} + w_2 \cdot \text{Resize}(\mathbf{P}_{i-1}^{out}) + w_3 \cdot \text{Resize}(\mathbf{P}_{i+1}^{in})}{w_1 + w_2 + w_3 + \epsilon} \right) $$
where \( w_1, w_2, w_3 \) are learnable weights per feature, and Resize() denotes upsampling or downsampling operations. This weighted bidirectional fusion ensures that the features passed to the detection heads are rich in both semantic and fine-grained spatial information, which is vital for accurately locating objects of all sizes in UAV drone imagery, from distant pedestrians to nearby vehicles.
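The weighted fusion step above can be sketched in PyTorch as follows. `WeightedFusion` and `BiFPNNode` are illustrative names, the ReLU-clamped "fast normalized fusion" follows the formulation BiFPN is known for, and BatchNorm/activation layers are omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightedFusion(nn.Module):
    """Fast normalized fusion: out = sum(w_i * x_i) / (sum(w_i) + eps),
    with the learnable weights kept non-negative via ReLU."""
    def __init__(self, n_inputs, eps=1e-4):
        super().__init__()
        self.w = nn.Parameter(torch.ones(n_inputs))
        self.eps = eps

    def forward(self, inputs):
        w = F.relu(self.w)  # non-negative learnable weights
        fused = sum(wi * x for wi, x in zip(w, inputs))
        return fused / (w.sum() + self.eps)

class BiFPNNode(nn.Module):
    """One fusion node: a mid-level feature combined with its resized
    neighbours, then refined by a 3x3 conv (the Conv(.) in the equation)."""
    def __init__(self, channels):
        super().__init__()
        self.fuse = WeightedFusion(3)
        self.conv = nn.Conv2d(channels, channels, 3, padding=1, bias=False)

    def forward(self, p_mid, p_low, p_high):
        # Resize both neighbours to the mid-level spatial size before fusing.
        low = F.interpolate(p_low, size=p_mid.shape[-2:], mode="nearest")
        high = F.interpolate(p_high, size=p_mid.shape[-2:], mode="nearest")
        return self.conv(self.fuse([p_mid, low, high]))
```

A full BiFPN layer repeats such nodes along the top-down and bottom-up paths; because the fusion weights are scalars per input, the added parameter cost is negligible.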

2.3. Wise-IoU v3 Loss Function

The choice of loss function for bounding box regression significantly impacts localization accuracy. Common losses like CIoU consider overlap area, center distance, and aspect ratio. However, they can be unstable when dealing with low-quality training examples (e.g., ambiguous or partially visible objects common in UAV drone data). We employ Wise-IoU version 3 (WIoUv3), which introduces a dynamic non-monotonic focusing mechanism.

WIoUv3 defines a loss based on the IoU metric but modulates it with a focusing coefficient that depends on the quality of the anchor box. It first defines an outlier degree \( \beta \):
$$ \beta = \frac{L_{IoU}}{\overline{L_{IoU}}} $$
Here, \( L_{IoU} = 1 - IoU \) for a given predicted box, and \( \overline{L_{IoU}} \) is a running mean of \( L_{IoU} \). A small \( \beta \) indicates a high-quality anchor box (its IoU loss is below the running average), while a large \( \beta \) marks a low-quality outlier.

The WIoUv3 loss is then calculated as:
$$ L_{WIoUv3} = r \, L_{WIoUv1}, \quad \text{where} \quad r = \frac{\beta}{\delta \alpha^{\beta - \delta}} $$
In this formulation, \( \alpha \) and \( \delta \) are hyperparameters. The modulating factor \( r \) assigns smaller gradient gains to high-quality anchors (preventing overfitting on simple examples) and also reduces the impact of low-quality outliers (preventing harmful gradients from poor examples). This dynamic adjustment makes the training process more robust and improves the model’s final localization performance on challenging UAV drone validation sets.
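A minimal NumPy sketch of the outlier degree and focusing factor is given below. The hyperparameter values \( \alpha = 1.9 \) and \( \delta = 3 \) are commonly used defaults and are assumptions here, as is the helper naming; the full WIoUv3 loss additionally multiplies the distance-aware \( L_{WIoUv1} \) term, which this sketch omits.

```python
import numpy as np

def iou_xyxy(pred, target):
    """IoU between axis-aligned boxes given as (x1, y1, x2, y2)."""
    x1 = np.maximum(pred[..., 0], target[..., 0])
    y1 = np.maximum(pred[..., 1], target[..., 1])
    x2 = np.minimum(pred[..., 2], target[..., 2])
    y2 = np.minimum(pred[..., 3], target[..., 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_p = (pred[..., 2] - pred[..., 0]) * (pred[..., 3] - pred[..., 1])
    area_t = (target[..., 2] - target[..., 0]) * (target[..., 3] - target[..., 1])
    return inter / (area_p + area_t - inter + 1e-9)

def wiou_v3_factor(l_iou, l_iou_mean, alpha=1.9, delta=3.0):
    """Non-monotonic focusing factor r = beta / (delta * alpha**(beta - delta)),
    where beta = L_IoU / mean(L_IoU) is the outlier degree."""
    beta = l_iou / (l_iou_mean + 1e-9)
    return beta / (delta * alpha ** (beta - delta))
```

Note that when \( \beta = \delta \) the factor reduces to \( r = 1 \), and it falls off for both much smaller and much larger \( \beta \): this is exactly the non-monotonic focusing that down-weights both trivially easy anchors and harmful outliers.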

3. Experiments and Results

We evaluated the performance of TB-YOLOv8 on the widely used VisDrone2019 dataset, a benchmark specifically for UAV drone vision tasks. It contains over 10,000 images with annotations for 10 object categories (e.g., pedestrian, car, van, truck) captured in various urban and suburban scenarios under different conditions.

3.1. Experimental Setup

All models were trained and evaluated using standard protocols. The input image size was set to 640×640 pixels. We used standard data augmentation techniques including mosaic, mixup, and random affine transformations. Training was conducted for 300 epochs. The performance was measured using standard metrics: mean Average Precision at IoU threshold 0.5 (mAP@0.5), Precision (P), Recall (R), F1-Score, number of parameters (Params), and inference speed in Frames Per Second (FPS).
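The scalar metrics above follow the standard detection definitions; the helper below is an illustrative sketch of those formulas, not the evaluation code used in the experiments.

```python
def detection_metrics(tp, fp, fn):
    """Precision, recall and F1 from true-positive, false-positive and
    false-negative counts, with guards against empty denominators."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

mAP@0.5 extends this by averaging precision over recall levels per class at an IoU-match threshold of 0.5, then averaging across the 10 VisDrone categories.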

3.2. Ablation Study

We conducted an ablation study to validate the contribution of each proposed component. The baseline is the standard YOLOv8s model. The results are summarized in the table below.

| Model Configuration | P (%) | R (%) | mAP@0.5 (%) | Params (M) | F1-Score |
|---|---|---|---|---|---|
| YOLOv8s (Baseline) | 50.0 | 39.4 | 40.1 | 11.1 | 0.43 |
| + BiFPN (B) | 52.4 | 40.3 | 41.4 | 10.2 | 0.45 |
| + BiFPN + WIoUv3 (B+W) | 52.6 | 40.5 | 41.7 | 10.2 | 0.45 |
| TB-YOLOv8 (Full: +C2f_T+B+W) | 53.6 | 40.6 | 43.1 | 10.4 | 0.46 |

The analysis is clear. Replacing PAFPN with BiFPN (B) immediately boosts mAP by 1.3% while simultaneously reducing parameters by 0.9M, confirming BiFPN’s efficiency and effectiveness for UAV drone multi-scale fusion. Adding the WIoUv3 loss (B+W) further improves mAP by 0.3% without changing the parameter count, demonstrating its role in better optimization. Finally, incorporating the C2f_T module completes our model, delivering the most significant single gain of 1.4% in mAP with only a marginal 0.2M parameter increase. The full TB-YOLOv8 model achieves a final mAP@0.5 of 43.1%, which is a 3.0% absolute improvement over the baseline YOLOv8s, with a net reduction of 0.7M parameters (6.31% lighter). The F1-Score also improved from 0.43 to 0.46, indicating a better balance between precision and recall.

3.3. Comparison with State-of-the-Art Detectors

To firmly establish the superiority of TB-YOLOv8, we compared it against several popular and recent object detection models on the VisDrone2019 validation set. The results are presented in the following comprehensive table.

| Model | mAP@0.5 (%) | Params (M) | F1-Score | FPS |
|---|---|---|---|---|
| Faster R-CNN | 37.6 | 28.3 | 0.44 | 15 |
| RT-DETR | 29.1 | 32.8 | 0.33 | — |
| YOLOv5s | 34.0 | 7.2 | 0.39 | 106.5 |
| YOLOv7-tiny | 37.2 | 6.0 | 0.43 | 76.8 |
| YOLOv8s (Baseline) | 40.1 | 11.1 | 0.43 | 188.7 |
| YOLOv9s | 40.3 | 7.1 | 0.44 | 136.2 |
| YOLOv11s | 39.3 | 9.4 | 0.43 | 193.4 |
| TB-YOLOv8 (Ours) | 43.1 | 10.4 | 0.46 | 164.1 |

TB-YOLOv8 achieves the highest mAP@0.5 and F1-Score among all compared models. It outperforms the baseline YOLOv8s by 3.0%, the latest YOLOv9s by 2.8%, and YOLOv11s by 3.8%. Notably, it surpasses the two-stage Faster R-CNN by a large margin of 5.5% while being significantly faster and lighter. Although RT-DETR is a powerful transformer-based detector, it appears less suited to this specific UAV drone dataset in this configuration. In terms of speed, TB-YOLOv8 maintains a very high inference rate of 164.1 FPS, which is more than adequate for real-time processing on a UAV drone ground control station or an onboard computing unit. This combination of top-tier accuracy, manageable parameter count, and high frame rate makes TB-YOLOv8 a particularly compelling choice for practical UAV drone applications.

3.4. Visualization and Qualitative Analysis

Qualitative results further illustrate the strengths of TB-YOLOv8. Visualization techniques like Grad-CAM show that our model activates more precisely on target objects compared to the baseline, especially for small and dense groups. In side-by-side detection comparisons on complex scenes, TB-YOLOv8 consistently detects more small pedestrians and vehicles in crowded areas, exhibits fewer false positives on background structures, and provides more accurate bounding boxes. It also demonstrates a remarkable ability to identify objects that were not fully annotated in the original dataset, highlighting its strong generalization capability for real-world UAV drone deployment where perfect annotations are unavailable.

4. Conclusion and Future Work

In this work, we presented TB-YOLOv8, a high-performance object detection algorithm specifically optimized for the challenges inherent in UAV drone aerial imagery. By integrating a Triplet Attention-enhanced C2f_T module for better feature discrimination, a lightweight BiFPN for efficient multi-scale fusion, and the WIoUv3 loss for robust bounding box regression, we have created a model that significantly advances the state-of-the-art on the VisDrone2019 benchmark. TB-YOLOv8 achieves a superior balance, offering a 3.0% mAP improvement over the strong YOLOv8s baseline while simultaneously reducing the model size by 6.31% and maintaining a real-time inference speed exceeding 160 FPS.

These attributes—high accuracy, model efficiency, and fast inference—are precisely what is required for effective deployment in real-world UAV drone systems, whether for autonomous surveillance, traffic monitoring, or search and rescue operations. The design principles of our approach, focusing on enhanced attention, optimized feature pyramid design, and dynamic loss functions, provide a valuable framework for future research in aerial vision tasks.

Future work may involve exploring the integration of vision transformers or other advanced attention mechanisms in the backbone, further optimizing the architecture for specific edge devices used on UAV drones, and extending the model to handle additional tasks such as instance segmentation or tracking in aerial video streams. The continuous evolution of algorithms like TB-YOLOv8 will be crucial in unlocking the full potential of UAV drone technology across an ever-expanding range of applications.
