FST-RTDETR: An Enhanced Real-Time Detector for Small Objects in UAV Drone Imagery

In recent years, the application of unmanned aerial vehicles (UAVs) has expanded significantly across numerous fields, including agricultural monitoring, forestry surveys, traffic management, and security patrols. This widespread adoption is largely due to their flexibility, cost-effectiveness, and ability to operate in complex environments. UAV drones provide crucial data support for real-time monitoring and analysis, with their aerial imagery being a fundamental data source. However, detecting small objects within this imagery remains a significant challenge in computer vision. Small targets, such as distant pedestrians or vehicles, often occupy only a few dozen pixels, leading to weak and indistinct features. Furthermore, UAV drone images frequently contain complex, cluttered backgrounds—like urban landscapes with dense buildings, shadows, and vegetation—which readily interfere with detection, resulting in high rates of false positives and missed detections. Many existing detection algorithms struggle to balance high accuracy with the real-time processing demands essential for practical UAV drone operations.

To address these challenges, we propose FST-RTDETR, an enhanced small object detection algorithm based on the RT-DETR framework. Our method introduces key innovations in the backbone network, feature fusion architecture, and loss function to specifically improve performance on UAV drone imagery. The core of our approach involves three major modifications: redesigning the basic building block with a combination of FasterNet and an Efficient Multi-scale Attention mechanism (FasterNet-EMA), constructing an efficient small-object detection layer to enrich feature representation without excessive computational cost, and employing an advanced bounding box regression loss (Inner-MPDIoU) for more precise localization. Experimental results on the challenging VisDrone2019 dataset demonstrate that FST-RTDETR achieves superior accuracy compared to the baseline and other state-of-the-art models, maintaining efficiency suitable for real-time applications in UAV drone systems.

1. Methodology

1.1 Foundation: The RT-DETR Framework

RT-DETR is a real-time, end-to-end object detection transformer. Its primary advantage lies in eliminating the need for Non-Maximum Suppression in the post-processing stage, which simplifies the pipeline and reduces latency. The standard architecture consists of three main components: a backbone network for multi-scale feature extraction, a hybrid encoder for intra-scale and cross-scale feature interaction, and a transformer-based decoder for final prediction. For UAV drone image analysis, the default feature scales (S3, S4, S5 corresponding to 8x, 16x, and 32x downsampling) can lose fine-grained details crucial for small objects. Our work builds upon this framework to enhance its sensitivity and accuracy for small target detection in aerial scenes.
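For concreteness, the scale arithmetic can be sketched as follows (the helper name is ours, not part of RT-DETR):

```python
def scale_sizes(input_size: int = 640, strides=(8, 16, 32)) -> dict:
    """Side length of each square feature map for a square input.
    S3/S4/S5 correspond to 8x/16x/32x downsampling in RT-DETR."""
    return {f"S{i}": input_size // s for i, s in zip((3, 4, 5), strides)}

print(scale_sizes())  # {'S3': 80, 'S4': 40, 'S5': 20}
```

At S5, each cell of the 20×20 map covers a 32-pixel stride of the input, which is why targets spanning only a few dozen pixels can effectively vanish at the coarser scales.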

1.2 The Proposed FST-RTDETR Architecture

The overall architecture of our proposed FST-RTDETR model is illustrated below. We systematically improve upon the original RT-DETR by modifying its core modules to be more effective for the task of small object detection in UAV drone footage.

1.3 The FasterNet-EMA Backbone Module

The first enhancement targets the feature extraction backbone. We integrate the lightweight and fast FasterNet block with the Efficient Multi-scale Attention (EMA) mechanism to create a new basic module, FasterNet-EMA. The standard FasterNet block employs Partial Convolution (PConv), which applies regular convolution only to a subset of input channels, drastically reducing computational cost and memory access while maintaining representative power. Its structure can be represented as a sequence of operations: PConv, followed by two point-wise convolutions with activation and normalization layers, arranged in an inverted residual design.
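A minimal PyTorch sketch of PConv, assuming the common n_div = 4 split from the FasterNet paper (the class name is ours):

```python
import torch
import torch.nn as nn

class PartialConv(nn.Module):
    """Partial Convolution (PConv): apply a 3x3 conv to only the first
    dim // n_div channels and pass the remaining channels through untouched,
    cutting FLOPs and memory access versus a full convolution."""
    def __init__(self, dim: int, n_div: int = 4):
        super().__init__()
        self.dim_conv = dim // n_div              # channels that get convolved
        self.dim_untouched = dim - self.dim_conv  # channels passed through as-is
        self.conv = nn.Conv2d(self.dim_conv, self.dim_conv, 3, padding=1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x1, x2 = torch.split(x, [self.dim_conv, self.dim_untouched], dim=1)
        return torch.cat([self.conv(x1), x2], dim=1)
```

With n_div = 4, only a quarter of the channels are convolved, so the 3×3 convolution costs roughly 1/16 of the FLOPs of its dense counterpart.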

The EMA attention module is designed to capture cross-spatial information efficiently without channel dimensionality reduction that causes information loss. It reshapes part of the channels into the batch dimension and groups them, distributing spatial semantic features evenly. The module features two parallel branches: a 1×1 convolutional branch and a 3×3 convolutional branch operating in parallel, enabling the aggregation of multi-scale spatial information.
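A PyTorch sketch of EMA, closely following the structure of the publicly released reference implementation (the `factor` argument sets the number of channel groups); treat it as illustrative rather than the exact code used here:

```python
import torch
import torch.nn as nn

class EMA(nn.Module):
    """Efficient Multi-scale Attention: channels are folded into groups,
    a 1x1 branch encodes directional (H/W) context and a 3x3 branch adds
    multi-scale spatial context; cross-spatial matmuls fuse the two."""
    def __init__(self, channels: int, factor: int = 8):
        super().__init__()
        self.groups = factor
        self.softmax = nn.Softmax(-1)
        self.agp = nn.AdaptiveAvgPool2d((1, 1))
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))   # pool along width
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))   # pool along height
        self.gn = nn.GroupNorm(channels // factor, channels // factor)
        self.conv1x1 = nn.Conv2d(channels // factor, channels // factor, 1)
        self.conv3x3 = nn.Conv2d(channels // factor, channels // factor, 3, padding=1)

    def forward(self, x):
        b, c, h, w = x.size()
        group_x = x.reshape(b * self.groups, -1, h, w)  # fold groups into batch
        # 1x1 branch: directional pooling, shared 1x1 conv, sigmoid gating
        x_h = self.pool_h(group_x)
        x_w = self.pool_w(group_x).permute(0, 1, 3, 2)
        hw = self.conv1x1(torch.cat([x_h, x_w], dim=2))
        x_h, x_w = torch.split(hw, [h, w], dim=2)
        x1 = self.gn(group_x * x_h.sigmoid() * x_w.permute(0, 1, 3, 2).sigmoid())
        # 3x3 branch
        x2 = self.conv3x3(group_x)
        # cross-spatial learning: each branch attends over the other's map
        x11 = self.softmax(self.agp(x1).reshape(b * self.groups, -1, 1).permute(0, 2, 1))
        x12 = x2.reshape(b * self.groups, c // self.groups, -1)
        x21 = self.softmax(self.agp(x2).reshape(b * self.groups, -1, 1).permute(0, 2, 1))
        x22 = x1.reshape(b * self.groups, c // self.groups, -1)
        weights = (torch.matmul(x11, x12) + torch.matmul(x21, x22)).reshape(
            b * self.groups, 1, h, w)
        return (group_x * weights.sigmoid()).reshape(b, c, h, w)
```

Because the grouping reshapes channels into the batch dimension rather than projecting them down, no channel information is discarded, which is the property the text highlights.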

By fusing these two components, the FasterNet-EMA module maintains high computational efficiency from FasterNet while gaining the multi-scale contextual awareness from EMA. This is particularly beneficial for UAV drone images where objects appear at various scales and amidst complex textures. The integrated operation can be summarized by the following processing flow within the block:

$$X' = \text{PConv}(X)$$
$$X'' = \text{Conv}_{1\times1}(\text{Act}(\text{Conv}_{1\times1}(X'))) + \text{Conv}_{3\times3}(X')$$
$$X_{\text{out}} = \text{EMA}(X'') + X$$

where $X$ is the input feature map, $\text{Act}$ denotes the activation function, and $\text{EMA}$ represents the attention weighting operation from the EMA module.
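The flow above can be sketched in PyTorch as follows; `attn` is a stand-in slot for the EMA module (an identity mapping here so the sketch stays self-contained), and the expansion ratio and activation are assumptions on our part:

```python
import torch
import torch.nn as nn

class FasterNetEMABlock(nn.Module):
    """Sketch of the FasterNet-EMA block, mirroring the equations:
    X' = PConv(X); X'' = Conv1x1(Act(Conv1x1(X'))) + Conv3x3(X');
    X_out = Attn(X'') + X."""
    def __init__(self, dim: int, n_div: int = 4, expand: int = 2, attn=None):
        super().__init__()
        self.dim_conv = dim // n_div
        self.pconv = nn.Conv2d(self.dim_conv, self.dim_conv, 3, padding=1, bias=False)
        hidden = dim * expand                       # inverted-residual expansion
        self.pw1 = nn.Conv2d(dim, hidden, 1, bias=False)
        self.bn = nn.BatchNorm2d(hidden)
        self.act = nn.ReLU(inplace=True)
        self.pw2 = nn.Conv2d(hidden, dim, 1, bias=False)
        self.conv3 = nn.Conv2d(dim, dim, 3, padding=1, bias=False)
        self.attn = attn if attn is not None else nn.Identity()  # EMA goes here

    def forward(self, x):
        # X': partial convolution on the first dim_conv channels only
        x1, x2 = torch.split(x, [self.dim_conv, x.size(1) - self.dim_conv], dim=1)
        xp = torch.cat([self.pconv(x1), x2], dim=1)
        # X'': point-wise bottleneck plus a parallel 3x3 branch
        xpp = self.pw2(self.act(self.bn(self.pw1(xp)))) + self.conv3(xp)
        # X_out: attention re-weighting plus the block-level residual
        return self.attn(xpp) + x
```

In the full model the `attn` slot would hold the EMA module, so that attention is applied to the fused features before the residual connection, exactly as in the third equation.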

1.4 Efficient Small-Object Detection Layer

Introducing a high-resolution feature layer (e.g., P2, from 4x downsampling) is a common strategy to improve small object detection. However, naively adding P2 to the detection head drastically increases computational load and post-processing time. To mitigate this, we propose an efficient small-object detection layer based on an improved Cross-Scale Feature Fusion Module architecture.

Instead of directly using the P2 layer for prediction, we process it through a Space-to-Depth Convolution module. SPDConv transforms the spatial information into depth (channel) information, effectively preserving fine details that are vital for small objects in UAV drone images. This enriched feature representation is then fused with the P3 layer. Subsequently, we employ a feature integration module built upon the CSP and Omni-Kernel concepts, termed CSP-OmniKernel. This module uses three parallel branches—global, large-scale, and local—to collaboratively learn features from global context down to local details. The Omni-Kernel employs a dynamic, learnable convolutional kernel that adapts to various spatial patterns. The fusion process for feature level $P3$ enhanced by $P2$ can be conceptually described as:

$$F_{\text{P2\_rich}} = \text{SPDConv}(P2)$$
$$F_{\text{enhanced}} = \text{Concat}(P3, F_{\text{P2\_rich}})$$
$$P3_{\text{out}} = \text{CSP-OmniKernel}(F_{\text{enhanced}})$$

This design allows the network to leverage high-resolution information for small objects without the prohibitive cost of running the full detection pipeline on the P2 layer, making it highly suitable for real-time UAV drone applications.
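A hedged sketch of this fusion path, with `PixelUnshuffle` realizing the space-to-depth rearrangement and a plain 1×1 convolution as a placeholder for CSP-OmniKernel (channel widths are illustrative, not taken from the paper):

```python
import torch
import torch.nn as nn

class SPDConv(nn.Module):
    """Space-to-Depth convolution sketch: rearrange each 2x2 spatial
    neighborhood into the channel dimension (no information is discarded,
    unlike strided convolution or pooling), then mix channels with a conv."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.s2d = nn.PixelUnshuffle(2)   # (B, C, H, W) -> (B, 4C, H/2, W/2)
        self.conv = nn.Conv2d(4 * in_ch, out_ch, 3, padding=1, bias=False)

    def forward(self, x):
        return self.conv(self.s2d(x))

# Fusion of the detail-rich P2 features into P3, mirroring the equations above;
# csp_omnikernel is only a placeholder to keep the sketch runnable.
p2 = torch.randn(1, 128, 160, 160)              # 4x downsampling of a 640x640 input
p3 = torch.randn(1, 256, 80, 80)                # 8x downsampling
f_p2_rich = SPDConv(128, 256)(p2)               # spatial size now matches P3
f_enhanced = torch.cat([p3, f_p2_rich], dim=1)  # Concat(P3, F_P2_rich)
csp_omnikernel = nn.Conv2d(512, 256, 1)         # placeholder for CSP-OmniKernel
p3_out = csp_omnikernel(f_enhanced)
```

The key point is that P2 never reaches the decoder at full 160×160 resolution; its detail is folded into channels and consumed at the P3 scale, which is where the computational savings in Table 1 come from.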

1.5 Inner-MPDIoU Loss Function

Precise bounding box regression is critical, especially for small objects where a slight deviation represents a large relative error. The original RT-DETR uses Generalized IoU loss. While GIoU improves upon IoU by considering the minimum enclosing box, it has limitations, particularly when boxes are small or have specific positional relationships. We propose to use Inner-MPDIoU, a combination of MPDIoU and Inner-IoU, as a more effective regression loss.

MPDIoU loss minimizes the distance between the top-left and bottom-right corners of the predicted and ground-truth boxes. It is defined as:

$$L_{\text{MPDIoU}} = 1 - \text{IoU} + \frac{d_{1}^2}{h^2 + w^2} + \frac{d_{2}^2}{h^2 + w^2}$$

where $d_1$ and $d_2$ are the distances between the two pairs of corresponding corners, and $h$ and $w$ are the height and width of the input image.

The Inner-IoU strategy introduces auxiliary bounding boxes scaled by a factor $\text{ratio} \in [0.5, 1.5]$ to calculate the IoU. For $\text{ratio} < 1$, it uses a smaller auxiliary box, accelerating the convergence of high-quality samples. For $\text{ratio} > 1$, it uses a larger auxiliary box, benefiting the regression of low-IoU samples. The Intersection-over-Union is calculated between these auxiliary boxes:

$$\text{IoU}_{\text{inner}} = \frac{\text{Intersection}_{\text{inner}}}{\text{Union}_{\text{inner}}}$$

By integrating this idea into MPDIoU, the Inner-MPDIoU loss provides a more comprehensive metric that simplifies calculation, improves regression efficiency and accuracy, and offers better supervision for small object localization in UAV drone images.
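A sketch of the combined loss under the formulation above, for corner-format $(x_1, y_1, x_2, y_2)$ boxes. The choice to compute the IoU term on the ratio-scaled auxiliary boxes while keeping the corner-distance penalties on the original boxes, normalized by the input diagonal, follows the text; the function name and defaults are ours:

```python
import torch

def inner_mpdiou_loss(pred, target, ratio=1.0, img_w=640, img_h=640, eps=1e-7):
    """Inner-MPDIoU sketch: IoU over center-aligned auxiliary boxes scaled
    by `ratio`, plus MPDIoU corner-distance penalties on the original boxes."""
    def scaled(box, r):
        # Rescale width/height around the box center by factor r.
        cx, cy = (box[..., 0] + box[..., 2]) / 2, (box[..., 1] + box[..., 3]) / 2
        w, h = (box[..., 2] - box[..., 0]) * r, (box[..., 3] - box[..., 1]) * r
        return torch.stack([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], dim=-1)

    # Inner-IoU between the auxiliary boxes
    p, t = scaled(pred, ratio), scaled(target, ratio)
    lt = torch.maximum(p[..., :2], t[..., :2])
    rb = torch.minimum(p[..., 2:], t[..., 2:])
    wh = (rb - lt).clamp(min=0)
    inter = wh[..., 0] * wh[..., 1]
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    iou_inner = inter / (area(p) + area(t) - inter + eps)

    # MPDIoU penalties: squared distances of top-left and bottom-right corners,
    # normalized by the squared image diagonal (h^2 + w^2 in the equation)
    d1 = (pred[..., 0] - target[..., 0]) ** 2 + (pred[..., 1] - target[..., 1]) ** 2
    d2 = (pred[..., 2] - target[..., 2]) ** 2 + (pred[..., 3] - target[..., 3]) ** 2
    diag = img_w ** 2 + img_h ** 2
    return 1 - iou_inner + d1 / diag + d2 / diag
```

For perfectly overlapping boxes the loss is zero, and for disjoint boxes it exceeds one by the corner-distance penalties, so the gradient remains informative even when the boxes do not overlap at all.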

2. Experiments and Results

2.1 Experimental Setup

We evaluate our proposed FST-RTDETR model on the VisDrone2019 dataset, a large-scale benchmark collected by UAV drones across various Chinese cities. It contains over 2.6 million annotated instances in challenging urban and rural settings. We use the standard split: 6,471 images for training, 548 for validation, and 1,610 for testing. The model is trained for 200 epochs with an input resolution of $640 \times 640$. Standard evaluation metrics include mean Average Precision at IoU threshold 0.5 (mAP@50), Precision (P), Recall (R), F1-score, and computational complexity in GFLOPs.

2.2 Computational Efficiency of the Small-Object Detection Layer

We first analyze the computational impact of our proposed small-object detection layer compared to naively adding a P2 detection head. The results, summarized in Table 1, show that our efficient design introduces only a minimal increase in GFLOPs over the baseline RT-DETR, while a traditional P2 layer causes a substantial computational surge. This validates the efficiency of our SPDConv and CSP-OmniKernel based approach for UAV drone image processing.

| Model | GFLOPs | Params (M) |
| --- | --- | --- |
| RT-DETR-r18 (Baseline) | 57.0 | 19.9 |
| RT-DETR-r18 + Traditional P2 Layer | 81.7 | 18.9 |
| RT-DETR + Our Small-Object Layer | 58.6 | 20.1 |
| FST-RTDETR (Ours) | 59.7 | 17.5 |

2.3 Ablation Study

Ablation experiments are conducted to validate the contribution of each proposed component. The results are presented in Table 2. Experiment A shows that replacing the basic backbone block with FasterNet-EMA already improves mAP@50 while reducing parameters and GFLOPs. Experiment B demonstrates that our small-object detection layer alone brings a significant mAP gain of 1.3%. Experiment C confirms the positive effect of the Inner-MPDIoU loss. Experiment D, which combines the FasterNet-EMA backbone with the small-object detection layer, reaches 49.0%. The full model (Experiment E), integrating all three improvements, achieves the highest mAP@50 of 49.6%, a 2.1% absolute improvement over the strong RT-DETR baseline, with fewer parameters and only a modest increase in computation. This demonstrates the synergistic effectiveness of our innovations for UAV drone-based detection.

| Exp. | FasterNet-EMA | Small-Object Layer | Inner-MPDIoU | mAP@50 (%) | Params (M) | GFLOPs |
| --- | --- | --- | --- | --- | --- | --- |
| Baseline | | | | 47.5 | 19.9 | 57.0 |
| A | ✓ | | | 48.1 | 16.9 | 51.5 |
| B | | ✓ | | 48.8 | 20.5 | 65.2 |
| C | | | ✓ | 48.1 | 19.9 | 57.0 |
| D | ✓ | ✓ | | 49.0 | 17.5 | 59.7 |
| E (Ours) | ✓ | ✓ | ✓ | 49.6 | 17.5 | 59.7 |

2.4 Per-Category Detection Accuracy

Table 3 provides a detailed breakdown of detection performance across all ten object categories in the VisDrone dataset. FST-RTDETR outperforms the baseline RT-DETR model in every single category. The improvements are especially notable for challenging categories such as "tricycle" (+3.8%), "awning-tricycle" (+2.8%), and "bus" (+4.8%). This consistent gain across categories, from pedestrians to vehicles, underscores the robustness and general effectiveness of our proposed modifications for object detection in UAV drone imagery.

| Category | RT-DETR mAP@50 (%) | FST-RTDETR mAP@50 (%) | Improvement |
| --- | --- | --- | --- |
| pedestrian | 55.8 | 57.5 | +1.7 |
| people | 48.3 | 51.0 | +2.7 |
| bicycle | 21.0 | 21.5 | +0.5 |
| car | 85.6 | 86.4 | +0.8 |
| van | 50.5 | 51.6 | +1.1 |
| truck | 38.8 | 40.6 | +1.8 |
| tricycle | 33.7 | 37.5 | +3.8 |
| awning-tricycle | 18.6 | 21.4 | +2.8 |
| bus | 62.7 | 67.5 | +4.8 |
| motor | 59.9 | 61.0 | +1.1 |
| All (mAP@50) | 47.5 | 49.6 | +2.1 |

2.5 Comparison with State-of-the-Art Models

We compare FST-RTDETR against several prominent object detection models, including both the YOLO family and other improved RT-DETR variants, on the VisDrone2019 dataset. The overall mAP@50 results are summarized in Table 4. Our proposed FST-RTDETR achieves the highest overall mAP@50 of 49.6%, surpassing Efficient YOLOv9 (48.7%), EBC-YOLO (44.3%), Stff-RTDETR (39.6%), and ESO-DETR (41.0%). This comparative analysis highlights the superior performance of our method for the specific and challenging task of detecting objects in UAV drone-captured imagery.

| Model | mAP@50 (%) |
| --- | --- |
| Faster R-CNN | 21.7 |
| YOLOv5 | 33.5 |
| YOLOv6 | 29.7 |
| YOLOv8 | 34.8 |
| YOLOv11 | 34.4 |
| RT-DETR-r18 | 47.5 |
| Efficient YOLOv9 | 48.7 |
| EBC-YOLO | 44.3 |
| Stff-RTDETR | 39.6 |
| ESO-DETR | 41.0 |
| FST-RTDETR (Ours) | 49.6 |

2.6 Qualitative Results and Visualization

Visual comparisons on complex UAV drone scenes further demonstrate the advantages of FST-RTDETR. In densely packed small-object scenes, our model successfully detects many instances (e.g., awning-tricycles and motorcycles) that the baseline RT-DETR misses, significantly reducing the missed-detection rate. In high-altitude, top-down (bird's-eye) scenes, FST-RTDETR shows greater robustness, avoiding false positives where the baseline model mistakenly identifies ground patches as pedestrians. In low-light and nighttime UAV drone footage, our model maintains reliable detection across objects of varying sizes with a lower probability of error. The visual evidence aligns with the quantitative metrics, confirming that our enhancements effectively address key challenges in UAV drone-based aerial image analysis.

3. Conclusion

This paper presents FST-RTDETR, an enhanced real-time detection transformer algorithm tailored for small object detection in UAV drone aerial imagery. To tackle the core challenges of weak features, background clutter, and the accuracy-speed trade-off, we introduced three strategic improvements. The FasterNet-EMA backbone module accelerates feature extraction while enhancing multi-scale perception. The efficient small-object detection layer, built with SPDConv and CSP-OmniKernel, enriches feature representation for small targets without incurring prohibitive computational costs typical of high-resolution detection heads. The Inner-MPDIoU loss function provides more precise and efficient bounding box regression.

Comprehensive experiments on the challenging VisDrone2019 dataset validate the effectiveness of each component and the integrated model. FST-RTDETR achieves state-of-the-art performance with an mAP@50 of 49.6%, outperforming the baseline RT-DETR and other contemporary models. It also maintains a lean parameter count and computational footprint suitable for real-time applications on UAV drone platforms. While the model demonstrates strong performance, future work will focus on enhancing its robustness in even more extreme and varied operational scenarios encountered by UAV drones, and on further optimizing the architecture for deployment on edge computing devices carried by drones.
