Enhanced RT-DETR for Small Object Detection in UAV Imagery

In recent years, rapid advances in artificial intelligence and aerial photography have propelled unmanned aerial vehicles (UAVs, or drones) into widespread use across fields such as disaster response, field monitoring, urban surveillance, and military reconnaissance. Object detection, a core task within UAV visual perception systems, enables drones to swiftly identify and locate critical objects in complex environments, and is therefore central to advancing drone intelligence. Small object detection, a crucial branch of object detection, has garnered increasing attention. Small objects are typically those occupying only a small region of an image or video frame, exhibiting low contrast and few detailed features. Because of the high flight altitude and expansive field of view of UAVs, the captured imagery presents significant challenges: a high proportion of small objects, complex and varied backgrounds, dense object distributions, and severe occlusions. General-purpose object detection methods often perform poorly on such UAV imagery, producing frequent missed and false detections. This makes small object detection in UAV contexts a key challenge within the current field of computer vision.

Traditional object detection methods are generally categorized into two-stage and single-stage algorithms. Two-stage methods, like Faster R-CNN, first generate region proposals and then classify each proposed region, offering high accuracy but often at the cost of speed, making them less suitable for real-time applications. To address this performance bottleneck, single-stage methods such as YOLO and SSD emerged, which unify proposal generation and classification into a single, end-to-end stage, significantly boosting detection speed. Nonetheless, these CNN-based detectors are fundamentally limited by their local receptive fields, struggling to capture long-range contextual dependencies effectively, which can hinder performance in complex scenes.

The recent rise of Transformer-based vision models has introduced a new paradigm. With their self-attention mechanism, Transformers excel at modeling global dependencies, overcoming a key limitation of CNNs. The pioneering DETR framework applied this architecture to object detection, removing the need for hand-crafted components like Non-Maximum Suppression (NMS) and anchor boxes. However, its vanilla global attention mechanism suffers from slow convergence and inefficiency in handling multi-scale objects, particularly small ones. Several variants have been proposed to mitigate these issues. Deformable DETR introduced deformable attention to focus on sparse key points, improving efficiency and small object detection. Swin Transformer proposed a hierarchical architecture with shifted windows to efficiently model at various scales. Building upon these advances, Baidu’s RT-DETR was introduced as a real-time, end-to-end detector that effectively marries the local perceptual strengths of CNN backbones with the global modeling capacity of a Transformer encoder, achieving a favorable balance between accuracy and speed.

Despite RT-DETR’s effective trade-off, its performance in UAV-based small object detection scenarios can still be limited. Primary shortcomings include insufficient extraction of fine-grained details from shallow features, high computational complexity in feature interaction, and inadequate multi-scale fusion that often neglects small objects. These limitations impede the model’s accuracy and robustness when detecting small, densely packed, or occluded targets typical in drone-captured imagery. To address these challenges, this paper proposes an enhanced RT-DETR algorithm tailored for small object detection in UAV images. The main contributions of this work are threefold:

(1) Enhanced Feature Representation for Small Objects: To counter the problem of weak shallow feature extraction, a novel feature enhancement module is designed, integrating detailed shallow perception with high-level semantic information to improve the feature representation of small targets captured by the UAV.

(2) Optimized Feature Interaction Structure: A structurally simple, highly parallel dual-attention feature interaction mechanism is introduced to replace the original, more complex module. This maintains global modeling capability while improving inference efficiency, crucial for processing streaming data from a drone.

(3) Improved Multi-Scale Feature Fusion: To prevent small object features from being diluted during fusion, an efficient multi-scale feature fusion module is constructed. It better integrates features from different hierarchical levels, thereby boosting detection performance for small objects in multi-scale UAV scenes.

Algorithm Design

The proposed algorithm builds upon the RT-DETR architecture, introducing targeted enhancements across its backbone, encoder, and feature fusion pathways to better suit the demands of UAV image analysis. The overall architecture is illustrated below, highlighting the key modifications.

Feature Enhancement Module (C2f_SMT)

The original RT-DETR typically employs a ResNet backbone. While effective, its feature extraction for small, complex objects in UAV imagery can be suboptimal. To bolster feature extraction capability, we first adopt the C2f module design from YOLOv8 as a foundation for better cross-scale feature interaction. Crucially, we propose a novel C2f Synergistic Multi-Attention Transformer (C2f_SMT) module. This module replaces the standard Bottleneck units inside C2f with our designed SMAFormerBlock. Unlike convolution-based bottlenecks, the SMAFormerBlock incorporates a Synergistic Multi-Attention (SMA) mechanism combined with a Transformer-style feed-forward network within a lightweight C2f framework, achieving an organic integration of local convolution and cross-domain attention. This significantly enhances the synergistic global-local modeling capability of high-level semantic feature maps (e.g., the S5 stage), which is vital for small object detection.

The core innovation lies in the SMA mechanism. Traditional Multi-Head Self-Attention (MHSA) models global dependencies well but often distributes attention uniformly, lacking focus on small targets and locally salient regions, and missing synergistic enhancement across feature dimensions. The SMA mechanism explicitly establishes multi-scale, multi-dimensional feature dependencies through the cooperative operation of three parallel branches: Pixel Attention, Channel Attention, and Spatial Attention. The outputs from these branches are fused, allowing multi-dimensional features to interact and thereby integrating local and global information. This enhances the model's perception of, and saliency response to, small targets against the complex backgrounds common in UAV footage.

Formally, let the input feature map be \(X\). After initial processing by the C2f module, we obtain an enhanced representation \(\tilde{X}\). This is then fed into the SMAFormerBlock. Within the block, the input undergoes layer normalization: \(X' = \text{LayerNorm}(\tilde{X})\). For attention computation, the SMA mechanism processes the feature through three parallel pathways:
$$A_p = \sigma(W_p X’) \quad \text{(Pixel Attention)}$$
$$A_c = \sigma(W_c X’) \quad \text{(Channel Attention)}$$
$$A_s = \text{Softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V \quad \text{(Spatial/Self-Attention)}$$
where \(Q, K, V\) are query, key, and value vectors derived from \(X'\), and \(d_k\) is the key dimension. The pixel and channel attention outputs are combined via matrix multiplication and further modulated by the spatial attention branch:
$$A_f = A_s \otimes (A_p \cdot A_c)$$
Subsequently, an MLP module with depth-wise convolutions captures local context:
$$Z = W_1 A_f, \quad Z' = \text{Conv}_d(\text{Conv}_p(\text{Reshape}(Z))), \quad Y' = W_2(\text{Reshape}^{-1}(Z'))$$
The full SMAFormerBlock forward pass is:
$$Y = \tilde{X} + \text{SMA}(\text{LayerNorm}(\tilde{X})), \quad Y_{\text{out}} = Y + \text{MLP}(\text{LayerNorm}(Y))$$
The C2f_SMT module thus enriches the final high-level feature map with richer contextual information and stronger target saliency, providing more robust semantic support for the subsequent detection of objects in UAV images.
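As a concrete reference, the spatial/self-attention branch of SMA is standard scaled dot-product attention, \(A_s = \text{Softmax}(QK^T/\sqrt{d_k})V\). The following minimal pure-Python sketch (a toy illustration on small matrices, not the batched tensor implementation used in the model) makes the computation explicit:

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of attention scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

def matmul(A, B):
    """Plain list-of-lists matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def spatial_attention(Q, K, V):
    """A_s = softmax(Q K^T / sqrt(d_k)) V, as in the SMA spatial branch."""
    d_k = len(K[0])
    scores = matmul(Q, [list(r) for r in zip(*K)])  # Q K^T
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return weights, matmul(weights, V)

# Toy example: 3 tokens with key dimension d_k = 2, scalar values.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
V = [[1.0], [2.0], [3.0]]
W, A_s = spatial_attention(Q, K, V)
```

Each row of the attention weights is a convex combination, so every output value lies within the range of the inputs in `V`; this is the sense in which the branch aggregates global context rather than inventing new responses.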

Dual-Attention Feature Interaction Mechanism (DAFI)

The original AIFI module in RT-DETR uses standard self-attention, which is computationally expensive and may lack focus on the local details critical for small UAV targets. To address this, we propose a Dual-Attention Feature Interaction (DAFI) mechanism. Inspired by efficient Transformer designs, DAFI employs a parallel lightweight structure combining Grouped Channel Self-Attention (G-CSA) and Masked Window Self-Attention (M-WSA), interspersed with Dilated Feed-Forward Networks (Dilated FFN).

DAFI operates in four stages. In the first stage, input features \(X\) are processed by G-CSA, where channels are divided into \(g\) groups (we set \(g=4\)) and attention is computed independently per group before concatenation:
$$Z_i = \text{Attention}(Q_i, K_i, V_i), \quad i=1,\dots,g; \quad Z = \text{Concat}(Z_1, Z_2, \dots, Z_g)$$
This reduces redundancy while preserving global modeling. In the second stage, a Dilated FFN expands the receptive field to encode broader local context using dilated convolutions (\(*_r\) denotes convolution with dilation rate \(r\)):
$$Z_{\text{dil1}} = \text{GeLU}(W_1 *_1 Z), \quad Z' = W_2 *_2 Z_{\text{dil1}}$$
We use a two-level dilation with rates \(r_1=1, r_2=2\). The third stage applies M-WSA, where the feature map is divided into non-overlapping \(7 \times 7\) windows. Within each window, self-attention is computed with a masking strategy that restricts query tokens to attend only to even spatial positions, reducing computation and cross-window interference:
$$Z_{win} = \text{Attention}(Q_{win}, K_{win}, V_{win}) + P_{rel}$$
Here, \(P_{rel}\) is relative position encoding. Finally, another Dilated FFN stage provides non-linear mapping:
$$F_{out} = \text{FFN}_{dilated}(Z_{win})$$
Through these complementary stages, DAFI maintains global modeling capability while significantly enhancing sensitivity to local details and reducing computational overhead, making it well suited to real-time processing of UAV video streams.
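The efficiency gain of the G-CSA stage comes from partitioning channels before attention: pairwise channel interactions scale as \(C^2\), so computing them within \(g\) groups of \(C/g\) channels costs \(g \cdot (C/g)^2 = C^2/g\). A minimal sketch of the split-attend-concatenate pattern (the `attend` callable stands in for the real per-group attention, which is not reproduced here):

```python
def split_channels(x, g):
    """Split a per-channel feature list into g equal groups (len(x) % g == 0)."""
    size = len(x) // g
    return [x[i * size:(i + 1) * size] for i in range(g)]

def grouped_channel_attention(x, g, attend):
    """Apply an attention function independently per group, then concatenate,
    mirroring Z = Concat(Z_1, ..., Z_g) in the G-CSA formulation."""
    return [c for grp in split_channels(x, g) for c in attend(grp)]

def pairwise_cost(channels):
    """Channel-attention interaction count scales as C^2."""
    return channels * channels

C, g = 16, 4
full_cost = pairwise_cost(C)              # 16 * 16 = 256 interactions
grouped_cost = g * pairwise_cost(C // g)  # 4 * (4 * 4) = 64 interactions

# With an identity stand-in for 'attend', grouping then concatenating
# preserves channel order, so the operation is a drop-in replacement:
x = list(range(C))
y = grouped_channel_attention(x, g, lambda grp: grp)
```

With \(g=4\) as in the paper, the interaction count drops by a factor of four while each group still attends globally over the spatial tokens it covers.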

Multi-Scale Feature Fusion Module (COSPFM)

The original Cross-Scale Feature Fusion Module (CCFM) in RT-DETR may insufficiently leverage the fine spatial details from very shallow layers, causing small object information to be lost during fusion. To tackle this, we propose a novel Cross-OmniKernel and Small-target Preservation Feature Fusion Module (COSPFM). A key innovation is the explicit incorporation and enhancement of a very low-level feature layer (e.g., S2).

COSPFM first processes the S2 feature using an SPDConv (Space-to-Depth Convolution) structure. SPDConv preserves richer shallow details by replacing strided downsampling with a lossless space-to-depth rearrangement, sampling four interleaved spatial sub-grids to generate sub-features which are then concatenated along the channel dimension:
$$X’ = \text{concat}\left( X[::2, ::2], X[1::2, ::2], X[::2, 1::2], X[1::2, 1::2] \right)$$
This preserves detailed texture and structural information at high resolution. The enhanced feature \(X’\) is then fused with the subsequent S3 feature, ensuring small target information enters the multi-scale fusion path early and is maintained at higher resolution.
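The four-way slicing above can be made concrete with a short pure-Python sketch (operating on a single-channel map as nested lists rather than tensors; not the actual SPDConv implementation, which follows the rearrangement with a non-strided convolution):

```python
def space_to_depth(x):
    """Slice an H x W map (H, W even) into the four strided sub-maps
    X[::2, ::2], X[1::2, ::2], X[::2, 1::2], X[1::2, 1::2] and stack
    them channel-wise: spatial resolution halves, channel count
    quadruples, and no pixel is discarded."""
    return [
        [row[0::2] for row in x[0::2]],  # X[::2,  ::2]
        [row[0::2] for row in x[1::2]],  # X[1::2, ::2]
        [row[1::2] for row in x[0::2]],  # X[::2,  1::2]
        [row[1::2] for row in x[1::2]],  # X[1::2, 1::2]
    ]

# A 4x4 toy map: every value survives in exactly one 2x2 sub-map.
x = [[ 1,  2,  3,  4],
     [ 5,  6,  7,  8],
     [ 9, 10, 11, 12],
     [13, 14, 15, 16]]
subs = space_to_depth(x)
```

In contrast to strided convolution or pooling, which discard three of every four positions, each input pixel appears in exactly one output sub-map, which is why the operation retains the fine textures that small objects depend on.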

Secondly, COSPFM integrates a CSP-OmniKernel Module (COM). This module combines the CSP (Cross Stage Partial) design philosophy with a dual-branch OmniKernel structure. The OmniKernel branch itself employs a multi-branch design: one branch uses multi-scale depth-wise convolutions to capture local and large-context information, while another incorporates a Frequency Spatial Attention Module (FSAM) following a Dynamic Channel Attention Module (DCAM) to model global dependencies in the frequency domain. The outputs of these parallel branches are fused via element-wise addition. This allows the module to aggregate multi-scale and multi-domain (spatial and frequency) semantic information effectively, expanding the receptive field and enhancing feature discriminability for the small, low-contrast targets often found in UAV imagery. The final output of COM is obtained by channel-wise concatenation of its processed features with a shortcut connection, followed by a \(1\times1\) convolution.

By combining SPDConv for detail preservation and COM for sophisticated multi-scale fusion, COSPFM provides a more representative and robust set of multi-scale features to the detection head, specifically enhancing the model’s capability to detect small objects in complex UAV scenes.

Experiments and Analysis

Datasets and Experimental Setup

To validate the effectiveness and generalization ability of the proposed algorithm for UAV-based small object detection, experiments were primarily conducted on the VisDrone2019 dataset and tested on the UAVDT dataset. The VisDrone2019 dataset is a large-scale benchmark collected under diverse conditions across 14 Chinese cities, containing ten object categories like pedestrian, car, and truck. Notably, approximately 88% of objects are smaller than \(32 \times 32\) pixels, and about 30% are occluded, posing significant challenges. We used the training set (6,471 images), validation set (548 images), and test set (1,610 images) as per the official split. The UAVDT dataset, focused on traffic surveillance from a drone perspective, was used for cross-dataset evaluation to test generalization. It contains three categories (car, truck, bus) with 24,778 training and 15,598 validation images, also featuring a high proportion of small objects.

All experiments were performed on a system with an NVIDIA GeForce RTX 4090 GPU, using PyTorch 2.0.1. Training parameters were set as follows: 250 epochs, batch size of 8, AdamW optimizer with a base learning rate of 0.0001 and weight decay of 0.0001. Input images were resized to \(640 \times 640\) pixels.

Evaluation Metrics

We employ standard object detection metrics: Precision (P), Recall (R), mean Average Precision at IoU=0.5 (mAP@0.5), and COCO-style mAP averaged over IoU thresholds from 0.5 to 0.95 with a step of 0.05 (mAP@0.5:0.95). Model complexity is measured by parameters (Params), computational cost in Giga FLOPs (GFLOPs), and inference speed in Frames Per Second (FPS).

$$ \text{Precision} = \frac{TP}{TP + FP}, \quad \text{Recall} = \frac{TP}{TP + FN} $$
$$ \text{mAP@0.5} = \frac{1}{N} \sum_{i=1}^{N} \text{AP}_i^{\text{IoU}=0.5}, \quad \text{mAP@0.5:0.95} = \frac{1}{N} \sum_{i=1}^{N} \left( \frac{1}{10} \sum_{j=1}^{10} \text{AP}_i^{\text{IoU}=0.5+0.05(j-1)} \right) $$
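These metrics can be sketched directly from the definitions above. The counts and per-threshold AP values below are hypothetical, chosen only to illustrate the arithmetic:

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); Recall = TP / (TP + FN)."""
    return tp / (tp + fp), tp / (tp + fn)

def map_50_95(ap_per_iou):
    """COCO-style mAP: mean AP over the 10 IoU thresholds 0.50, 0.55, ..., 0.95."""
    assert len(ap_per_iou) == 10
    return sum(ap_per_iou) / 10

# Hypothetical detection counts for one class at IoU = 0.5:
p, r = precision_recall(tp=80, fp=20, fn=40)  # p = 0.8, r = 2/3

# Hypothetical per-threshold APs (AP typically falls as the IoU threshold tightens):
aps = [0.51, 0.48, 0.45, 0.41, 0.36, 0.30, 0.24, 0.17, 0.10, 0.03]
m = map_50_95(aps)  # 0.305
```

Note that mAP@0.5:0.95 is always at most mAP@0.5, since the IoU=0.5 term is the loosest of the ten thresholds being averaged; this matches the gap between the two columns in the results tables.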

Ablation Study

To evaluate the contribution of each proposed component, an ablation study was conducted on the VisDrone2019 validation set. The baseline is the standard RT-DETR model. The results are summarized in the table below.

| Config | C2f_SMT | DAFI | COSPFM | P(%) | R(%) | mAP@0.5(%) | mAP@0.5:0.95(%) | GFLOPs | Params (M) | FPS |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 (Baseline) | × | × | × | 61.5 | 46.3 | 47.9 | 29.3 | 57.0 | 19.9 | 60.0 |
| 2 | ✓ | × | × | 62.6 | 46.7 | 48.4 | 29.9 | 61.5 | 28.4 | 52.4 |
| 3 | × | ✓ | × | 60.1 | 46.1 | 47.5 | 28.7 | 58.7 | 21.8 | 66.5 |
| 4 | × | × | ✓ | 62.2 | 47.6 | 49.0 | 30.0 | 65.2 | 20.5 | 64.0 |
| 5 | ✓ | ✓ | × | 63.2 | 47.2 | 49.2 | 30.6 | 63.2 | 30.3 | 52.5 |
| 6 | ✓ | × | ✓ | 62.9 | 49.3 | 50.3 | 31.0 | 77.2 | 29.6 | 59.3 |
| 7 | × | ✓ | ✓ | 62.6 | 48.4 | 49.4 | 30.4 | 66.9 | 22.4 | 63.7 |
| 8 (Ours) | ✓ | ✓ | ✓ | 63.5 | 49.8 | 51.0 | 31.5 | 78.9 | 31.5 | 54.5 |

The results demonstrate the individual and synergistic effectiveness of each module. C2f_SMT enhances high-level semantic modeling, giving a modest boost. DAFI significantly improves inference speed (FPS) while maintaining accuracy. COSPFM provides the most substantial individual gain in mAP@0.5 (+1.1%) by improving multi-scale fusion. The combination of C2f_SMT and COSPFM (Config 6) yields strong results, and the full model (Config 8) achieves the best overall performance: mAP@0.5 of 51.0% and mAP@0.5:0.95 of 31.5%, representing gains of 3.1% and 2.2% over the baseline, respectively, while maintaining a real-time inference speed of 54.5 FPS suitable for UAV applications.

Comparative Experiments

The proposed model is compared against a wide range of state-of-the-art detectors on the VisDrone2019 test set, including two-stage (Faster R-CNN, Cascade R-CNN), single-stage (YOLO series, RetinaNet), and Transformer-based detectors (DDQ-DETR, RT-DETR variants). The results are presented in the following table.

| Model | P(%) | R(%) | mAP@0.5(%) | mAP@0.5:0.95(%) | GFLOPs | Params (M) | FPS |
|---|---|---|---|---|---|---|---|
| Faster R-CNN | 49.5 | 36.7 | 39.2 | 23.2 | 208.2 | 41.2 | 31.3 |
| Cascade R-CNN | 50.2 | 36.4 | 39.5 | 23.8 | 236.1 | 69.3 | 25.6 |
| RetinaNet | 47.7 | 38.1 | 36.9 | 22.1 | 210.0 | 36.4 | 33.6 |
| YOLOv5l | 55.9 | 41.6 | 43.3 | 26.4 | 134.6 | 53.1 | 72.8 |
| YOLOv8l | 55.0 | 42.7 | 44.2 | 27.1 | 164.9 | 43.5 | 69.2 |
| YOLOv10l | 55.3 | 42.5 | 44.3 | 27.6 | 126.4 | 25.7 | 61.5 |
| YOLOv11l | 55.6 | 42.4 | 44.1 | 27.3 | 86.7 | 25.4 | 63.2 |
| DDQ-DETR | 54.0 | 44.5 | 44.8 | 26.1 | – | – | 12.7 |
| RT-DETR (Baseline) | 61.5 | 46.3 | 47.9 | 29.3 | 57.0 | 19.9 | 60.0 |
| RTDETR-SDI | 54.6 | 47.5 | 49.8 | 30.3 | 52.4 | 34.2 | 40.6 |
| Drone-DETR | – | – | 50.4 | 31.1 | 68.8 | 19.1 | 40.0 |
| Ours | 63.5 | 49.8 | 51.0 | 31.5 | 78.9 | 31.5 | 54.5 |

The proposed method achieves the highest mAP@0.5 and mAP@0.5:0.95 among all compared models. It outperforms the baseline RT-DETR by a significant margin (3.1% in mAP@0.5) and also surpasses other recent RT-DETR improvements tailored for drone detection, such as Drone-DETR. Notably, it maintains a high inference speed (54.5 FPS), which is substantially faster than other Transformer-based models like DDQ-DETR (12.7 FPS) and is adequate for real-time analysis on a UAV platform.

To further verify generalization, the trained model was evaluated on the UAVDT dataset without fine-tuning. The results, focusing on per-category mAP@0.5, are shown below.

| Category | RT-DETR mAP@0.5 (%) | Ours mAP@0.5 (%) |
|---|---|---|
| car | 73.9 | 73.9 |
| truck | 7.8 | 10.5 |
| bus | 15.7 | 18.8 |
| All (mAP) | 32.5 | 34.4 |

The improved model demonstrates clear gains on the more challenging "truck" and "bus" categories, which often appear as smaller objects in UAVDT, while matching the baseline's high performance on the larger "car" category. This cross-dataset improvement confirms the enhanced generalization and robustness of our method for small object detection in diverse UAV scenarios.

Conclusion

This paper addresses key challenges in small object detection within UAV imagery, such as weak feature expression, inefficient feature interaction, and suboptimal multi-scale fusion. We propose an enhanced RT-DETR algorithm incorporating three novel modules: C2f_SMT for enriched feature representation, DAFI for efficient and effective global-local feature interaction, and COSPFM for detail-preserving multi-scale fusion. Extensive experiments on the VisDrone2019 dataset demonstrate that our model achieves state-of-the-art performance, with a mAP@0.5 of 51.0% and mAP@0.5:0.95 of 31.5%, while maintaining real-time inference speed. Successful cross-dataset evaluation on UAVDT further validates its strong generalization capability. Future work will focus on further lightweight optimization of the network structure to reduce computational cost, aiming to deploy an even more efficient model for real-time small object detection on resource-constrained UAV platforms.
