ZY-DETR: An Improved RT-DETR for Small Object Detection in UAV Remote Sensing Images

In recent years, drone technology has become a cornerstone of modern remote sensing applications, enabling efficient data acquisition for traffic monitoring, disaster response, smart city planning, and precision agriculture. The rapid deployment and high flexibility of unmanned aerial vehicles (UAVs) allow for capturing images with high spatial resolution but also introduce unique challenges, particularly in detecting small objects that occupy a tiny fraction of the image. Traditional object detection methods, originally designed for general-purpose tasks, often fail to balance accuracy, real-time performance, and lightweight deployment when applied to UAV-captured scenes.

To address these issues, we propose ZY-DETR (Zoomed-Yield Detection Transformer), a novel algorithm based on an improved RT-DETR architecture. Our work makes three core contributions: a heterogeneous backbone network for multi-scale feature refinement, an inter-scale fusion module with linear complexity, and a fine-grained detection head tailored for high-resolution shallow features. We describe the proposed method in detail and validate it extensively on the VisDrone2019 and DOTA datasets. The results demonstrate that ZY-DETR achieves state-of-the-art performance in small object detection under the constraints of real-time inference and limited computational resources, making it a promising solution for UAV-based remote sensing.

1. Introduction

The proliferation of drone technology has revolutionized the field of remote sensing, providing an unprecedented ability to observe the Earth’s surface from low altitudes with high temporal and spatial resolution. UAVs are now widely deployed in civilian and military domains, including real-time traffic surveillance, search and rescue operations, agricultural monitoring, and urban planning. However, the unique characteristics of UAV-borne imagery, such as significant scale variation, dense small objects, complex backgrounds, and high variability in illumination, pose severe challenges for conventional object detection algorithms.

Current mainstream solutions can be categorized into three families: convolutional neural network (CNN) based detectors, Transformer-based detectors (e.g., DETR variants), and multi-scale feature fusion networks. CNN-based detectors, exemplified by the YOLO series, have been extensively adapted for drone scenarios through lightweight modules and additional small-scale detection layers. Notable examples include Drone-YOLO and TPH-YOLOv5, which improve detection of tiny targets but still suffer from homogeneous backbone designs that inadequately handle the trade-off between shallow detail preservation and deep semantic refinement. On the other hand, Transformer-based detectors like RT-DETR achieve end-to-end detection and high inference speed, yet their multi-scale fusion mechanisms often lack the necessary adaptation for remote sensing images, leading to suboptimal small-object accuracy. Feature fusion networks such as FPN, PANet, and BiFPN enhance multi-scale representation but typically incur quadratic computational complexity and rely on indirect up-down sampling pipelines that cause irreversible loss of fine-grained information for small objects.

The fundamental limitations of existing approaches can be summarized as follows: (1) homogeneous backbone designs fail to simultaneously preserve shallow details for small objects and refine deep semantics for larger ones; (2) multi-scale fusion mechanisms are either computationally expensive or inefficient in handling extreme scale variations; (3) high-resolution shallow features are not fully exploited in a scene-specific manner. To overcome these challenges, we introduce ZY-DETR, which integrates three novel components: a heterogeneous backbone network (GradNet) that adopts different modules for shallow and deep layers, a token statistical self-attention (TSSA)-based inter-scale fusion module (IST-Fusion) that reduces fusion complexity to linear time, and a fine-grained detection head (FineGrained-Detect) that directly leverages original high-resolution features without destructive downsampling.

Our main contributions are as follows:

We design GradNet, a heterogeneous backbone network that combines C2f modules for shallow feature retention and C2f-CGSA modules for lightweight local-to-global feature refinement. With only 13.55M parameters, it significantly enhances feature expressiveness.
We propose the Token Statistical Self-Attention (TSSA) mechanism, which replaces the multi-head self-attention in the original RT-DETR’s AIFI module. TSSA achieves linear computational complexity while effectively compressing channel and spatial redundancy during multi-scale fusion.
We develop the FineGrained-Detect detection head, which directly connects high-resolution shallow features (320×320) without any downsampling, combines a lightweight channel attention gate with depthwise separable convolutions, and introduces scale-specific detection branches to balance detail preservation and computational cost.

Comprehensive experiments on the VisDrone2019 and DOTA datasets show that ZY-DETR achieves an AP of 23.5% on VisDrone2019 (3.2 points above the baseline RT-DETR) and 60.0% on the DOTA test set, while maintaining real-time inference speeds of 59.31 FPS and 85.61 FPS, respectively. The algorithm substantially reduces missed small objects and improves the efficiency of multi-scale fusion in UAV imagery.

2. Methodology

2.1 Overall Architecture

The proposed ZY-DETR framework is illustrated in Figure 1 (conceptual description). It consists of three main components: (1) the GradNet heterogeneous backbone for feature extraction; (2) the IST-Fusion module for linear-complexity cross-scale feature alignment and fusion; and (3) the FineGrained-Detect detection head for high-resolution shallow feature utilization. The input is a 640×640 UAV remote sensing image, and the output comprises bounding box coordinates and class probabilities for all detected objects. The entire model has 15.46M parameters and runs at 59.31 FPS on the VisDrone2019 test set, which meets the real-time requirements of UAV applications.

The synergy among components is as follows: GradNet provides multi-scale features ranging from high-resolution shallow layers (rich in texture details) to low-resolution deep layers (rich in semantic context). IST-Fusion then aligns these features using the TSSA mechanism, which models global distribution statistics across tokens and fuses them with cross-scale features at linear cost. Finally, FineGrained-Detect directly takes the shallow C2 features (320×320) to preserve small-object details, applies a lightweight fusion unit, and employs scale-specific detection branches to predict object classes and boundaries. The decoder uses a query selection strategy to enable end-to-end detection without NMS.

2.2 GradNet Heterogeneous Backbone

GradNet adopts a four-level hierarchical residual structure (C2 to C5) with a total stride of 16, so that a 640×640 input yields a 40×40 feature map at C5. Unlike traditional homogeneous backbones that use the same module throughout, GradNet employs heterogeneous designs for shallow and deep layers. The shallow layers (C2 and C3) focus on preserving fine-grained details (edges, textures) of small objects using C2f (cross-stage partial convolution with two branches) modules. The deep layers (C4 and C5) achieve lightweight local-to-global feature refinement using the C2f-CGSA module, which integrates a convolutional gated linear unit (CGLU) and single-head self-attention (SHSA).

The initial embedding stage uses two consecutive 3×3 convolutions (stride 1 and 2) without pooling, downsampling the input to 320×320 resolution for C2. This design avoids the loss of small object details caused by pooling layers. In the shallow layers, the C2f module processes features by first applying a 1×1 convolution for channel reduction, then splitting the features into two branches with bottleneck residual connections, and finally concatenating and recovering the channel dimension via another 1×1 convolution.

For the deep layers, the C2f-CGSA module operates in a local-to-global order: first, CGLU performs a 3×3 depthwise convolution to extract local contour features and applies dynamic channel gating to suppress redundant information; second, SHSA models long-range spatial dependencies after layer normalization, capturing the global distribution of objects. The C5 output has a resolution of 40×40, with an extended receptive field for large targets. Experimental results show that GradNet reduces the parameter count from 19.88M (baseline) to 13.55M (a 31.8% reduction) and total FLOPs from 57G to 50G, while improving AP by 0.5 points and AP_l by 3.5 points.
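The resolutions quoted in this section can be checked with a few lines of arithmetic. The sketch below assumes one stride-2 downsampling step per level, which is inferred from the stated 320×320 (C2) and 40×40 (C5) sizes rather than taken from the authors' code; the function name is illustrative.

```python
def stage_resolutions(input_size=640, levels=("C2", "C3", "C4", "C5")):
    """Return {level: (side_length, cumulative_stride)}.

    Assumption: every level halves the spatial resolution exactly once.
    """
    out = {}
    stride = 1
    for name in levels:
        stride *= 2
        out[name] = (input_size // stride, stride)
    return out


for name, (side, stride) in stage_resolutions().items():
    print(f"{name}: {side}x{side} (cumulative stride {stride})")
```

Under this assumption the four levels come out at 320, 160, 80, and 40 pixels per side, matching the feature sizes used in the rest of the paper.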

2.3 IST-Fusion: Inter-Scale Fusion with Token Statistical Self-Attention

The IST-Fusion module is designed to replace the original multi-head self-attention (MHSA) in RT-DETR’s AIFI (Attention-based Intrascale Feature Interaction) module. The core innovation is the Token Statistical Self-Attention (TSSA) mechanism, which reduces the computational complexity from quadratic O(n²) to linear O(n). The overall process consists of three stages: channel redundancy compression via AIFI, token statistical modeling via TSSA, and multi-scale cross-layer fusion.

Stage 1: Channel redundancy compression (AIFI). For the top-level C5 feature (shape [B, 512, 40, 40]), AIFI applies global average pooling to aggregate spatial information, then passes the result through a lightweight MLP (1×1 conv → ReLU → 1×1 conv) with channel compression ratio 4 (512→128→512). The resulting channel weights (after softmax) are multiplied element-wise with the original C5 feature to suppress redundant channels:

$$ \mathbf{Z}^{\text{AIFI}} = \sigma\big(\text{MLP}(\text{GlobalAvgPool}(\mathbf{X}_{C5}))\big) \odot \mathbf{X}_{C5} $$

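where $\sigma$ denotes the softmax that produces the channel weights and $\odot$ denotes element-wise multiplication, with the weights broadcast over the spatial dimensions. The output $\mathbf{Z}^{\text{AIFI}}$ has the same shape as $\mathbf{X}_{C5}$.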
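The gating step can be illustrated with a small NumPy sketch. It is not the authors' implementation: the function name, the random weights, and the omission of bias terms are assumptions made for illustration, and the 1×1 convolutions are written as matrix products because they act on a 1×1 pooled map.

```python
import numpy as np


def channel_gate(x, w1, w2):
    """Channel gating as described for Stage 1.

    x:  [B, C, H, W] feature map
    w1: [C, C // r] weights of the first 1x1 conv (bias omitted, an assumption)
    w2: [C // r, C] weights of the second 1x1 conv (bias omitted, an assumption)
    """
    pooled = x.mean(axis=(2, 3))                    # global average pooling -> [B, C]
    hidden = np.maximum(pooled @ w1, 0.0)           # 1x1 conv + ReLU -> [B, C // r]
    logits = hidden @ w2                            # 1x1 conv -> [B, C]
    logits = logits - logits.max(axis=1, keepdims=True)  # stabilise the softmax
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)   # softmax over channels
    return weights[:, :, None, None] * x            # broadcast over H and W


rng = np.random.default_rng(0)
x = rng.standard_normal((2, 512, 40, 40)).astype(np.float32)
w1 = (rng.standard_normal((512, 128)) * 0.02).astype(np.float32)
w2 = (rng.standard_normal((128, 512)) * 0.02).astype(np.float32)
z = channel_gate(x, w1, w2)
print(z.shape)
```

Because the softmax weights sum to one across the 512 channels, each weight is small in absolute terms; in this form the gate rescales channels relative to one another and does not preserve overall magnitude.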

Stage 2: Token Statistical Self-Attention (TSSA). First, the feature is flattened and transposed to obtain a token sequence $\mathbf{T} \in \mathbb{R}^{B \times n \times C}$ with $n = H \times W = 1600$ tokens. A projection matrix $\mathbf{W}^P \in \mathbb{R}^{C \times p}$ (with $p=64$) maps tokens to a lower-dimensional space $\mathbf{T}^P \in \mathbb{R}^{B \times n \times p}$. Second, the first- and second-order statistics (mean and variance) of the projected tokens are computed over the token dimension:

$$ \boldsymbol{\mu} = \frac{1}{n}\sum_{i=1}^{n} \mathbf{T}^P[:,i,:], \quad \boldsymbol{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n} (\mathbf{T}^P[:,i,:] - \boldsymbol{\mu})^2 $$

The variance is then converted to a standard deviation, with a small floor $\epsilon = 10^{-6}$ for numerical stability: $\tilde{\boldsymbol{\sigma}} = \sqrt{\max(\boldsymbol{\sigma}^2, \epsilon)}$. The mean $\boldsymbol{\mu}$ and the stabilized standard deviation $\tilde{\boldsymbol{\sigma}}$ are concatenated along the channel dimension to form $\mathbf{S} \in \mathbb{R}^{B \times 1 \times 2p}$ (with $2p = 128$). A linear mapping $\mathbf{W}^A \in \mathbb{R}^{2p \times n}$ followed by GELU activation and a softmax over tokens generates the attention weights:

$$ \mathbf{W}^{\text{TSSA}} = \text{Softmax}\big(\text{GELU}(\mathbf{S} \mathbf{W}^A)\big) $$

The weights, one scalar per token, are broadcast across the $p$ channels of the projected tokens and applied element-wise to produce refined tokens $\mathbf{T}^R = \mathbf{W}^{\text{TSSA}} \odot \mathbf{T}^P$. Finally, a recovery matrix $\mathbf{W}^R \in \mathbb{R}^{p \times C}$ maps the refined tokens back to the original channel dimension, and the tensor is reshaped to a feature map $\mathbf{X}^{\text{TSSA}} \in \mathbb{R}^{B \times C \times H \times W}$ of size 40×40.
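The TSSA steps above can be sketched in NumPy to make the tensor shapes concrete. This is a minimal illustration under stated assumptions, not the authors' code: the function names, the random weights, the tanh approximation of GELU, and the small demo dimensions are all chosen for illustration.

```python
import numpy as np


def gelu(x):
    # tanh approximation of GELU (an assumption; the paper does not state the variant)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


def tssa(x, w_p, w_a, w_r, eps=1e-6):
    """x: [B, C, H, W]; w_p: [C, p]; w_a: [2p, n]; w_r: [p, C], with n = H * W."""
    b, c, h, w = x.shape
    n = h * w
    tokens = x.reshape(b, c, n).transpose(0, 2, 1)        # T:   [B, n, C]
    t_p = tokens @ w_p                                    # T^P: [B, n, p]
    mu = t_p.mean(axis=1)                                 # [B, p]
    var = ((t_p - mu[:, None, :]) ** 2).mean(axis=1)      # [B, p]
    sigma = np.sqrt(np.maximum(var, eps))                 # floored standard deviation
    s = np.concatenate([mu, sigma], axis=-1)[:, None, :]  # S:   [B, 1, 2p]
    logits = gelu(s @ w_a)                                # [B, 1, n]
    logits = logits - logits.max(axis=-1, keepdims=True)  # stabilise the softmax
    weights = np.exp(logits)
    weights /= weights.sum(axis=-1, keepdims=True)        # softmax over tokens
    t_r = weights.transpose(0, 2, 1) * t_p                # [B, n, 1] * [B, n, p]
    out = t_r @ w_r                                       # [B, n, C]
    return out.transpose(0, 2, 1).reshape(b, c, h, w), weights


# Small demo dimensions keep the example fast; the paper uses C=512, p=64, n=1600.
rng = np.random.default_rng(0)
x = rng.standard_normal((2, 32, 8, 8))
w_p = rng.standard_normal((32, 8)) * 0.1
w_a = rng.standard_normal((16, 64)) * 0.1
w_r = rng.standard_normal((8, 32)) * 0.1
y, attn = tssa(x, w_p, w_a, w_r)
print(y.shape, attn.shape)
```

Note that no token-to-token score matrix is ever formed: every step touches each token a constant number of times, which is where the linear cost comes from.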

The complexity of TSSA is O(2Cpn + Bnp), which is linear in the number of tokens n because C and p=64 are fixed. Compared with the O(n²C) cost of the original MHSA, this is a significant reduction.
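To give a feel for the difference, the sketch below evaluates both expressions for the C5 feature map (n = 1600, C = 512, p = 64). It counts only the terms named in the complexity formula, per image (B = 1), and drops constant factors for MHSA, so the numbers are order-of-magnitude estimates rather than measured FLOPs; the function names are illustrative.

```python
def tssa_cost(n, c=512, p=64):
    """Terms from the stated formula, per image: 2*C*p*n + n*p."""
    projection_and_recovery = 2 * c * p * n   # W^P and W^R matrix products
    statistics_and_weighting = n * p          # token statistics and element-wise weighting
    return projection_and_recovery + statistics_and_weighting


def mhsa_cost(n, c=512):
    """Dominant n^2 * C term of dense self-attention, constants dropped."""
    return n * n * c


n = 40 * 40
print(f"TSSA: {tssa_cost(n):,}")
print(f"MHSA: {mhsa_cost(n):,}")
print(f"ratio: {mhsa_cost(n) / tssa_cost(n):.1f}x")
```

Doubling the number of tokens doubles the TSSA estimate but quadruples the MHSA estimate, so the gap widens at higher resolutions.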

Stage 3: Multi-scale Cross-layer Fusion. The refined feature $\mathbf{X}^{\text{TSSA}}$ is upsampled by factors of 2, 4, and 8 to match the resolutions of C4 (80×80), C3 (160×160), and C2 (320×320). After channel alignment via 1×1 convolutions, the features are fused with the original C2-C4 features through a ReLU-based convolution block (RepBlock) to produce the final multi-scale features $\mathbf{X}^i_{\text{fusion}}$ for i=2,3,4.
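The upsampling factors can be verified with a short shape check. The paper does not state the interpolation mode, so nearest-neighbour repetition is used here purely as an assumption, and the channel count is reduced to keep the example small.

```python
import numpy as np


def upsample_nearest(x, factor):
    """x: [B, C, H, W] -> [B, C, H * factor, W * factor] by repeating pixels."""
    return x.repeat(factor, axis=2).repeat(factor, axis=3)


x_tssa = np.zeros((1, 8, 40, 40), dtype=np.float32)  # 8 channels for illustration only
factors = {"C4": 2, "C3": 4, "C2": 8}
shapes = {name: upsample_nearest(x_tssa, f).shape for name, f in factors.items()}
print(shapes)
```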

2.4 FineGrained-Detect Detection Head

The FineGrained-Detect head is specifically designed to exploit high-resolution shallow features for small object detection in UAV imagery. It connects directly to the original C2 feature (320×320 resolution) from GradNet, bypassing all destructive downsampling operations. The head consists of three key designs:

Direct connection of original high-resolution features: Unlike conventional FPN/PANet/BiFPN that use “downsample + upsample” pipelines, our head takes the raw C2 feature without any spatial compression, preserving the maximum possible detail for objects smaller than 16×16 pixels.
Lightweight fusion unit: A channel attention gate combined with depthwise separable convolution replaces the traditional convolution fusion. The channel attention gate dynamically selects important feature channels (e.g., edges, textures) while suppressing background redundancy. The depthwise separable convolution reduces parameter count while effectively fusing the C2 feature with the output of IST-Fusion.
Scale-specific detection branches: Three separate branches handle small, medium, and large objects. The small-object branch operates on the 320×320 C2 feature and uses 5 anchor scales (8×8, 12×12, 16×16, 20×20, 24×24) adapted to the VisDrone2019 distribution. The medium and large branches use the IST-Fusion features at 160×160 and 80×80 resolutions, employing a lightweight decoupled head for classification and regression. Cross-scale interaction gates transfer only essential semantic features from larger scales to the small-object branch to avoid background interference.
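The parameter saving from depthwise separable convolution, used in the lightweight fusion unit, follows from a standard count. The sketch below compares a regular 3×3 convolution with its depthwise separable counterpart; the channel width of 256 is a hypothetical value chosen for illustration, and bias terms are ignored.

```python
def conv_params(c_in, c_out, k=3):
    """Weights of a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in, c_out, k=3):
    """Weights of a k x k depthwise conv followed by a 1x1 pointwise conv (bias ignored)."""
    return k * k * c_in + c_in * c_out


c = 256  # hypothetical channel width, not taken from the paper
standard = conv_params(c, c)
separable = depthwise_separable_params(c, c)
print(standard, separable, round(standard / separable, 2))
```

For equal input and output widths the saving approaches a factor of k² = 9 as the channel count grows, which is why the fusion unit stays cheap even on the 320×320 feature map.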

The FineGrained-Detect head is optimized with CIoU loss for bounding box regression. Compared with traditional detection heads, it achieves a substantial improvement in small-object AP (a 3.0-point improvement over the baseline RT-DETR) while adding only minimal computational overhead, because it relies on depthwise separable convolutions and channel attention instead of full spatial attention.

3. Experiments

3.1 Datasets and Setup

We evaluate ZY-DETR on two mainstream benchmarks: VisDrone2019 and DOTA. VisDrone2019 consists of 6,471 training images, 548 validation images, and 1,610 test images with 10 object categories (e.g., pedestrian, car, bus). The dataset features extreme scale variation, dense small targets, and complex backgrounds (day/night, urban/suburban). DOTA contains 2,806 training images, 1,418 validation images, and 1,723 test images across 15 categories (e.g., vehicle, building, harbor). It provides complementary challenges with diverse object scales and heavy background clutter.

All experiments are conducted under a unified hardware and software configuration: NVIDIA RTX 3090 GPU (24GB), Intel i9-12900K CPU, 64GB RAM, PyTorch 2.2.2, CUDA 11.8, OpenCV 4.8.0. Training uses an input size of 640×640, batch size 4, 200 epochs, AdamW optimizer with initial learning rate 1×10⁻⁴ and weight decay 1×10⁻⁴, cosine decay scheduling to a minimum of 1×10⁻⁵. Inference is performed with batch size 1, no data augmentation, single-GPU. Metrics include FLOPs, parameters, inference speed (FPS), AP@[0.5:0.95], AP₅₀, and per-scale APs (AP_s for small, AP_m for medium, AP_l for large) according to COCO standards.
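The learning-rate schedule described above can be written down in a few lines. This sketch assumes a per-epoch update with no warm-up, neither of which is stated in the text; the function name is illustrative.

```python
import math


def cosine_lr(epoch, total_epochs=200, lr_max=1e-4, lr_min=1e-5):
    """Cosine decay from lr_max at epoch 0 to lr_min at the final epoch.

    Assumptions: the rate is updated once per epoch and there is no warm-up.
    """
    progress = epoch / (total_epochs - 1)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


for epoch in (0, 50, 100, 150, 199):
    print(epoch, f"{cosine_lr(epoch):.2e}")
```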

3.2 Ablation Study

We conduct ablation experiments on the VisDrone2019 test set to validate the effectiveness of each proposed module. The baseline is RT-DETR (denoted as A). We incrementally add: GradNet backbone (B), IST-Fusion module (C), and FineGrained-Detect head (D). Results are summarized in Table 1.

Table 1: Ablation study of ZY-DETR modules on VisDrone2019 test set.
Base	GradNet	IST-Fusion	FineGrained	GFLOPs	Params	AP	AP₅₀	AP_s	AP_m	AP_l	FPS	Model Size
✓				57.0	19.88M	0.203	0.355	0.112	0.300	0.352	76.28	77.0MB
✓	✓			50.0	13.55M	0.208	0.362	0.117	0.304	0.387	75.76	52.9MB
✓		✓		57.1	19.75M	0.206	0.358	0.114	0.300	0.374	93.52	76.5MB
✓			✓	78.2	18.60M	0.220	0.376	0.128	0.320	0.372	61.64	72.4MB
✓	✓	✓		50.1	13.40M	0.213	0.368	0.119	0.311	0.372	92.65	52.4MB
✓	✓		✓	115.9	15.59M	0.231	0.396	0.137	0.333	0.416	69.13	61.1MB
✓		✓	✓	78.3	18.47M	0.222	0.376	0.127	0.322	0.424	69.67	71.9MB
✓	✓	✓	✓	116.0	15.46M	0.235	0.402	0.142	0.336	0.429	59.31	60.6MB

Key observations: (1) GradNet alone reduces parameters by 31.8% and FLOPs by 12.3% while improving AP by 0.5 points and AP_l by 3.5 points. (2) IST-Fusion (TSSA replacing MHSA) keeps FLOPs nearly unchanged but raises inference speed from 76.28 to 93.52 FPS, with a slight AP gain (+0.3 points). (3) The full model (A+B+C+D) achieves the best AP (0.235), a 3.2-point improvement over the baseline, raising AP_s from 0.112 to 0.142, AP_m from 0.300 to 0.336, and AP_l from 0.352 to 0.429. The model size shrinks from 77.0MB to 60.6MB.
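The relative reductions and point gains quoted above can be recomputed directly from the Table 1 entries. The helper names below are illustrative; the numbers are copied from the baseline, GradNet-only, and full-model rows.

```python
def relative_reduction(before, after):
    """Percentage decrease from `before` to `after`."""
    return 100.0 * (before - after) / before


def point_gain(before, after):
    """Gain in percentage points for metrics stored as fractions in [0, 1]."""
    return 100.0 * (after - before)


baseline = {"params_m": 19.88, "gflops": 57.0, "ap": 0.203, "ap_s": 0.112, "ap_l": 0.352}
gradnet = {"params_m": 13.55, "gflops": 50.0, "ap": 0.208, "ap_l": 0.387}
full = {"ap": 0.235, "ap_s": 0.142, "ap_l": 0.429}

print(f"params reduction: {relative_reduction(baseline['params_m'], gradnet['params_m']):.1f}%")
print(f"FLOPs reduction:  {relative_reduction(baseline['gflops'], gradnet['gflops']):.1f}%")
print(f"AP gain (full):   {point_gain(baseline['ap'], full['ap']):.1f} points")
print(f"AP_s gain (full): {point_gain(baseline['ap_s'], full['ap_s']):.1f} points")
```

Keeping relative reductions (in percent) separate from absolute gains (in points) avoids ambiguity when AP values are themselves reported as percentages.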

3.3 Comparison with State-of-the-Art Detectors

We compare ZY-DETR with a wide range of recent detectors on VisDrone2019 test set, including YOLO8/10/11/12, RT-DETR-R18, two-stage detectors (Faster R-CNN, Cascade R-CNN), and advanced Transformer-based models (DINO, DEIM, D-Fine). Results are shown in Table 2.

Table 2: Performance comparison on VisDrone2019 test set.
Model	Input	GFLOPs	Params	AP	AP₅₀	AP_s	AP_m	AP_l
YOLO8m	640	78.7	25.85M	0.190	0.332	0.060	0.294	0.417
YOLO10m	640	58.9	15.32M	0.195	0.345	0.097	0.300	0.414
YOLO11m	640	67.7	20.04M	0.203	0.350	0.098	0.312	0.413
YOLO12m	640	67.2	20.11M	0.192	0.336	0.094	0.298	0.386
RT-DETR-R18	640	57.0	19.88M	0.203	0.355	0.112	0.300	0.352
Faster R-CNN R50-FPN-CIOU	768	208	41.39M	0.194	0.329	0.095	0.309	0.429
Cascade R-CNN R50-FPN	768	236	69.29M	0.197	0.326	0.099	0.309	0.406
DEIM-D-Fine-S	640	24.86	10.18M	0.219	0.384	0.122	0.321	0.397
DINO	750	274	47.56M	0.253	0.445	0.150	0.371	0.503
ZY-DETR (Ours)	640	116	15.46M	0.235	0.402	0.142	0.336	0.429

ZY-DETR achieves the highest AP among all methods except DINO, which uses a much larger input (750×1333) and about 3× as many parameters. Notably, ZY-DETR surpasses all YOLO variants by a large margin in AP_s (14.2% vs. 9.8% for YOLO11m) and outperforms RT-DETR-R18 by 3.2 points in overall AP. The balanced performance across scales (AP_s, AP_m, AP_l) demonstrates that our algorithm effectively handles the extreme scale variations inherent in UAV imagery.

3.4 Generalization on DOTA Dataset

To evaluate cross-domain generalization, we compare ZY-DETR with RT-DETR on the DOTA test and validation sets under identical settings. Results are given in Table 3.

Table 3: Generalization performance on DOTA dataset.
Set	Model	GFLOPs	Params	AP	AP₅₀	AP_s	AP_m	AP_l	FPS
Test	RT-DETR	57.0	19.89M	0.574	0.797	0.356	0.597	0.724	81.13
Test	ZY-DETR	116.1	15.47M	0.600	0.808	0.403	0.624	0.731	85.61
Val	RT-DETR	57.0	19.89M	0.588	0.818	0.362	0.622	0.662	78.36
Val	ZY-DETR	116.1	15.47M	0.615	0.830	0.401	0.658	0.678	82.03

On the test set, ZY-DETR improves AP by 2.6 points (from 57.4% to 60.0%), with a particularly notable 4.7 point boost in AP_s (from 35.6% to 40.3%). The inference speed also increases from 81.13 to 85.61 FPS, demonstrating superior efficiency. The validation set results confirm the trend, with a 2.7 point improvement in AP and consistent gains across all scales. These results highlight the strong generalization capability of our algorithm across different remote sensing datasets, further validating its suitability for drone technology applications.

3.5 Qualitative Analysis

We provide qualitative comparisons (conceptual description, not actual images due to space constraints) between the baseline RT-DETR and ZY-DETR on VisDrone2019. In scenarios with night-time roads, dense traffic, and daytime parking lots, ZY-DETR’s bounding boxes precisely fit the object contours, while RT-DETR exhibits misaligned or oversized boxes, especially for small and medium objects. The attention heatmaps of ZY-DETR show concentrated high-confidence regions on target objects with suppressed background responses, whereas RT-DETR’s heatmaps are scattered with significant background noise and weak signals for small targets. These visual improvements confirm that ZY-DETR effectively mitigates the core issues of small object detail loss, multi-scale misalignment, and background interference in drone technology imagery.

4. Conclusion

This paper presents ZY-DETR, an improved RT-DETR algorithm specifically designed for small object detection in UAV remote sensing images. By introducing the GradNet heterogeneous backbone, the IST-Fusion module with linear-complexity token statistical self-attention, and the FineGrained-Detect detection head that directly utilizes high-resolution shallow features, our method achieves significant gains in detection accuracy while maintaining real-time performance and a compact model size. Extensive experiments on the VisDrone2019 and DOTA datasets demonstrate that ZY-DETR outperforms a wide range of state-of-the-art detectors, particularly in terms of small-object AP (14.2% on VisDrone2019, 40.3% on the DOTA test set) and balanced multi-scale precision. The inference speeds of 59.31 FPS on VisDrone2019 and 85.61 FPS on DOTA, measured on an RTX 3090 GPU, confirm its real-time capability; performance on onboard edge hardware remains to be verified.

Future work includes further optimization through quantization and knowledge distillation to enable deployment on ultra-low-power UAV platforms, improvement of the TSSA mechanism for even more robust small object capture, and extension to other remote sensing tasks such as oriented object detection and instance segmentation. The proposed multi-scale fusion paradigm has the potential to benefit a wide range of applications in drone technology, including traffic monitoring, disaster response, and precision agriculture.