Enhanced Low-Altitude UAV Object Detection: A Cross-scale and Position-Aware RT-DETR Approach

With the official inclusion of the low-altitude economy in China's national development strategy, the aviation landscape is undergoing a significant transformation. Unmanned Aerial Vehicles (UAVs) have emerged as pivotal carriers within this new low-altitude transportation ecosystem. Their applications are rapidly diversifying across critical sectors such as precision agriculture, logistics, infrastructure surveying, and emergency response. However, a persistent challenge remains: achieving high-precision, real-time object detection under the stringent computational constraints typical of UAV platforms. This capability is fundamental for autonomous navigation, mission-specific target identification, and safe integration into increasingly crowded airspace.

The operational environment of low-altitude UAV missions presents unique difficulties for vision systems. Objects of interest, such as vehicles, pedestrians, or livestock, often occupy very few pixels in the captured imagery because of the flight altitude and oblique viewing angles. Detection of these small targets is further complicated by frequent occlusion from buildings, vegetation, or terrain. In addition, platform vibration, rapid perspective changes, and varying lighting conditions can introduce motion blur and reduce image clarity. Traditional object detectors developed for ground-based or near-field perspectives often struggle with these compounded challenges, leading to reduced recall and localization accuracy for small objects. Developing detection algorithms that are both accurate and computationally efficient is therefore crucial for advancing the operational reliability and scope of low-altitude UAV applications.

In the evolution of object detection, early two-stage methods like the R-CNN family, while accurate, were computationally heavy and slow, making them unsuitable for real-time drone applications. The advent of one-stage detectors like the YOLO series marked a turning point, offering a better balance between speed and accuracy through direct regression. Numerous improvements have been made to YOLO architectures to enhance performance for drone vision. For instance, researchers have integrated bi-directional feature pyramids and lightweight convolution modules into YOLOv8 to improve multi-scale feature fusion. Other works have introduced sparse attention mechanisms specifically for UAV imagery to bolster the model’s focus on small and occluded targets. Despite these advances, YOLO-based methods inherently depend on hand-crafted components like anchor boxes and non-maximum suppression (NMS), which can introduce hyperparameter sensitivity and post-processing bottlenecks.

The introduction of the Transformer architecture, with its powerful self-attention mechanism, offered a new paradigm for capturing long-range dependencies in images. The Detection Transformer (DETR) pioneered an end-to-end detection framework that eliminated the need for anchors and NMS, simplifying the pipeline. However, its original formulation suffered from slow training convergence and high computational cost. The subsequent development of Real-Time DETR (RT-DETR) addressed these issues by incorporating hybrid encoders and efficiency optimizations, making Transformer-based detection viable for real-time applications. RT-DETR preserves the global modeling strength of Transformers while achieving competitive inference speeds. Nonetheless, we identified room for enhancement, particularly in cross-scale feature alignment and explicit spatial position modeling for the distinct geometry of low-altitude drone perspectives.

To meet the dual demands of high precision and lightweight deployment in low-altitude UAV operations, this paper presents an improved model named CAPE-RT-DETR (Cross-scale Alignment and Positional Encoding-enhanced RT-DETR). Our contributions are threefold, each designed to address specific shortcomings in the low-altitude detection context while maintaining a compact model size suitable for edge deployment on drone platforms.

The core architecture of our CAPE-RT-DETR model is built upon the RT-DETR framework, which we enhance with three novel modules. The overall workflow begins with a backbone network for initial feature extraction. These multi-scale features are then processed by our enhanced modules before being passed to the hybrid encoder and finally the Transformer decoder for box prediction.
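The data flow described above can be sketched as a simple composition of stages. This is an illustrative outline only: every stage function below is a placeholder standing in for the corresponding network component, not the authors' implementation.

```python
# Hypothetical wiring of the described pipeline; each argument is a
# callable placeholder for the corresponding network component.
def detect(image, backbone, c2ml, csfc, aifp_encoder, decoder):
    feats = backbone(image)        # multi-scale feature extraction
    feats = c2ml(feats)            # dynamic-kernel feature enhancement
    feats = csfc(feats)            # cross-scale alignment and fusion
    memory = aifp_encoder(feats)   # position-aware hybrid encoding
    return decoder(memory)         # Transformer decoder -> box predictions
```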

Firstly, we design the C2ML (C2f with Mixed Local-context blocks) module to replace standard feature extraction blocks in the backbone. The key innovation here is the dynamic generation of convolution kernels conditioned on the input feature map, moving beyond static filters. This allows the model to adaptively focus on features relevant to objects of varying scales and appearances common in drone scenes. The module employs a Large Kernel Predictor (LKP) to generate position-specific 3×3 kernels, enhancing receptive field coverage for small objects. A gating mechanism then selectively fuses these dynamically convolved features with the original identity path, reducing redundancy and emphasizing critical information. The dynamic convolution operation for a spatial position $(h, w)$ can be formulated as:

$$ O(h,w) = \sum_{k_h=0}^{2} \sum_{k_w=0}^{2} K(h, w, k_h, k_w) \cdot X(h + k_h - 1, w + k_w - 1) $$

where $K$ is the dynamically generated kernel and $X$ is the input feature map. The gated fusion is performed as:

$$ Y = X + \gamma \left( \text{FC}_2 \left( \delta(g) \odot [i; \text{SKA}(c, K)] \right) \right) $$

Here, the channels are split into gating $g$, identity $i$, and convolutional branch $c$. $\delta$ is the GELU activation, $\text{SKA}(c,K)$ is the output of the dynamic convolution on branch $c$, $[;]$ denotes concatenation, $\odot$ is element-wise multiplication, and $\gamma$ is a DropPath coefficient for regularization.
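To make the dynamic convolution concrete, the following is a minimal single-channel NumPy sketch of the $O(h,w)$ sum above, assuming zero padding at the borders. In the actual module, $K$ is produced by the LKP and the result feeds the gated fusion; here $K$ is supplied directly for illustration.

```python
import numpy as np

def dynamic_conv3x3(X, K):
    """Per-position dynamic 3x3 convolution: the O(h, w) sum above.

    X : (H, W) single-channel feature map
    K : (H, W, 3, 3) position-specific kernels (produced by the LKP in the model)
    """
    H, W = X.shape
    Xp = np.pad(X, 1)                     # zero-pad so X(h+kh-1, w+kw-1) exists at borders
    O = np.zeros_like(X, dtype=float)
    for h in range(H):
        for w in range(W):
            patch = Xp[h:h + 3, w:w + 3]  # 3x3 neighbourhood centred at (h, w)
            O[h, w] = np.sum(K[h, w] * patch)
    return O

# With an all-ones kernel at every position, each output is the local 3x3 sum.
X = np.arange(9, dtype=float).reshape(3, 3)
K = np.ones((3, 3, 3, 3))
O = dynamic_conv3x3(X, K)   # O[1, 1] sums the whole map: 0 + 1 + ... + 8 = 36
```

Unlike a static convolution, the kernel `K[h, w]` can differ at every position, which is what lets the model adapt its filtering to local object scale and appearance.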

Secondly, we propose the AIFP (Augmented Interaction with Features via Learned Positional Encoding) module to enhance the encoder's spatial awareness. Unlike the fixed sinusoidal positional encodings used in standard Transformers, which assume uniform periodicity, or relative encodings that focus on local relations, we employ fully learnable 2D positional embeddings. This allows the model to learn the optimal spatial representation directly from aerial imagery, implicitly compensating for perspective distortion and the spatial relationships peculiar to low-altitude views. The process enriches the flattened feature sequence $S$ before the attention mechanism:

$$ \tilde{S} = S + P $$

where $P \in \mathbb{R}^{B \times N \times C}$ is the learnable positional encoding matrix. This enhanced sequence $\tilde{S}$ is then used to compute the Query ($Q$), Key ($K$), and Value ($V$) matrices for multi-head self-attention (MHSA):

$$ Q = \tilde{S}W_Q, \quad K = \tilde{S}W_K, \quad V = \tilde{S}W_V $$

The attention output is refined through a standard feed-forward network (FFN) with residual connections, significantly boosting the model’s ability to perceive geometric layout and improve localization accuracy for distant, small targets.
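A minimal single-head NumPy sketch of the AIFP step above: the learnable embedding $P$ and the projection matrices are trained parameters in the model, but are represented by random arrays here purely to show the shapes and the flow of computation.

```python
import numpy as np

rng = np.random.default_rng(0)
B, N, C = 1, 6, 8                        # batch, flattened H*W tokens, channels (toy sizes)

S = rng.standard_normal((B, N, C))       # flattened feature sequence
P = rng.standard_normal((B, N, C))       # learnable 2D positional embedding (trained in practice)
S_tilde = S + P                          # \tilde{S} = S + P

W_Q = rng.standard_normal((C, C))        # projection weights (learned in practice)
W_K = rng.standard_normal((C, C))
W_V = rng.standard_normal((C, C))
Q, K, V = S_tilde @ W_Q, S_tilde @ W_K, S_tilde @ W_V

# Single-head scaled dot-product attention for illustration (MHSA splits C across heads).
scores = Q @ K.transpose(0, 2, 1) / np.sqrt(C)
weights = np.exp(scores - scores.max(-1, keepdims=True))
weights /= weights.sum(-1, keepdims=True)    # softmax: each row sums to 1
out = weights @ V                            # (B, N, C), then FFN with residual connections
```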

Thirdly, we introduce a multi-scale fusion optimization block termed CSFC (Context- and Spatial-flow-guided Feature Compensation). This module tackles the feature misalignment that often occurs during simple upsampling operations in feature pyramids. It consists of two synergistic components. The Cross-scale Feature Context (CFC) component uses a Pyramid Scene Parsing (PSP) structure to gather sparse but informative global context from multiple grid scales, enhancing semantic consistency across different feature map levels. The Spatial-flow-guided Feature Compensation (SFC) component explicitly learns spatial offset fields between deep (semantic) and shallow (detailed) feature maps using a dual-group convolution. It then uses these offsets to warp the deep features into better alignment with the shallow ones via a differentiable grid sampling operation before fusion. The offset prediction and adaptive fusion are key:

$$ \Delta^l, \Delta^h = \text{GConv}_{3\times3} (\text{Concat}(F_{cp}, F_{sp}^{\uparrow})) $$
$$ F_{\text{fused}} = F_{sp}^{\text{grid}} \cdot \omega_1 + F_{cp}^{\text{grid}} \cdot \omega_2 $$

Here, $\Delta$ represents the learned spatial offsets, $\text{GConv}$ is a grouped convolution, $F^{\text{grid}}$ denotes features aligned via grid sampling, and $\omega$ are adaptive fusion weights. This explicit alignment is particularly beneficial for accurately assembling the pieces of small or partially occluded objects from different feature scales.
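The grid-sampling step of SFC can be illustrated with a minimal single-channel bilinear warp in NumPy. In the actual module the offsets are predicted by the grouped convolution and the sampling is differentiable; here the offsets are supplied directly for illustration.

```python
import numpy as np

def warp_bilinear(F_map, offsets):
    """Align a feature map by per-pixel offsets, as in the SFC grid sampling.

    F_map   : (H, W) feature map to warp (e.g. the deep, semantic level)
    offsets : (H, W, 2) learned (dy, dx) offsets; here supplied directly
    """
    H, W = F_map.shape
    out = np.zeros_like(F_map, dtype=float)
    for h in range(H):
        for w in range(W):
            y = np.clip(h + offsets[h, w, 0], 0, H - 1)   # sampling location
            x = np.clip(w + offsets[h, w, 1], 0, W - 1)
            y0, x0 = int(np.floor(y)), int(np.floor(x))
            y1, x1 = min(y0 + 1, H - 1), min(x0 + 1, W - 1)
            wy, wx = y - y0, x - x0                        # bilinear weights
            out[h, w] = ((1 - wy) * (1 - wx) * F_map[y0, x0]
                         + (1 - wy) * wx * F_map[y0, x1]
                         + wy * (1 - wx) * F_map[y1, x0]
                         + wy * wx * F_map[y1, x1])
    return out

# Zero offsets leave the map unchanged; learned offsets shift each sampling
# location so deep features line up with the shallow map before fusion.
F = np.arange(9, dtype=float).reshape(3, 3)
aligned = warp_bilinear(F, np.zeros((3, 3, 2)))   # identical to F
```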

We evaluate the proposed CAPE-RT-DETR model on two challenging UAV datasets: the ALU dataset, featuring various vehicles in low-altitude scenarios, and the large-scale VisDrone2019 dataset, which includes diverse categories like pedestrians, cars, and bicycles in complex urban and rural settings. Our experimental environment utilizes an NVIDIA A10 GPU with PyTorch, training for 100 epochs with an input resolution of $640\times640$ and the AdamW optimizer.

The performance is measured using standard metrics: mean Average Precision at IoU thresholds of 0.5 ($mAP_{@0.5}$) and 0.5:0.95 ($mAP_{@0.5:0.95}$), model parameters (Params), Giga Floating-Point Operations (GFLOPs), and inference speed in Frames Per Second (FPS). The comparative results on the VisDrone2019 test set are summarized below.

| Model | Params (M) | GFLOPs | FPS | $mAP_{@0.5}$ (%) | $mAP_{@0.5:0.95}$ (%) |
|---|---|---|---|---|---|
| Faster R-CNN | – | – | – | 33.5 | 17.3 |
| YOLOv8m | 25.3 | 76.4 | 221 | 41.3 | 25.1 |
| YOLOv8-SOD | 14.2 | 40.5 | 255 | 46.1 | 27.0 |
| DETR | 41.6 | 187.2 | 32.5 | 38.5 | 26.5 |
| RT-DETR (Baseline) | 19.7 | 56.2 | 105 | 44.3 | 26.9 |
| Deformable DETR | 40.6 | 201.2 | 28 | 43.6 | 27.3 |
| CAPE-RT-DETR (Ours) | 14.2 | 70.1 | 52 | 48.7 | 30.5 |

The results clearly demonstrate the effectiveness of our approach. CAPE-RT-DETR achieves the highest detection accuracy, with a $mAP_{@0.5}$ of 48.7% and a $mAP_{@0.5:0.95}$ of 30.5%, outperforming all compared models, including recent YOLO variants and other Transformer-based detectors. Crucially, it attains this performance with a lightweight footprint of only 14.2 million parameters. Although its throughput is lower than that of highly parallelized YOLO models, owing to the sequential nature of the Transformer decoder, 52 FPS comfortably meets the real-time requirements of most low-altitude UAV applications, yielding a balanced trade-off among accuracy, parameter efficiency, and speed.

To dissect the contribution of each proposed component, we conduct a comprehensive ablation study on the VisDrone2019 dataset, starting from the RT-DETR baseline. The results are systematically presented in the following table.

| Exp. Group | C2ML | AIFP | CSFC | Params (M) | $mAP_{@0.5}$ (%) | $mAP_{@0.5:0.95}$ (%) |
|---|---|---|---|---|---|---|
| 1 (Baseline) | × | × | × | 41.2 | 44.3 | 26.9 |
| 2 | ✓ | × | × | 38.3 | 47.7 | 27.9 |
| 3 | × | ✓ | × | 34.7 | 45.6 | 28.0 |
| 4 | × | × | ✓ | 31.9 | 46.0 | 27.6 |
| 5 | × | ✓ | ✓ | 26.5 | 46.7 | 28.4 |
| 6 (Full Model) | ✓ | ✓ | ✓ | 23.1 | 48.7 | 30.5 |

The ablation study provides clear insights. Each module (C2ML, AIFP, CSFC) contributes positively when added to the baseline individually. The C2ML module has a particularly strong impact on $mAP_{@0.5}$, raising it by 3.4 percentage points. The AIFP module contributes the most to the stricter $mAP_{@0.5:0.95}$ metric, highlighting its role in improving localization precision. Notably, when modules are combined (Group 5: AIFP+CSFC), the gains are synergistic, in some respects exceeding the sum of the individual gains, while the parameter count decreases substantially. Finally, integrating all three modules (Group 6) yields the best performance on every metric with the most parameter-efficient model, confirming that the proposed enhancements complement one another in addressing the multifaceted challenges of low-altitude drone-based detection.

In conclusion, this research addresses the need for high-precision, lightweight object detection in the rapidly evolving domain of low-altitude UAV operations. Building on the RT-DETR framework, we introduced the CAPE-RT-DETR model, which integrates three key advancements: a dynamic feature enhancement module (C2ML) for adaptive context capture, a learnable position-aware interaction module (AIFP) for superior geometric understanding, and an explicit cross-scale feature compensation mechanism (CSFC) for precise multi-scale alignment. Extensive experiments on standard UAV benchmarks show that our model achieves state-of-the-art detection accuracy, particularly for challenging small objects, while maintaining a compact model size suitable for edge deployment. The ablation studies demonstrate the efficacy and synergy of each component. This work provides an algorithmic foundation for more reliable and intelligent UAV systems in complex real-world scenarios, paving the way for safer and more efficient low-altitude transportation and services. Future work will focus on integrating this detection capability with real-time UAV path planning and decision-making for fully autonomous operation.
