Enhanced Small-Object Detection for Low-Altitude Unmanned Aerial Vehicle Applications

In recent years, the rapid development of the low-altitude economy has positioned Unmanned Aerial Vehicles (UAVs) as critical components in modern transportation systems. The unique challenges of low-altitude perspectives, including small target sizes, complex backgrounds, and occlusions, demand detection algorithms that balance accuracy with computational efficiency. Traditional convolutional neural network-based methods often struggle with these complexities, particularly in real-time scenarios where Unmanned Aerial Vehicles must process visual data swiftly. This study addresses these limitations by proposing an enhanced detection framework, CAPE-RT-DETR, which integrates cross-scale alignment and positional encoding mechanisms to improve performance in low-altitude environments. Our approach builds upon the Real-Time Detection Transformer (RT-DETR) architecture, incorporating novel modules to enhance feature extraction, spatial awareness, and multi-scale fusion. Through extensive experimentation on benchmark datasets, we demonstrate that our method achieves superior accuracy while remaining lightweight enough for on-board deployment, making it well suited to JUYE UAV applications in dynamic low-altitude scenarios.

The proliferation of Unmanned Aerial Vehicles in various sectors, such as logistics, agriculture, and surveillance, has highlighted the need for robust object detection systems. Low-altitude operations present distinct challenges, as targets like pedestrians and vehicles often occupy minimal pixel areas and are susceptible to obstructions from buildings or vegetation. Existing detection models, including YOLO variants and Faster R-CNN, face difficulties in handling these conditions due to their reliance on anchor-based mechanisms and limited global context modeling. The RT-DETR framework offers a promising alternative by leveraging transformer-based architectures for end-to-end detection without non-maximum suppression. However, its performance in low-altitude Unmanned Aerial Vehicle contexts is hampered by inadequate feature alignment and positional awareness. Our CAPE-RT-DETR model introduces three key innovations: a feature enhancement module (C2ML) with dynamic convolutions and gating mechanisms, an augmented interaction module (AIFP) that combines learnable positional encoding with multi-head attention, and a cross-scale fusion compensator (CSFC) that explicitly addresses alignment deviations. These components work synergistically to improve detection precision for small targets under challenging conditions, ensuring that JUYE UAV systems can operate effectively in real-time environments.

Methodology

Our proposed CAPE-RT-DETR framework is designed to address the specific demands of low-altitude Unmanned Aerial Vehicle operations. The model enhances the baseline RT-DETR by integrating advanced feature processing and spatial encoding techniques. The core improvements include the C2ML module for dynamic feature extraction, the AIFP module for enhanced positional awareness, and the CSFC module for multi-scale alignment. Each component is elaborated below with mathematical formulations and structural details.

C2ML Module: Dynamic Feature Enhancement

The C2ML module replaces static convolutional operations with adaptive mechanisms to capture multi-scale contextual information. It employs a Large Kernel Predictor (LKP) to generate dynamic kernels based on input features, allowing the model to adapt to varying target sizes and backgrounds commonly encountered by Unmanned Aerial Vehicles. The LKP process is defined as:

$$K = \text{GroupNorm}(\text{Conv}_{1 \times 1}(\phi(\text{DWConv}_{7 \times 7}(X))))$$

Here, \(X\) represents the input feature map, \(\text{DWConv}_{7 \times 7}\) denotes a depthwise convolution with a 7×7 kernel, \(\phi\) is a non-linear activation function, and \(\text{GroupNorm}\) ensures training stability. The output \(K\) is a set of adaptive kernels that enhance feature representation for small objects. A gating mechanism further refines features by partitioning channels into segments for identity retention and dynamic processing:

$$Y = X + \gamma(\text{FC}_2(\delta(g) \odot [i; \text{SKA}(c, K)]))$$

In this equation, \(g\), \(i\), and \(c\) correspond to gating signals, identity features, and convolutional branches, respectively. The operator \(\odot\) denotes element-wise multiplication, \(\text{SKA}\) represents spatial kernel adaptation, and \(\gamma\) controls residual connections through DropPath regularization. This design enables JUYE UAV systems to maintain high recall rates for occluded or distant targets.
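
To make the C2ML formulation concrete, the following PyTorch sketch implements the two equations above under stated assumptions: the module names (`LargeKernelPredictor`, `C2MLBlock`), the 3×3 kernel area for the dynamic kernels, the sigmoid gate standing in for \(\delta\), the 1×1 projection that produces the \(g\), \(i\), \(c\) segments, and the simplified spatial kernel adaptation are illustrative choices, not the authors' released implementation.

```python
import torch
import torch.nn as nn


class LargeKernelPredictor(nn.Module):
    """LKP sketch: K = GroupNorm(Conv1x1(phi(DWConv7x7(X))))."""

    def __init__(self, channels, kernel_area=9, groups=8):
        super().__init__()
        self.dwconv = nn.Conv2d(channels, channels, 7, padding=3, groups=channels)
        self.act = nn.GELU()  # phi: non-linear activation
        self.proj = nn.Conv2d(channels, channels * kernel_area, 1)
        self.norm = nn.GroupNorm(groups, channels * kernel_area)

    def forward(self, x):
        # Per-position dynamic kernels, shape (B, C * kernel_area, H, W).
        return self.norm(self.proj(self.act(self.dwconv(x))))


class C2MLBlock(nn.Module):
    """Gated residual sketch: Y = X + gamma(FC2(delta(g) * [i; SKA(c, K)]))."""

    def __init__(self, channels, drop_rate=0.1):
        super().__init__()
        self.lkp = LargeKernelPredictor(channels)
        self.fc1 = nn.Conv2d(channels, 3 * channels, 1)  # produces g, i, c segments
        self.fc2 = nn.Conv2d(2 * channels, channels, 1)
        self.gate = nn.Sigmoid()            # delta: gate activation (assumed)
        self.gamma = nn.Dropout(drop_rate)  # simple stand-in for DropPath

    def ska(self, c, k):
        # Simplified spatial kernel adaptation: modulate the conv branch with
        # the mean of the predicted dynamic kernels (illustrative only).
        b, ch, h, w = c.shape
        k = k.view(b, ch, -1, h, w).mean(dim=2)
        return c * torch.sigmoid(k)

    def forward(self, x):
        g, i, c = self.fc1(x).chunk(3, dim=1)
        k = self.lkp(x)
        fused = torch.cat([i, self.ska(c, k)], dim=1)          # [i; SKA(c, K)]
        y = self.fc2(self.gate(g).repeat(1, 2, 1, 1) * fused)  # delta(g) gating
        return x + self.gamma(y)                               # residual update


# Example: a 64-channel feature map keeps its shape through the block.
block = C2MLBlock(64)
out = block(torch.randn(2, 64, 80, 80))  # -> torch.Size([2, 64, 80, 80])
```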

AIFP Module: Position-Aware Interaction

The AIFP module enhances spatial reasoning by integrating learnable positional encodings with multi-head self-attention. Unlike fixed sinusoidal encodings, this approach allows the model to adapt to geometric transformations in low-altitude perspectives. Given an input feature map \(X\), the module first flattens its spatial dimensions into a sequence \(S\) and injects positional information:

$$S = \text{Flatten}_{HW}(X) \in \mathbb{R}^{B \times N \times C}$$
$$\tilde{S} = S + P$$

Here, \(P \in \mathbb{R}^{B \times N \times C}\) is the learnable positional encoding, and \(\tilde{S}\) is the enriched sequence. The multi-head attention mechanism then computes queries, keys, and values as:

$$Q = \tilde{S}W_Q, \quad K = \tilde{S}W_K, \quad V = \tilde{S}W_V$$

The attention output is obtained through:

$$O = \text{Attention}(Q, K, V)W_O$$

where \(W_O\) is the output projection matrix. A feed-forward network with GELU activations and layer normalization further processes the output, improving the model’s ability to locate targets in complex scenes captured by Unmanned Aerial Vehicles.
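
A minimal PyTorch sketch of the AIFP computation is given below. The projections \(W_Q\), \(W_K\), \(W_V\), and \(W_O\) are handled internally by `nn.MultiheadAttention`; the fixed token budget for the learnable encoding, the head count, and the post-norm residual layout are assumptions made for illustration, since the text specifies the components but not their exact arrangement.

```python
import torch
import torch.nn as nn


class AIFPBlock(nn.Module):
    """Learnable positional encoding + multi-head self-attention (sketch)."""

    def __init__(self, channels, num_heads=8, max_tokens=400, ffn_ratio=4):
        super().__init__()
        # Learnable positional encoding P, shared across the batch; max_tokens
        # must be at least H * W of the incoming feature map (assumption).
        self.pos_embed = nn.Parameter(torch.zeros(1, max_tokens, channels))
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(channels)
        self.norm2 = nn.LayerNorm(channels)
        self.ffn = nn.Sequential(
            nn.Linear(channels, ffn_ratio * channels),
            nn.GELU(),
            nn.Linear(ffn_ratio * channels, channels),
        )

    def forward(self, x):
        b, c, h, w = x.shape
        s = x.flatten(2).transpose(1, 2)        # S = Flatten_HW(X), shape (B, N, C)
        s = s + self.pos_embed[:, : h * w]      # enriched sequence S + P
        attn_out, _ = self.attn(s, s, s)        # Q, K, V from the enriched sequence
        s = self.norm1(s + attn_out)            # residual + layer normalization
        s = self.norm2(s + self.ffn(s))         # feed-forward network with GELU
        return s.transpose(1, 2).reshape(b, c, h, w)


# Example: a 20x20 top-level feature map of a 640x640 input gives N = 400 tokens.
out = AIFPBlock(256)(torch.randn(2, 256, 20, 20))  # -> torch.Size([2, 256, 20, 20])
```

In RT-DETR-style models, this kind of intra-scale attention is typically applied only to the lowest-resolution feature level, which keeps the token count small and the block affordable at real-time frame rates.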

CSFC Module: Multi-Scale Fusion Compensation

The CSFC module addresses feature misalignment across scales by combining context aggregation and spatial flow guidance. It consists of two sub-modules: Cross-scale Feature Context (CFC) and Spatial-flow-guided Feature Compensation (SFC). The CFC module employs pyramid scene parsing to capture global context:

$$F_{\text{psp}} \in \mathbb{R}^{B \times C' \times S}, \quad S = \sum_i g_i \cdot \max\left(1, \left\lceil \frac{W}{H g_i} \right\rceil\right)$$

where \(\{g_i\}\) defines multi-scale grid groups. Attention-based fusion is applied to integrate context:

$$A = \text{Softmax}(Q^T K), \quad C = VA^T$$
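
The sketch below illustrates the CFC idea: pyramid pooling to build the context tokens, followed by the attention fusion \(A = \text{Softmax}(Q^{T}K)\), \(C = VA^{T}\). It assumes square pooling grids (so \(S = \sum_i g_i^2\)), 1×1 projections, and a residual output projection; the grid sizes and reduced channel width are illustrative choices rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossScaleFeatureContext(nn.Module):
    """CFC sketch: pyramid pooling + attention fusion A = softmax(Q^T K), C = V A^T."""

    def __init__(self, channels, reduced=64, grids=(1, 3, 6, 8)):
        super().__init__()
        self.grids = grids
        self.q_proj = nn.Conv2d(channels, reduced, 1)
        self.k_proj = nn.Conv2d(channels, reduced, 1)
        self.v_proj = nn.Conv2d(channels, reduced, 1)
        self.out_proj = nn.Conv2d(reduced, channels, 1)

    def pyramid_tokens(self, x):
        # F_psp: adaptively pool to each grid size and concatenate the tokens.
        # Square grids are assumed here, so S = sum_i g_i^2.
        return torch.cat(
            [F.adaptive_avg_pool2d(x, g).flatten(2) for g in self.grids], dim=2
        )

    def forward(self, x):
        b, _, h, w = x.shape
        q = self.q_proj(x).flatten(2)                       # (B, C', N), N = H * W
        k = self.pyramid_tokens(self.k_proj(x))             # (B, C', S)
        v = self.pyramid_tokens(self.v_proj(x))             # (B, C', S)
        attn = F.softmax(q.transpose(1, 2) @ k, dim=-1)     # A = softmax(Q^T K)
        ctx = (v @ attn.transpose(1, 2)).view(b, -1, h, w)  # C = V A^T
        return x + self.out_proj(ctx)                       # inject context residually
```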

The SFC module predicts spatial offsets to align features:

$$\Delta_l, \Delta_h = \text{GConv}_{3 \times 3}(\text{Concat}(F_{\text{cp}}, F_{\text{sp}}^{\uparrow}))$$

Here, \(\text{GConv}\) denotes grouped convolution, and the adjusted grid \(G_{\text{new}} = G_{\text{base}} + \Delta_{[W,H]}\) is used for feature resampling. The fused output is computed as:

$$F_{\text{fused}} = F_{\text{sp}}^{\text{grid}} \cdot \omega_1 + F_{\text{cp}}^{\text{grid}} \cdot \omega_2$$

where \(\omega_k = 1 + \tanh(\cdot)\) are adaptive weights. This mechanism ensures precise alignment for small targets, critical for JUYE UAV applications in urban environments.
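
A hedged PyTorch sketch of the SFC step follows: it upsamples the coarser feature, predicts \(\Delta_l\) and \(\Delta_h\) with a grouped 3×3 convolution, resamples both branches with `grid_sample` on the offset-adjusted grid (offsets normalized by the feature width and height, which is our reading of \(\Delta_{[W,H]}\)), and fuses them with \(\omega_k = 1 + \tanh(\cdot)\) weights. The group count and the small convolution that predicts the fusion weights are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SpatialFlowCompensation(nn.Module):
    """SFC sketch: offset-guided resampling and adaptive fusion of two levels."""

    def __init__(self, channels, groups=2):
        super().__init__()
        # GConv3x3 predicting Delta_l and Delta_h (2 offset channels each).
        self.offset_conv = nn.Conv2d(2 * channels, 4, 3, padding=1, groups=groups)
        # Predicts the inputs to the adaptive weights w_k = 1 + tanh(.) (assumed).
        self.weight_conv = nn.Conv2d(2 * channels, 2, 3, padding=1)

    def forward(self, f_cp, f_sp):
        # f_cp: context-compensated feature (B, C, H, W); f_sp: coarser-level feature.
        b, _, h, w = f_cp.shape
        f_sp_up = F.interpolate(f_sp, size=(h, w), mode="bilinear", align_corners=False)

        feats = torch.cat([f_cp, f_sp_up], dim=1)
        delta_l, delta_h = self.offset_conv(feats).chunk(2, dim=1)  # (B, 2, H, W) each

        # Base sampling grid G_base in normalized [-1, 1] coordinates.
        ys = torch.linspace(-1, 1, h, device=feats.device)
        xs = torch.linspace(-1, 1, w, device=feats.device)
        gy, gx = torch.meshgrid(ys, xs, indexing="ij")
        base = torch.stack((gx, gy), dim=-1).expand(b, h, w, 2)

        # G_new = G_base + Delta, with offsets normalized by [W, H] (our reading).
        norm = torch.tensor([w, h], dtype=feats.dtype, device=feats.device)
        grid_l = base + delta_l.permute(0, 2, 3, 1) / norm
        grid_h = base + delta_h.permute(0, 2, 3, 1) / norm

        f_cp_grid = F.grid_sample(f_cp, grid_l, align_corners=False)
        f_sp_grid = F.grid_sample(f_sp_up, grid_h, align_corners=False)

        # F_fused = F_sp^grid * w1 + F_cp^grid * w2, with w_k = 1 + tanh(.).
        w1, w2 = (1 + torch.tanh(self.weight_conv(feats))).chunk(2, dim=1)
        return f_sp_grid * w1 + f_cp_grid * w2


# Example: fuse an 80x80 feature with the upsampled version of a 40x40 feature.
fused = SpatialFlowCompensation(128)(torch.randn(1, 128, 80, 80),
                                     torch.randn(1, 128, 40, 40))
```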

Experimental Setup and Results

We evaluated CAPE-RT-DETR on two datasets, ALU and VisDrone2019, both containing annotated images from low-altitude Unmanned Aerial Vehicle perspectives. The ALU dataset includes 1,593 images with six object categories, while VisDrone2019 comprises 8,629 images across multiple classes, including pedestrians and vehicles. Experiments were conducted on an Ubuntu system with an NVIDIA A10 GPU, using the AdamW optimizer and an input resolution of 640×640. Performance was measured using mean Average Precision (mAP), parameter count (Params), GFLOPs, and frames per second (FPS).

The average precision (AP) for each class is calculated as:

$$\text{AP} = \int_0^1 P(R)\, dR \approx \frac{1}{m} \sum_{i=1}^{m} P_i$$

where \(P\) and \(R\) denote precision and recall, respectively, and the integral over the precision-recall curve is approximated by averaging precision at \(m\) sampled recall levels. The mean AP (mAP) is then derived by averaging over all \(N\) classes:

$$\text{mAP} = \frac{1}{N} \sum_{j=1}^N \text{AP}_j$$
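
For reference, the following NumPy sketch evaluates these two formulas: it computes AP as the area under a monotonized precision-recall curve and averages the per-class values into mAP. It is a simplified stand-in for the full evaluation protocol (e.g., the additional averaging over IoU thresholds behind mAP@0.5:0.95), and the two-class example uses hand-made precision-recall points purely for illustration.

```python
import numpy as np


def average_precision(recall, precision):
    """AP as the area under a monotonized precision-recall curve (sketch)."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    # Enforce a non-increasing precision envelope before integrating.
    p = np.maximum.accumulate(p[::-1])[::-1]
    # AP = integral of P(R) dR, evaluated as a sum over recall steps.
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))


def mean_average_precision(ap_per_class):
    """mAP = (1 / N) * sum_j AP_j over the N object categories."""
    return float(np.mean(ap_per_class))


# Hypothetical two-class example with hand-made precision-recall points.
ap_car = average_precision(np.array([0.2, 0.5, 0.8]), np.array([0.9, 0.8, 0.6]))
ap_ped = average_precision(np.array([0.1, 0.4, 0.7]), np.array([0.8, 0.7, 0.5]))
print(mean_average_precision([ap_car, ap_ped]))
```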

Our results demonstrate that CAPE-RT-DETR achieves state-of-the-art performance in detecting small objects from low-altitude views, as summarized in the following tables.

Table 1: Performance Comparison on the ALU Dataset

| Model | Precision (%) | Recall (%) | mAP@0.5 (%) | mAP@0.5:0.95 (%) |
|---|---|---|---|---|
| RT-DETR | 84.2 | 74.7 | 75.5 | 47.4 |
| CAPE-RT-DETR | 88.0 | 76.1 | 79.3 | 51.2 |

Table 2: Comprehensive Results on the VisDrone2019 Dataset

| Model | Params (M) | GFLOPs | FPS | mAP@0.5 (%) | mAP@0.5:0.95 (%) |
|---|---|---|---|---|---|
| Faster R-CNN | – | – | – | 33.5 | 17.3 |
| YOLOv8m | 25.3 | 76.4 | 221 | 41.3 | 25.1 |
| DETR | 41.56 | 187.2 | 32.5 | 38.5 | 26.5 |
| RT-DETR | 19.7 | 56.2 | 105 | 44.3 | 26.9 |
| CAPE-RT-DETR | 14.2 | 70.1 | 52 | 48.7 | 30.5 |

CAPE-RT-DETR outperforms existing methods in both accuracy and efficiency, with a significant improvement in mAP scores while reducing model parameters. The framework’s ability to handle scale variations and occlusions makes it particularly suitable for JUYE UAV deployments in congested low-altitude airspace.

Ablation Study

To validate the contribution of each module, we conducted ablation experiments on the VisDrone2019 dataset. The baseline RT-DETR was incrementally enhanced with C2ML, AIFP, and CSFC modules, and the results are presented below.

Table 3: Ablation Study on Module Contributions

| Experiment | C2ML | AIFP | CSFC | Params (M) | mAP@0.5 (%) | mAP@0.5:0.95 (%) |
|---|---|---|---|---|---|---|
| Baseline | × | × | × | 41.2 | 44.3 | 26.9 |
| 1 | ✓ | × | × | 38.3 | 47.7 | 27.9 |
| 2 | × | ✓ | × | 34.7 | 45.6 | 28.0 |
| 3 | × | × | ✓ | 31.9 | 46.0 | 27.6 |
| 4 | × | ✓ | ✓ | 26.5 | 46.7 | 28.4 |
| 5 (Full) | ✓ | ✓ | ✓ | 23.1 | 48.7 | 30.5 |

The results indicate that each module independently improves performance, with the full integration achieving the highest accuracy. The C2ML module contributes the most to feature enhancement, while AIFP and CSFC synergistically enhance spatial and scale awareness. This demonstrates the effectiveness of our design choices for Unmanned Aerial Vehicle-based detection systems.

Conclusion

In this study, we presented CAPE-RT-DETR, an advanced detection framework tailored for low-altitude Unmanned Aerial Vehicle operations. By integrating dynamic feature enhancement, learnable positional encoding, and cross-scale alignment compensation, our model addresses key challenges in small-object detection. Experimental results on benchmark datasets confirm its superiority in accuracy and efficiency, making it a viable solution for real-time applications. Future work will focus on integrating detection outputs with path planning systems to enhance autonomous navigation for JUYE UAVs in complex environments.
