
The rapid advancement of UAV drone technology has significantly expanded its applications across various fields, concurrently driving the growing demand for reliable and efficient aerial detection systems. Infrared imaging, owing to its advantages in passive detection, all-weather operation, and strong anti-interference capability, has become a crucial sensing modality for UAV drone platforms, especially for surveillance, reconnaissance, and search-and-rescue missions. However, the detection of small targets in infrared imagery captured by UAV drones remains a formidable challenge. These targets often appear as tiny, low-contrast thermal signatures, occupying only a few pixels in the image. Their weak feature expression is further compounded by complex, cluttered backgrounds, such as clouds, terrain, or thermal noise, making them easily confusable with background artifacts. This presents a significant obstacle to achieving robust situational awareness and autonomous navigation for UAV drones. Consequently, designing an effective infrared small target detection algorithm that is adaptable to complex scenarios, computationally efficient, and maintains high accuracy is a critical research direction. In this work, we address these challenges by proposing an improved detection framework, PGF-RTDETR, which enhances multi-scale feature perception and adaptive filtering for superior performance in UAV drone infrared small target detection.
1. Introduction and Related Work
Deep learning-based object detection algorithms have demonstrated immense potential for UAV drone infrared target detection. Existing methods are broadly categorized into two-stage and one-stage detectors. Two-stage algorithms, such as Faster R-CNN, first generate region proposals and then perform classification and regression. While often achieving high accuracy, their multi-stage pipeline leads to high computational complexity and slow inference speeds, which are less suitable for the real-time and lightweight requirements of UAV drone platforms. In contrast, one-stage algorithms like the YOLO series and DETR variants perform localization and classification in a single network pass, offering faster inference and lower computational burden, making them more amenable for deployment on resource-constrained UAV drone hardware.
To enhance the detection of infrared small targets under complex conditions, numerous innovations based on the YOLO architecture have been proposed. For instance, YOLO-TSL incorporates a triplet attention mechanism and a Slim-Neck structure to boost feature extraction while reducing complexity. YOLO-ViT integrates a vision transformer backbone and CARAFE upsampling to improve feature representation for multi-scale targets. YOLO-FR addresses feature loss during sampling by introducing dedicated down-sampling and up-sampling modules. Similarly, YOLO-SDLUWD employs spatial depth convolutions and an enhanced feature pyramid to mitigate information loss. Despite their progress, YOLO-based methods fundamentally rely on local receptive fields of convolutional operations, which struggle to capture long-range dependencies. This limitation can lead to increased false positives and missed detections in low-contrast infrared images where targets are easily confused with background clutter. Furthermore, their dependence on Non-Maximum Suppression (NMS) with fixed thresholds can cause erroneous suppression in scenes with dense or multi-scale targets.
The Detection Transformer (DETR) framework introduced a promising paradigm shift by eliminating the need for hand-crafted components like anchors and NMS, offering an end-to-end detection approach with global modeling capabilities via the Transformer architecture. However, the original DETR suffers from slow convergence and insufficient modeling capacity for small targets. Its real-time variant, RT-DETR, addressed the speed issue through an efficient hybrid encoder and IoU-aware query selection. Recent works like Hybrid-DETR and FP-RTDETR have further attempted to enhance small target detection within the DETR framework by designing specialized backbones or feature fusion modules. Nevertheless, a key limitation persists: existing methods often lack an explicit mechanism to resolve the feature ambiguity between extremely weak small targets and strong background clutter in infrared imagery. Additionally, there is room for improvement in both feature extraction efficiency and fine-grained modeling capability.
To overcome these limitations, we present PGF-RTDETR, an enhanced model built upon the RT-DETR framework. Our contributions are threefold and are summarized in the network structure diagram below. First, we construct a Polarized Channel-Gated (PCG) unit to replace the standard feature fusion module. This unit leverages a polarized attention mechanism and a gating mechanism to guide deep semantic fusion of multi-scale features, effectively enhancing the discriminability of small targets against complex backgrounds. Second, we design an Adaptive Convolution Enhancement module, gConvC3, which employs a gating mechanism to increase feature selectivity and fineness, thereby improving the extraction efficiency of fine-grained features critical for small targets. Finally, in the backbone network, we replace the standard residual blocks with a more efficient structure called FasterBlock, which integrates partial convolution and channel transformation. This modification enhances feature expression power while simultaneously and significantly reducing the model’s parameter count and computational load, aligning with the efficiency needs of UAV drone deployment.
2. Methodology: The PGF-RTDETR Framework
Our proposed PGF-RTDETR algorithm is designed to improve the accuracy and efficiency of infrared small target detection for UAV drones. The overall architecture builds upon the RT-DETR framework, which consists of a backbone for feature extraction, an efficient hybrid encoder for feature interaction and fusion, and a Transformer decoder for final prediction. We introduce targeted modifications in all three parts, as illustrated in the following structural overview.
2.1 Polarized Channel-Gated (PCG) Unit for Feature Fusion
In UAV drone infrared small target detection, targets occupy minimal pixels and their subtle features can be diluted or lost during deep convolutional processing. Standard convolutional networks operating on single-scale features often fail to capture the necessary multi-scale context, leading to insufficient feature expression and weakened discriminative power for targets submerged in clutter.
To address the dual problems of weak feature expression for small targets and limited adaptability of single-scale features, we propose the Polarized Channel-Gated (PCG) unit. This unit replaces the standard AIFI module in the hybrid encoder. The PCG synergistically combines a Polarity-aware Linear Attention (Pola) mechanism with a Convolutional Gated Linear Unit (CGLU). The core innovation of the Pola mechanism is its explicit modeling of both correlation and contrast within features through polarity decomposition. This is particularly beneficial for infrared imagery where small target signals are weak and easily confused with background clutter. Pola decomposes the Query (Q) and Key (K) matrices into positive and negative polarity branches using ReLU and its variants. This process explicitly guides the model to learn the feature contrast between the infrared small target and the background clutter, while also capturing the feature correlation within homogeneous background regions. By amplifying the contrast at the feature level, it enhances the model’s ability to distinguish faint targets.
The CGLU structure complements this by applying an adaptive gating mechanism to regulate the information flow. Tailored for the low signal-to-noise ratio characteristic of infrared images, this gating mechanism dynamically assesses the importance of each feature channel. It automatically amplifies channels containing crucial small target information while suppressing channels dominated by background noise. The gated fusion process is defined as:
$$ \mathbf{Y} = \mathbf{F} \odot \mathbf{G} $$
where $\mathbf{F}$ is the feature branch, $\mathbf{G}$ is the gating branch, and $\odot$ denotes the Hadamard (element-wise) product.
By integrating Pola’s polarity-aware contrast enhancement with CGLU’s channel-wise adaptive filtering, the PCG unit enables more effective fusion of shallow, fine-grained features with deep, semantic features. The result is a feature representation with heightened discriminative power and robustness, significantly boosting detection performance for infrared small targets amidst complex background interference from a UAV drone’s perspective.
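The two ingredients of the PCG unit can be illustrated with a minimal, pure-Python sketch. This is a toy stand-in, not the paper's implementation: the polarity decomposition is shown on a feature vector, and the channel gate is simplified to a sigmoid of each channel's mean response (the real CGLU learns its gate with convolutions). The function names `polarity_split` and `channel_gate` are illustrative, not from the source.

```python
import math

def relu(x):
    return max(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def polarity_split(vec):
    """Pola-style polarity decomposition of a feature vector: the positive
    branch keeps same-sign (correlation) responses, the negative branch keeps
    opposite-sign (contrast) responses."""
    pos = [relu(v) for v in vec]
    neg = [relu(-v) for v in vec]
    return pos, neg

def channel_gate(features):
    """CGLU-style gating, Y = F ⊙ G: each channel is scaled by a sigmoid gate,
    here derived from its own mean response (a simplification), so channels
    with strong target signal are amplified and noise-dominated ones damped."""
    gated = []
    for ch in features:                    # features: list of channels
        g = sigmoid(sum(ch) / len(ch))     # scalar gate per channel
        gated.append([v * g for v in ch])  # Hadamard product per channel
    return gated
```

A channel with a strong positive mean passes through nearly unchanged (gate near 1), while a channel with a strongly negative mean is attenuated (gate near 0), which is the intended adaptive-filtering behavior.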
2.2 Adaptive Convolution Enhancement with gConvC3
While existing modules like RepC3 can enhance feature expression in multi-scale fusion tasks, they often lack adequate feature selectivity and adaptability. Their reliance on local convolutional operations limits their ability to capture long-range dependencies and to adaptively filter key information, which is essential when dealing with the weak, small thermal signatures of targets against noisy infrared backgrounds captured by UAV drones.
To overcome this limitation, we designed the gConvC3 module, based on a gated convolution block (gConv) integrated into a Cross-Stage Partial (CSP) structure. The gConvC3 module uses a gating mechanism to perform adaptive weighting on both feature channels and spatial information, dynamically modulating the contribution of different features during fusion. This mechanism selectively enhances the low-contrast, small-scale, and blurry thermal radiation features of targets while effectively suppressing irrelevant features from background interference. This selective enhancement of critical target features and suppression of background noise substantially improves the network’s discriminative capability in complex infrared scenes encountered by UAV drones.
Within the CSP structure, the main branch is composed of stacked gated convolution units. In each unit, the normalized input feature $\hat{\mathbf{X}}$ is fed into two parallel branches. One branch, $\mathbf{x}_1$, generates adaptive gating weights via a 1×1 convolution followed by an activation function. The other branch, $\mathbf{x}_2$, extracts local spatial features using depthwise separable convolution. The two branches are then fused via Hadamard product, where the gating weights adaptively modulate the local spatial features. This modulated output is subsequently processed by a pointwise convolution and combined with a residual connection $\mathbf{x}$ to produce the final output $\mathbf{y}$. The core gated fusion and output process can be formulated as:
$$ \mathbf{y} = \mathbf{x} + \text{PWConv}(\mathbf{x}_1 \odot \mathbf{x}_2) $$
where PWConv denotes pointwise convolution.
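The gated fusion $\mathbf{y} = \mathbf{x} + \text{PWConv}(\mathbf{x}_1 \odot \mathbf{x}_2)$ can be sketched at a single spatial position, where 1×1 convolutions reduce to matrix-vector products over channels. This is a toy sketch under simplifying assumptions, not the paper's module: the depthwise convolution collapses to a per-channel scale at one pixel, and the weights are illustrative placeholders.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(W, x):
    """A 1x1 convolution at one spatial position is a matrix-vector product."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def gconv_unit(x, W_gate, dw_scale, W_pw):
    """One gated-convolution unit at a single pixel:
        y = x + PWConv(x1 ⊙ x2)
    where x1 are sigmoid gating weights from a 1x1 conv and x2 are the
    'depthwise' spatial features (here just a per-channel scale)."""
    x1 = [sigmoid(v) for v in matvec(W_gate, x)]   # adaptive gating weights
    x2 = [s * v for s, v in zip(dw_scale, x)]      # local spatial features
    fused = [a * b for a, b in zip(x1, x2)]        # Hadamard product
    pw = matvec(W_pw, fused)                       # pointwise convolution
    return [xi + yi for xi, yi in zip(x, pw)]      # residual connection
```

With identity weights, a zero input passes through unchanged (the gate of 0.5 multiplies zero features), while nonzero channels are modulated before being added back to the residual, matching the formula above term by term.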
2.3 Lightweight Backbone with FasterBlock
The design of the backbone network is pivotal for feature extraction, especially under the challenging conditions of UAV drone infrared imaging, where targets are small, low-contrast, and set against noisy backgrounds. Traditional backbone networks often suffer from rapid loss of fine-grained detail due to successive down-sampling operations, creating a tension between detection accuracy and computational efficiency required for UAV drone platforms.
To resolve this, we optimize the backbone by replacing the standard BasicBlock residual units with a more efficient structure termed FasterBlock. The core idea of FasterBlock is to reduce redundant computation and memory access by processing only a subset of channels, while preserving the capacity to extract spatial details. It achieves this by integrating Partial Convolution (PConv) within a residual framework.
Standard convolution applies filters to all input channels, leading to significant redundant computation over large, uninformative background regions common in infrared imagery. In contrast, PConv exploits spatial and channel redundancy by performing convolution only on a subset of contiguous input channels (e.g., the first or last $c_p$ channels), leaving the remaining channels untouched. By concentrating computation on channels likely to contain key thermal signals, PConv optimizes resource utilization. When the partial ratio $r = c_p / c$ is set to 1/4, the Floating-Point Operations (FLOPs) of PConv become only 1/16 of a standard convolution, and the Memory Access Cost (MAC) is reduced to 1/4. The formulas are as follows:
$$ \text{FLOPs} = h \times w \times k^2 \times c_p^2 $$
$$ \text{MAC} = h \times w \times 2c_p + k^2 \times c_p^2 \approx h \times w \times 2c_p $$
where $h$ and $w$ are the spatial dimensions of the feature map, $k$ is the kernel size, and $c_p$ is the number of channels processed by PConv.
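The cost formulas above can be checked numerically with a short sketch. The helper name `pconv_costs` is illustrative; the ratios follow directly from the FLOPs and MAC expressions, comparing PConv against a standard convolution over all $c$ channels.

```python
def pconv_costs(h, w, k, c, r):
    """Ratios of PConv cost to standard-convolution cost.
    h, w: feature-map size; k: kernel size; c: total channels;
    r = c_p / c: partial ratio (fraction of channels convolved)."""
    cp = int(c * r)
    flops_std   = h * w * k * k * c * c      # conv over all c channels
    flops_pconv = h * w * k * k * cp * cp    # conv over c_p channels only
    mac_std     = h * w * 2 * c  + k * k * c * c
    mac_pconv   = h * w * 2 * cp + k * k * cp * cp   # ≈ h·w·2·c_p
    return flops_pconv / flops_std, mac_pconv / mac_std
```

With $r = 1/4$, the FLOPs ratio is exactly $(c_p/c)^2 = 1/16$, and for large feature maps the MAC ratio approaches $c_p/c = 1/4$, matching the reductions stated above.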
To maintain information flow integrity, FasterBlock follows the PConv operation with an inverted residual connection, which stabilizes gradient propagation via identity mapping. This design allows the backbone to focus computational resources on critical thermal signatures, dramatically reducing redundancy while maintaining sensitivity to the subtle, blurry features of infrared small targets. This makes the network better suited to meet the dual demands of accuracy and efficiency for UAV drone infrared detection tasks.
3. Experiments and Results
3.1 Experimental Setup
Datasets: Our primary experiments were conducted on the HIT-UAV dataset, a public collection of 2898 infrared thermal images captured by UAV drones at high altitudes. It covers diverse real-world scenes like schools, parking lots, and roads, and contains five object categories: Person, Bicycle, Car, Other Vehicle, and Don’t care, with approximately 25k instances in total. We split the data into training, validation, and test sets in a 7:2:1 ratio. To evaluate generalization, we also used the FLIR dataset, selecting the Person, Bicycle, and Car categories from its extensive thermal image collection.
Evaluation Metrics: We employ standard object detection metrics: Precision (P), Recall (R), Average Precision (AP), and mean Average Precision at an IoU threshold of 0.5 (mAP50). To assess model efficiency, we report the number of parameters (Params) and computational complexity in Giga-FLOPs (GFLOPs). The formulas for key metrics are:
$$ P = \frac{TP}{TP + FP}, \quad R = \frac{TP}{TP + FN} $$
$$ AP = \int_0^1 P(R) \, dR, \quad mAP = \frac{1}{n} \sum_{i=1}^{n} AP_i $$
where $TP$, $FP$, and $FN$ denote true positives, false positives, and false negatives, respectively, and $n$ is the number of classes.
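These metrics are straightforward to compute from detection counts; the sketch below shows precision and recall directly from the formulas, plus a toy numerical stand-in for the AP integral as an area under a piecewise precision-recall curve. This is illustrative only (standard evaluation toolkits use interpolated AP over many confidence thresholds).

```python
def precision_recall(tp, fp, fn):
    """P = TP / (TP + FP), R = TP / (TP + FN)."""
    return tp / (tp + fp), tp / (tp + fn)

def average_precision(recalls, precisions):
    """AP ≈ area under the precision-recall curve, here approximated by a
    simple rectangle sum over sorted recall points (toy approximation of
    the integral AP = ∫ P(R) dR)."""
    ap, prev_r = 0.0, 0.0
    for r, p in sorted(zip(recalls, precisions)):
        ap += (r - prev_r) * p
        prev_r = r
    return ap
```

For example, 8 true positives with 2 false positives and 2 false negatives gives P = R = 0.8, and a curve with precision 1.0 up to recall 0.5 then 0.5 up to recall 1.0 gives AP = 0.75.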
Implementation Details: All models were trained from scratch (without pre-trained weights) for 300 epochs with an input size of 640×640, a batch size of 4, and the AdamW optimizer with an initial learning rate of 0.0001.
3.2 Comparison with State-of-the-Art Methods
We compared PGF-RTDETR against several mainstream detectors on the HIT-UAV test set. The results are summarized in Table 1. Our model demonstrates a superior balance between accuracy and efficiency. PGF-RTDETR achieved the highest Precision (93.8%) and a leading mAP50 of 85.7%, outperforming models like YOLOv8m (91.9% P, 84.2% mAP50) and the baseline RT-DETR-R18 (87.2% P, 83.1% mAP50). Notably, it also surpassed other DETR-based infrared detection models like Hybrid-DETR and FP-RTDETR by a significant margin in mAP50. Crucially, this performance gain is achieved alongside a substantial reduction in model size and computation. With only 15.7M parameters and 43.7 GFLOPs, PGF-RTDETR is markedly more lightweight than its counterparts, making it highly suitable for deployment on resource-constrained UAV drone platforms. While RT-DETR-R34 achieved a slightly higher Recall (79.8% vs. 79.1%), this comes at the cost of nearly double the parameters and GFLOPs, and a lower Precision, indicating a trade-off that favors our model’s design philosophy for efficient and accurate UAV drone detection.
| Model | P/% | R/% | mAP50/% | Params/M | GFLOPs/G |
|---|---|---|---|---|---|
| YOLOv5m | 88.9 | 76.2 | 83.1 | 25.0 | 64.4 |
| YOLOv8m | 91.9 | 75.1 | 84.2 | 25.8 | 78.7 |
| YOLOv9m | 89.5 | 75.8 | 83.4 | 20.0 | 77.6 |
| YOLOv10m | 84.1 | 75.7 | 80.5 | 16.4 | 63.4 |
| RT-DETR-R18 (Baseline) | 87.2 | 77.1 | 83.1 | 19.8 | 58.3 |
| RT-DETR-R34 | 89.5 | 79.8 | 85.0 | 31.1 | 88.8 |
| Hybrid-DETR | 91.4 | 73.3 | 82.6 | – | – |
| FP-RTDETR | 89.1 | – | 82.4 | 20.0 | 49.3 |
| PGF-RTDETR (Ours) | 93.8 | 79.1 | 85.7 | 15.7 | 43.7 |
3.3 Ablation Studies
We conducted systematic ablation experiments on the HIT-UAV dataset to validate the contribution of each proposed component, using RT-DETR-R18 as the baseline. The results are presented in Table 2. The incremental addition of the PCG, gConvC3, and FasterBlock modules consistently improves performance, demonstrating clear synergistic benefits.
Introducing the PCG module alone (Model B) boosts mAP50 by 1.4% (from 83.1% to 84.5%) with a negligible increase in parameters and GFLOPs, confirming its effectiveness in enhancing feature perception for small targets. Adding the gConvC3 module on top of PCG (Model E) leads to a significant jump in Precision to 92.1% and mAP50 to 85.3%, while further reducing GFLOPs, highlighting its role in adaptive feature refinement. Finally, incorporating the FasterBlock into the backbone alongside the other modules yields our complete PGF-RTDETR model (Model H). It achieves the best overall performance (93.8% P, 85.7% mAP50) while simultaneously reducing parameters by 20.7% and GFLOPs by 25.0% compared to the baseline. This confirms that our improvements collectively enhance accuracy while substantially reducing model complexity, a critical advantage for UAV drone applications.
| Model | PCG | gConvC3 | FasterBlock | P/% | R/% | mAP50/% | GFLOPs/G | Params/M |
|---|---|---|---|---|---|---|---|---|
| A (Baseline) | – | – | – | 87.2 | 77.1 | 83.1 | 58.3 | 19.8 |
| B | √ | – | – | 88.2 | 78.2 | 84.5 | 58.6 | 20.0 |
| E | √ | √ | – | 92.1 | 79.0 | 85.3 | 51.7 | 18.8 |
| H (Ours) | √ | √ | √ | 93.8 | 79.1 | 85.7 | 43.7 | 15.7 |
3.4 Stability and Generalization Analysis
To ensure the reliability of our results, we performed stability analysis by training both the baseline and our final model five times with different random seeds. PGF-RTDETR showed consistent and superior performance with low standard deviation (e.g., mAP50 of $85.7\% \pm 0.14\%$ versus the baseline’s $83.1\% \pm 0.16\%$), confirming that the improvements are statistically significant and reproducible.
Furthermore, we evaluated the generalization capability on the FLIR dataset. As shown in Table 3, PGF-RTDETR maintains its advantage, outperforming the baseline RT-DETR-R18 and other models like YOLOv8m in mAP50. This demonstrates that the improvements endowed by our modules—enhanced multi-scale perception and adaptive filtering—are not dataset-specific but generalize well to other infrared detection tasks, reinforcing the robustness of our approach for diverse UAV drone sensing scenarios.
| Model | P/% | R/% | mAP50/% |
|---|---|---|---|
| YOLOv8m | 84.6 | 70.8 | 78.7 |
| RT-DETR-R18 | 84.4 | 74.1 | 82.6 |
| PGF-RTDETR (Ours) | 84.8 | 74.7 | 83.2 |
3.5 Visualization and Qualitative Analysis
Visual analysis provides intuitive insights into the model’s behavior. Attention heatmaps reveal that our PGF-RTDETR model generates more focused and precise activation regions on target objects compared to the baseline, effectively suppressing responses to background clutter. This aligns with the design goal of our PCG and gConvC3 modules. Qualitative detection results on challenging HIT-UAV scenes further illustrate the improvements. In scenarios with occluded persons, densely parked vehicles, or crowded pedestrian areas, PGF-RTDETR consistently identifies more small and obscured targets that the baseline model misses, while also reducing false positives on background structures. A per-category analysis of mAP50 shows that our model achieves gains across all classes, with the most pronounced improvement on the challenging “Don’t care” category, indicating a stronger ability to suppress irrelevant or ambiguous regions—a key capability for reducing false alarms in UAV drone operations.
4. Discussion and Limitations
The experimental results comprehensively validate the effectiveness of the PGF-RTDETR algorithm for UAV drone infrared small target detection. The integration of the PCG unit, gConvC3 module, and FasterBlock structure creates a synergistic effect that enhances multi-scale feature discrimination, adaptive feature refinement, and computational efficiency. This allows the model to achieve a superior balance between high detection accuracy (evidenced by gains in mAP50 and Precision) and low resource footprint (evidenced by reduced Params and GFLOPs), making it a practical solution for deployment on UAV drone platforms.
However, a critical analysis of failure cases reveals the current limitations and points to future research directions. In extremely challenging scenarios, such as when a target is exceptionally tiny and isolated with an extremely low signal-to-noise ratio, PGF-RTDETR may still fail to detect it, as the target’s feature signal can be overwhelmed in the deep network layers. Furthermore, the model can occasionally be deceived by background structures that exhibit thermal patterns highly similar to targets (e.g., grid-like windows resembling cars from a UAV drone’s aerial view). This indicates that while our model has enhanced low-level feature perception and mid-level filtering, its high-level semantic reasoning capability in the face of structurally similar distractors can be further improved.
5. Conclusion
In this work, we presented PGF-RTDETR, an improved real-time detection transformer model tailored for the challenging task of infrared small target detection in imagery captured by UAV drones. To address the core problems of weak feature expression, background clutter, and model efficiency, we introduced three key innovations: a Polarized Channel-Gated (PCG) unit for enhanced multi-scale feature fusion and contrast, an Adaptive Convolution Enhancement module (gConvC3) for selective fine-grained feature extraction, and an efficient FasterBlock backbone structure for lightweight yet powerful feature representation. Extensive experiments on the HIT-UAV and FLIR datasets demonstrate that PGF-RTDETR significantly outperforms the baseline and other state-of-the-art methods in terms of detection accuracy while simultaneously reducing model complexity. It provides an effective and efficient technical pathway for advancing UAV drone-based infrared perception systems. Future work will focus on improving the model’s robustness to extremely weak signals and its high-level semantic discrimination against deceptive backgrounds, as well as exploring advanced model compression techniques to further optimize deployment efficiency on edge devices.
