The proliferation and misuse of unmanned aerial vehicles (UAVs) pose significant security threats to critical infrastructure, public safety, and privacy. Consequently, the development of effective anti-UAV systems has become a paramount research and operational imperative. Among various sensing modalities, passive infrared (IR) imaging offers distinct advantages for anti-UAV applications, including 24/7 operability, relative insensitivity to lighting conditions, and effective detection of heat signatures from propulsion systems and electronics. However, persistent and robust tracking of UAVs in IR video streams remains a formidable challenge due to factors such as small target size, low signal-to-clutter ratio (SCR), frequent occlusions, fast and erratic motion, and the presence of distractor heat sources (thermal crossover).
Traditional visual tracking paradigms often struggle in this domain. Short-term trackers, which typically search within a localized region around the previous target position, offer efficiency and some robustness to background clutter. Yet, they are inherently vulnerable to tracking failure during rapid target motion, full occlusions, or when the target briefly leaves the field of view. Long-term trackers address this by incorporating a global search or re-detection module, but this often introduces a vast number of negative samples from the background, diluting the feature representation and making the tracker susceptible to similar distractors. This creates a fundamental tension between local search precision and global search recall.
This paper proposes a novel Dynamic Region Focusing (DRF) framework for long-term infrared anti-UAV tracking. Our core insight is to move beyond the binary choice of local versus global search. Instead, we introduce a soft, spatially-variant focusing mechanism that dynamically constrains the search region based on joint spatio-temporal cues. This approach aims to synergistically combine the background clutter rejection of local search with the recovery capability of global search, thereby enhancing both discrimination and robustness. The main contributions are: 1) A Siamese backbone network enhanced with a Feature Pyramid Network (FPN) to better capture the multi-scale characteristics of small IR UAVs. 2) A Spatio-Temporally Constrained Dynamic Region Proposal Network (STC-DRPN) that predicts a target location probability map over the entire image by fusing appearance similarity from a template with a motion prior from a Kalman filter. This map is used to guide and “focus” anchor box generation onto high-probability regions. 3) A feature adaptation module to align features with the variably-sized, focused anchor boxes. Extensive experiments on the challenging Anti-UAV benchmark dataset demonstrate that our method achieves state-of-the-art performance in precision, success rate, and average overlap accuracy, while maintaining real-time processing speed.

1. Introduction
The ubiquitous adoption of UAV technology has been a double-edged sword. While enabling transformative applications in logistics, agriculture, and cinematography, it has also lowered the barrier for malicious activities such as smuggling, espionage, and attacks on sensitive facilities. Effective anti-UAV systems must reliably detect, track, classify, and ultimately mitigate unauthorized drones. Tracking forms the critical perceptual backbone for any subsequent countermeasure, be it jamming, spoofing, or kinetic interception. Infrared cameras are a cornerstone sensor in many anti-UAV systems due to their passive nature and day/night capability.
Infrared anti-UAV tracking presents unique difficulties distinct from general visual object tracking (VOT). The targets are often small, sometimes just a few pixels across, with minimal texture and no color information. Their thermal signature can be weak and highly variable, blending into complex urban or natural thermal clutter (e.g., hot rooftops, vehicles, animals). Furthermore, UAVs can execute high-speed maneuvers and disappear behind obstacles, demanding a tracker with both high fidelity and robust recovery logic. Many state-of-the-art (SOTA) trackers, particularly those based on the efficient Siamese framework, excel in short-term tracking benchmarks but are not designed for the long-term, “search-and-reacquire” scenarios fundamental to anti-UAV operations.
Global trackers like GlobalTrack and SiamRCNN bypass the local search window entirely, performing similarity matching across the full image. This grants them inherent re-detection ability. However, this strength is also a weakness: comparing the target template against every possible location floods the network with easy negative samples (e.g., sky, ground) that dominate the learning process and can reduce sensitivity to hard negatives (actual distractor objects). The tracker’s discrimination power is thus sub-optimal for the anti-UAV task where distinguishing a drone from a bird or another small hot spot is crucial.
We posit that an ideal anti-UAV tracker should dynamically allocate its “attention.” When the target’s motion is predictable and its location certain, the search should be focused to minimize distraction. When uncertainty is high due to motion blur, occlusion, or disappearance, the search should expand to facilitate re-acquisition. Our Dynamic Region Focusing algorithm implements this philosophy. It begins with a global viewpoint, generating a coarse heatmap of target presence likelihood by fusing an appearance correlation map (Where does it look like the template?) with a motion prediction map (Where is it likely to be based on its past trajectory?). Regions below a probability threshold are effectively pruned from detailed consideration. A novel Dynamic Region Proposal Network then generates anchor boxes only within these high-likelihood, focused regions. This two-stage process—from global probability estimation to focused region analysis—strikes an effective balance. It significantly reduces the number of irrelevant anchors compared to a uniform global search, alleviating the negative sample imbalance and sharpening the feature representation for the remaining, more challenging candidates. Simultaneously, it preserves the global search’s ability to re-find the target anywhere in the frame, as the initial probability estimation is performed over the entire image.
The remainder of this paper is structured as follows. Section 2 reviews related work in Siamese tracking and long-term tracking. Section 3 details our proposed methodology, including the FPN-enhanced backbone and the STC-DRPN. Section 4 presents comprehensive experimental results and analysis on the Anti-UAV dataset. Finally, Section 5 concludes the paper and discusses future work.
2. Related Work
Siamese Network-based Trackers. The seminal work of SiamFC framed tracking as a one-shot similarity learning problem. It used a Siamese network to embed a target template and a larger search region into a feature space where their cross-correlation yields a response map. SiamRPN enhanced this by integrating a Region Proposal Network (RPN) from object detection, enabling simultaneous classification and accurate bounding box regression. SiamRPN++ relaxed the strict translation-invariance restriction that had limited Siamese trackers to shallow backbones, allowing the use of deeper networks like ResNet, and introduced a multi-level aggregation strategy. Subsequent research has focused on improving robustness through online model updating (e.g., ATOM, DiMP), designing more discriminative feature representations, and suppressing distractors. However, the core architecture of these trackers typically assumes target presence within a localized search region centered on the previous position, making them inherently short-term.
Long-term Trackers. These trackers explicitly handle target disappearance and re-appearance. Early methods like TLD combined a local tracker with a detector and an online learning mechanism. Modern approaches often maintain separate modules for short-term tracking and global re-detection, with a manager to switch between them (e.g., LTMU, SPLT). The switching logic itself can be a source of failure. More recently, fully global Siamese trackers have emerged. GlobalTrack directly performs template-search correlation over full images at multiple scales. SiamRCNN reformulates tracking as a re-detection task using a Siamese-based R-CNN architecture. While elegant, these pure global methods suffer from the negative sample issue described earlier. Some works have begun to hybridize these ideas. For instance, a spatial-temporal attention mechanism was used within a SiamRCNN framework for anti-UAV tracking. Our work differs by integrating the spatio-temporal constraint at the foundational region proposal stage, creating a dynamic search focus that is learned end-to-end.
Infrared and Anti-UAV Tracking. The release of the Anti-UAV dataset has spurred dedicated research. Methods like SiamYOLO incorporated motion cues to aid discrimination. Others have adapted general SOTA trackers with specific tweaks for IR data. Our work is squarely aimed at this domain, addressing its core challenges through a novel dynamic search region formulation.
3. Proposed Method
The overall architecture of our Dynamic Region Focusing tracker is illustrated in Figure 1. It consists of two main components: (1) a Siamese Feature Extraction Network with FPN, and (2) the Spatio-Temporally Constrained Dynamic Region Proposal Network (STC-DRPN).
3.1 Siamese Feature Extraction Network with FPN
Given a template image \( I_z \) (usually the first frame with an initial bounding box \( B_1 \)) and a search image \( I_s \) (a subsequent frame), we first extract their multi-scale deep features. A shared backbone network \( \phi(\cdot) \) (e.g., ResNet-50) processes both images:
$$ z = \phi(I_z), \quad s = \phi(I_s) $$
where \( z, s \) are the initial feature maps.
To effectively handle the small size of UAVs, we employ a Feature Pyramid Network \( \psi(\cdot) \) on top of the backbone. The FPN fuses high-level semantic features from deep layers with low-level detailed features from shallow layers, generating a pyramid of multi-scale feature maps with uniform channel depth:
$$ \mathbf{Z} = \{Z_i\}_{i=1}^n = \psi(z), \quad \mathbf{S} = \{S_i\}_{i=1}^n = \psi(s) $$
We select the three highest-resolution levels (P2, P3, P4 corresponding to strides of 4, 8, 16 w.r.t. the input image) for our task, as they provide the fine-grained spatial details necessary for localizing small targets. Finally, we use ROIAlign to extract the template target features from \( \mathbf{Z} \) based on \( B_1 \):
$$ \mathbf{F}_z = \{F_{z_i}\}_{i=1}^n, \quad F_{z_i} = \text{ROIAlign}(Z_i, B_1) $$
These multi-scale template features \( \mathbf{F}_z \) will be used for correlation with the search region features.
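As a rough illustration of why the finer pyramid levels matter, the sketch below computes the feature-map size at each selected stride for the 640×512 input reported in Section 4.1, and how many feature cells a 10-pixel target would span. The helper names are hypothetical, and the sketch assumes each level's size is simply the input size divided by its stride.

```python
# Strides of the selected pyramid levels, as stated in Section 3.1.
STRIDES = {"P2": 4, "P3": 8, "P4": 16}


def pyramid_shapes(width, height, strides=STRIDES):
    """Return {level: (w, h)}, assuming size = input size // stride (an assumption)."""
    return {name: (width // s, height // s) for name, s in strides.items()}


def cells_covered(target_px, stride):
    """Approximate number of feature cells spanned by a target of `target_px` pixels."""
    return target_px / stride


shapes = pyramid_shapes(640, 512)
for name, stride in STRIDES.items():
    print(name, shapes[name], cells_covered(10, stride))
# P2 (160, 128) 2.5
# P3 (80, 64) 1.25
# P4 (40, 32) 0.625
```

Under these assumptions, a 10-pixel target spans fewer than one cell at stride 16, which is why the stride-4 level carries most of the localization detail for small UAVs.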
3.2 Spatio-Temporally Constrained Dynamic Region Proposal Network (STC-DRPN)
This is the core innovation of our anti-UAV tracker. The STC-DRPN operates in three conceptual steps: Target Location Prediction, Anchor Shape Prediction, and Feature Adaptation.
3.2.1 Target Location Prediction
The goal is to predict a probability map \( F_l \in \mathbb{R}^{W \times H \times 1} \) over the entire search image, indicating where the target is likely to be. We compute this by fusing two cues: Spatial (appearance) and Temporal (motion).
Spatial Location Prediction (\( F_{sl} \)): This captures where in the search image the content looks like the template. We perform a depth-wise cross-correlation between each level of the template features \( F_{z_i} \) and the corresponding search features \( S_i \):
$$ F_{m_i} = F_{z_i} \star S_i $$
where \( \star \) denotes depth-wise cross-correlation. The resulting correlation feature \( F_{m_i} \) is then processed by a small convolutional head \( f_c(\cdot) \) and a sigmoid activation \( \sigma \) to produce a spatial probability map for that level:
$$ F_{sl_i} = \sigma(f_c(F_{m_i})) $$
The maps from different levels are fused (e.g., by averaging) to produce the final spatial location map \( F_{sl} \).
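The spatial branch can be sketched in NumPy as follows. This is a minimal illustration, not the trained network: the head \( f_c \) is replaced by a hypothetical 1×1 convolution (a weighted channel sum), the correlation uses zero padding so that each map keeps the size of its search feature, and levels are fused by nearest-neighbour upsampling followed by averaging. All function names and the random inputs are assumptions made for the example.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def depthwise_xcorr(template, search):
    """Depth-wise cross-correlation: channel c of `template` (C, h, w) slides over
    channel c of `search` (C, H, W) with no channel mixing. Zero padding keeps the
    output at (C, H, W) for odd h and w."""
    _, h, w = template.shape
    padded = np.pad(search, ((0, 0), (h // 2, h // 2), (w // 2, w // 2)))
    windows = sliding_window_view(padded, (h, w), axis=(1, 2))  # (C, H, W, h, w)
    return np.einsum("cijhw,chw->cij", windows, template)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def spatial_map(corr, weights, bias=0.0):
    """Stand-in for f_c followed by sigmoid: a 1x1 convolution over channels."""
    return sigmoid(np.tensordot(weights, corr, axes=1) + bias)


def fuse_levels(maps):
    """Upsample every map to the finest resolution (integer factors) and average."""
    H, W = maps[0].shape
    up = [np.repeat(np.repeat(m, H // m.shape[0], axis=0), W // m.shape[1], axis=1)
          for m in maps]
    return np.mean(up, axis=0)


rng = np.random.default_rng(0)
C = 4
levels = []
for size in (16, 8, 4):  # toy stand-ins for three pyramid levels
    search = rng.normal(size=(C, size, size))
    template = rng.normal(size=(C, 3, 3))
    corr = depthwise_xcorr(template, search)
    levels.append(spatial_map(corr, np.full(C, 1.0 / C)))

F_sl = fuse_levels(levels)
print(F_sl.shape)  # (16, 16)
```

Because each channel is correlated independently, the correlation feature keeps the channel dimension of the backbone, and only the head collapses it into a single-channel probability map.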
Temporal Location Prediction (\( F_{tl} \)): This encodes a motion prior. We employ a Kalman filter to model the UAV’s motion state \( \mathbf{X}_t = [x_c, y_c, \dot{x}_c, \dot{y}_c]^T \). Assuming a constant velocity model, the filter predicts the target’s location in the current frame \( \hat{\mathbf{p}}_t = [\hat{x}_c, \hat{y}_c]^T \) and an associated prediction error covariance \( P_t \). The temporal probability map is constructed as a 2D Gaussian centered at \( \hat{\mathbf{p}}_t \):
$$ F_{tl}(u,v) = \exp\left(-\alpha \cdot \frac{(u - \hat{x}_c)^2 + (v - \hat{y}_c)^2}{\text{tr}(P_t)}\right) $$
where \( (u,v) \) are pixel coordinates and \( \alpha \) is a scaling factor. The spread of the Gaussian grows with the Kalman filter’s uncertainty: its effective variance is \( \text{tr}(P_t)/(2\alpha) \). To handle sudden maneuvers and other events that break the constant-velocity assumption (e.g., fast acceleration, occlusion), we implement a simple maneuver detection mechanism. We treat the peak location of the spatial map \( F_{sl} \) as a pseudo-observation \( \mathbf{o}_t' \). The normalized innovation squared (NIS) \( \epsilon_t \) is then computed from the innovation \( \mathbf{n}_t \) and its covariance \( L_t \), where \( H \) is the observation matrix and \( R \) is the observation noise covariance:
$$ \mathbf{n}_t = \mathbf{o}_t' - H\hat{\mathbf{X}}_t, \quad L_t = HP_tH^T + R, \quad \epsilon_t = \mathbf{n}_t^T L_t^{-1} \mathbf{n}_t $$
If \( \epsilon_t \) exceeds a threshold \( \epsilon_{max} \) (chosen from a chi-squared distribution with two degrees of freedom, matching the 2D position observation), a maneuver is declared. In this case, we set \( F_{tl} \) to a uniform map (a matrix of ones), which disables the motion prior so that \( F_l \) relies solely on appearance.
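The temporal branch and the maneuver gate can be sketched as follows. This is an illustrative reading of the equations above, not the authors' implementation: the noise covariances `Q` and `R`, the unit time step, the value of \( \alpha \), and the 99% chi-squared gate for a 2D observation (about 9.21) are all assumptions chosen for the example. Maps are stored as (height, width) arrays.

```python
import numpy as np

DT = 1.0  # one frame per step (assumption)
F = np.array([[1, 0, DT, 0],
              [0, 1, 0, DT],
              [0, 0, 1, 0],
              [0, 0, 0, 1]], dtype=float)  # constant-velocity transition
H = np.array([[1, 0, 0, 0],
              [0, 1, 0, 0]], dtype=float)  # only the position is observed
EPS_MAX = 9.21  # chi-squared 99% quantile, 2 degrees of freedom (assumed gate)


def predict(x, P, Q):
    """Kalman prediction step: state and error covariance for the current frame."""
    return F @ x, F @ P @ F.T + Q


def temporal_map(x_pred, P_pred, width, height, alpha=1.0):
    """2D Gaussian centred on the predicted position, spread set by tr(P)."""
    u, v = np.meshgrid(np.arange(width), np.arange(height))  # u: column, v: row
    d2 = (u - x_pred[0]) ** 2 + (v - x_pred[1]) ** 2
    return np.exp(-alpha * d2 / np.trace(P_pred))


def nis(obs, x_pred, P_pred, R):
    """Normalized innovation squared of a pseudo-observation."""
    n = obs - H @ x_pred
    L = H @ P_pred @ H.T + R
    return float(n @ np.linalg.solve(L, n))


def gated_temporal_map(obs, x_pred, P_pred, R, width, height, alpha=1.0):
    """Return a map of ones when a maneuver is declared, else the Gaussian prior."""
    if nis(obs, x_pred, P_pred, R) > EPS_MAX:
        return np.ones((height, width))
    return temporal_map(x_pred, P_pred, width, height, alpha)


x = np.array([100.0, 80.0, 5.0, -2.0])  # [x_c, y_c, vx, vy]
P, Q, R = np.eye(4) * 4.0, np.eye(4) * 0.5, np.eye(2) * 2.0
x_pred, P_pred = predict(x, P, Q)

near = np.array([106.0, 77.0])   # appearance peak close to the prediction
far = np.array([140.0, 78.0])    # appearance peak far from the prediction
print(x_pred[:2], nis(near, x_pred, P_pred, R), nis(far, x_pred, P_pred, R))
```

In this toy setting the nearby peak yields a small NIS and keeps the Gaussian prior, while the distant peak exceeds the gate and switches the temporal map to all ones.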
Spatio-Temporal Fusion: The final target location probability map \( F_l \) is obtained by element-wise multiplication (Hadamard product):
$$ F_l = F_{sl} \odot F_{tl} $$
This fusion emphasizes regions that are both appearance-similar and motion-plausible. A probability threshold \( \tau \) (e.g., 0.1) is applied to \( F_l \) to generate a binary mask \( M_{focus} \). Anchor points are only placed at locations where \( M_{focus}(u,v) = 1 \). This creates the dynamic focused search region.
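A small worked example may make the focusing step concrete. The sketch below fuses a toy spatial map containing two appearance peaks with a toy temporal map, thresholds the product, and lists the surviving anchor points. The function names, the 5×5 maps, and the choice of `>=` for the threshold comparison are assumptions made for illustration.

```python
import numpy as np


def focus_mask(F_sl, F_tl, tau=0.1):
    """Hadamard fusion followed by thresholding; returns (F_l, binary mask)."""
    F_l = F_sl * F_tl
    return F_l, F_l >= tau


def anchor_points(mask):
    """Coordinates (u, v) = (column, row) of the locations kept by the mask."""
    rows, cols = np.nonzero(mask)
    return np.stack([cols, rows], axis=1)


# Toy spatial map: the true target at (1, 1) and a distractor at (3, 3).
F_sl = np.full((5, 5), 0.05)
F_sl[1, 1], F_sl[3, 3] = 0.9, 0.8

# Toy temporal map: the motion prior supports only the neighbourhood of (1, 1).
F_tl = np.full((5, 5), 0.1)
F_tl[1, 1], F_tl[1, 2] = 1.0, 0.6

_, mask = focus_mask(F_sl, F_tl)
print(anchor_points(mask).tolist())            # [[1, 1]]

# With the motion prior disabled (map of ones), both peaks survive.
_, mask_uniform = focus_mask(F_sl, np.ones((5, 5)))
print(anchor_points(mask_uniform).tolist())    # [[1, 1], [3, 3]]
```

The second case mirrors the maneuver fallback: without the motion prior, the distractor re-enters the candidate set and must be rejected by the later classification stage.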
3.2.2 Anchor Shape Prediction
Instead of using pre-defined, fixed aspect-ratio anchors, we predict a suitable width \( w \) and height \( h \) for each anchor point \( (u,v) \) within the focused region. We apply a small network \( f_{sh}(\cdot) \) on the correlation feature \( F_m \) to predict log-scale adjustments \( (\Delta w, \Delta h) \):
$$ [\Delta w, \Delta h] = f_{sh}(F_m(u,v)) $$
The absolute anchor dimensions are then recovered using a non-linear transformation for stability:
$$ w = s \cdot \gamma \cdot \exp(\Delta w), \quad h = s \cdot \gamma \cdot \exp(\Delta h) $$
where \( s \) is the total stride of the feature map at that location, and \( \gamma \) is a scale factor (e.g., 8), distinct from the probability threshold \( \tau \) of Section 3.2.1. This yields a set of focused, shape-adaptive anchor boxes \( \mathcal{B}_{anchor} = \{(u_i, v_i, w_i, h_i)\} \).
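The shape recovery is simple arithmetic, sketched below with a hypothetical helper. A zero adjustment gives the base size (stride times scale factor), and the exponential keeps the predicted width and height positive for any network output. The example uses the scale factor of 8 mentioned above and the strides from Section 3.1.

```python
import math


def anchor_shape(dw, dh, stride, scale=8.0):
    """Recover absolute anchor width and height from log-scale adjustments."""
    base = stride * scale
    return base * math.exp(dw), base * math.exp(dh)


print(anchor_shape(0.0, 0.0, stride=4))             # (32.0, 32.0)
print(anchor_shape(math.log(0.5), 0.0, stride=8))   # width halved, height 64.0
```

Under these settings, the stride-4 level produces a 32×32 anchor by default, so anchors for targets of only a few pixels require clearly negative adjustments.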
3.2.3 Feature Adaptation
Since the predicted anchor boxes have varying shapes and sizes, the standard feature extracted at a single point may not align well. We use a lightweight Feature Adaptation Module \( \mathcal{A}(\cdot) \) inspired by Guided Anchoring. It takes the base feature \( f_m \) at location \( (u,v) \) and the predicted anchor shape \( (w, h) \) as input, and outputs a shape-aware adaptive feature \( f_a \):
$$ f_a = \mathcal{A}(f_m, w, h) $$
This is implemented via a deformable convolution whose sampling offsets are modulated by the anchor shape parameters, allowing the receptive field to adapt to the expected target extent.
Finally, the adaptive features \( f_a \) for all anchors in the focused region are fed into standard RPN heads for classification (object vs. background) and bounding box regression (fine-tuning the anchor). The final tracking result is the bounding box with the highest classification score after Non-Maximum Suppression (NMS).
3.3 Loss Function and Training
We train our anti-UAV tracking network end-to-end with a multi-task loss:
$$ \mathcal{L}_{total} = \lambda_1 \mathcal{L}_{loc} + \lambda_2 \mathcal{L}_{shape} + \mathcal{L}_{cls} + \mathcal{L}_{reg} $$
where \( \lambda_1, \lambda_2 \) are balancing weights.
- Location Loss (\( \mathcal{L}_{loc} \)): A Focal Loss applied to the spatial location prediction \( F_{sl} \) to handle pixel-wise imbalance. The ground-truth is a 2D Gaussian centered at the target.
- Shape Loss (\( \mathcal{L}_{shape} \)): A Bounded IoU Loss between the predicted anchor shapes and the shape of the matched ground-truth box.
- Classification Loss (\( \mathcal{L}_{cls} \)): Standard binary cross-entropy loss for anchor classification.
- Regression Loss (\( \mathcal{L}_{reg} \)): Smooth L1 loss for bounding box regression.
The model is first pre-trained on large-scale generic tracking/IR datasets and then fine-tuned on the Anti-UAV training set.
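The loss terms can be sketched in NumPy as below. This is a simplified illustration: the focal loss is the standard binary form with hard 0/1 labels (how the Gaussian ground truth enters is not specified here), the Bounded IoU shape loss is omitted and passed in as a plain number, and the hyperparameters `gamma=2`, `alpha=0.25`, and `beta=1` are common defaults assumed for the example rather than values from this work.

```python
import numpy as np


def binary_cross_entropy(p, y, eps=1e-7):
    """Mean binary cross-entropy between probabilities p and labels y."""
    p = np.clip(p, eps, 1 - eps)
    return float(np.mean(-(y * np.log(p) + (1 - y) * np.log(1 - p))))


def focal_loss(p, y, gamma=2.0, alpha=0.25, eps=1e-7):
    """Binary focal loss: down-weights well-classified (easy) locations."""
    p = np.clip(p, eps, 1 - eps)
    p_t = np.where(y == 1, p, 1 - p)          # probability of the true class
    a_t = np.where(y == 1, alpha, 1 - alpha)  # class-balancing weight
    return float(np.mean(-a_t * (1 - p_t) ** gamma * np.log(p_t)))


def smooth_l1(pred, target, beta=1.0):
    """Smooth L1: quadratic for small errors, linear for large ones."""
    d = np.abs(pred - target)
    return float(np.mean(np.where(d < beta, 0.5 * d ** 2 / beta, d - 0.5 * beta)))


def total_loss(l_loc, l_shape, l_cls, l_reg, lam1=1.0, lam2=1.0):
    """Weighted sum of the four terms, following the multi-task loss above."""
    return lam1 * l_loc + lam2 * l_shape + l_cls + l_reg


p = np.array([0.9, 0.2, 0.6])
y = np.array([1, 0, 1])
l_loc = focal_loss(p, y)
l_cls = binary_cross_entropy(p, y)
l_reg = smooth_l1(np.array([0.5, 2.0]), np.array([0.0, 0.0]))
print(total_loss(l_loc, 0.0, l_cls, l_reg))
```

The focal term is smaller than plain cross-entropy on the same predictions because confident, correct locations contribute almost nothing, which is the behaviour wanted for a map dominated by background pixels.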
4. Experiments and Analysis
4.1 Dataset, Metrics, and Implementation Details
Dataset: We use the challenging Anti-UAV dataset, a large-scale benchmark specifically for IR-based anti-UAV tracking. It contains 318 video sequences (∼1000 frames each) with various challenges like thermal crossover, fast motion, occlusion, and scale variation.
Evaluation Metrics:
- Precision (P): Percentage of frames where the center location error (CLE) is less than a threshold (e.g., 20 pixels). The Area Under the Curve (AUC) for thresholds from 0 to 50 pixels is reported.
- Success Rate (S): Percentage of frames where the Intersection over Union (IoU) with the ground-truth exceeds a threshold. The AUC of the success plot across IoU thresholds from 0 to 1 is reported.
- Average Overlap Accuracy (AOA): A metric for long-term tracking that accounts for both tracking accuracy when the target is visible and the ability to report “not found” when it is absent.
- Frames Per Second (FPS): Tracking speed.
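The precision and success summaries can be computed as in the sketch below. It assumes boxes in (x, y, w, h) format, integer pixel thresholds from 0 to 50 for precision, and 21 evenly spaced IoU thresholds for success; the normalized area under each curve is approximated by the mean over thresholds. These sampling choices and the function names are assumptions, and AOA is not covered because its exact definition is not given here.

```python
import math


def iou(a, b):
    """Intersection over Union of two boxes given as (x, y, w, h)."""
    iw = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def center_error(a, b):
    """Euclidean distance between box centres (the CLE)."""
    return math.hypot((a[0] + a[2] / 2) - (b[0] + b[2] / 2),
                      (a[1] + a[3] / 2) - (b[1] + b[3] / 2))


def precision_auc(errors, max_threshold=50):
    """Mean fraction of frames with CLE below each threshold 0..max_threshold."""
    curve = [sum(e < t for e in errors) / len(errors)
             for t in range(max_threshold + 1)]
    return sum(curve) / len(curve)


def success_auc(ious, n_thresholds=21):
    """Mean fraction of frames with IoU above each threshold in [0, 1]."""
    thresholds = [i / (n_thresholds - 1) for i in range(n_thresholds)]
    curve = [sum(v > t for v in ious) / len(ious) for t in thresholds]
    return sum(curve) / len(curve)


gt = [(10, 10, 20, 20), (12, 11, 20, 20)]
pred = [(11, 10, 20, 20), (30, 40, 20, 20)]   # second frame is a clear miss
errors = [center_error(p, g) for p, g in zip(pred, gt)]
ious = [iou(p, g) for p, g in zip(pred, gt)]
print(precision_auc(errors), success_auc(ious))
```

Averaging over thresholds rewards trackers that are accurate across the whole range rather than at a single operating point such as 20 pixels.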
Implementation: We use ResNet-50-FPN as the backbone. Input size is 640×512. Training uses SGD optimizer. Inference runs on a single GPU.
4.2 Quantitative Results and Comparison
We compare our DRF tracker against several state-of-the-art trackers on the Anti-UAV test set, including GlobalTrack, OSTrack, STARK, SiamRPN++ (long-term version), ATOM, and others. The results are summarized in Table 1.
| Tracker | Precision (AUC) ↑ | Success (AUC) ↑ | AOA ↑ | FPS ↑ |
|---|---|---|---|---|
| DSST | 0.490 | 0.349 | 0.354 | 31.2 |
| SiamFC | 0.510 | 0.369 | 0.375 | 60.2 |
| ECO | 0.618 | 0.437 | 0.444 | 7.5 |
| ATOM | 0.711 | 0.484 | 0.490 | 28.7 |
| SiamRPN++LT | 0.756 | 0.501 | 0.507 | 26.3 |
| STARK | 0.843 | 0.588 | 0.607 | 33.5 |
| CSWinTT | 0.858 | 0.614 | 0.623 | 8.5 |
| OSTrack | 0.871 | 0.638 | 0.647 | 22.6 |
| GlobalTrack | 0.889 | 0.639 | 0.648 | 10.2 |
| Ours (DRF) | 0.895 | 0.646 | 0.656 | 18.5 |
Our method achieves the highest Precision (0.895), Success Rate (0.646), and AOA (0.656) among the compared trackers while running at 18.5 FPS. It outperforms the pure global tracker GlobalTrack on every accuracy metric (by 0.006, 0.007, and 0.008, respectively) and runs faster (18.5 vs. 10.2 FPS), which is consistent with the benefit of dynamic region focusing. The performance gain is particularly notable on challenging attributes such as fast motion and thermal crossover in the attribute-based evaluation (plots not included here).
4.3 Qualitative Analysis
Figure 1 (the system overview) illustrates the dynamic focusing process conceptually. Qualitative results on challenging sequences demonstrate our tracker’s robustness:
Fast Motion & Blur: Our tracker maintains a stable lock even when the UAV moves rapidly between frames, thanks to the motion-informed focusing which predicts a plausible region.
Thermal Crossover & Distractors: When the UAV flies near buildings or other heat sources, the spatial-temporal fusion helps suppress false peaks in the correlation map, preventing the tracker from jumping to distractors. The focused search examines fewer, more relevant candidates.
Occlusion and Re-appearance: During full occlusion, the motion uncertainty grows, and the temporal map becomes more diffuse or uniform. This allows the spatial appearance cue to dominate the probability map \( F_l \). When the target re-appears, even in a different image location, the global nature of the initial probability estimation enables re-acquisition, while the focusing mechanism quickly narrows down the search for precise localization.
4.4 Ablation Studies
We conduct ablation studies to validate the contribution of each component. Results are shown in Table 2.
| Variant | FPN | STC-DRPN | QG-RPN (Global) | Precision | Success | AOA |
|---|---|---|---|---|---|---|
| 1 | | | ✓ | 0.854 | 0.612 | 0.617 |
| 2 | ✓ | | ✓ | 0.877 | 0.632 | 0.638 |
| 3 | | ✓ | | 0.871 | 0.627 | 0.632 |
| 4 (Ours) | ✓ | ✓ | | 0.895 | 0.649 | 0.656 |
Effect of FPN: Comparing variant 1 vs. 2 and 3 vs. 4, adding FPN improves Precision by 0.023 (0.854 to 0.877) with QG-RPN and by 0.024 (0.871 to 0.895) with STC-DRPN, with similar gains in Success and AOA. This supports the importance of multi-scale features for small IR UAVs.
Effect of STC-DRPN: Comparing variant 1 vs. 3 and 2 vs. 4, replacing the standard global region proposal (QG-RPN) with our STC-DRPN raises Precision by 0.017 (0.854 to 0.871) without FPN and by 0.018 (0.877 to 0.895) with FPN. This is consistent with the core hypothesis that dynamic region focusing alleviates negative sample imbalance and enhances discrimination.
Effect of Spatio-Temporal Fusion: We also ablate the components of the location prediction. Using only the spatial map (\(F_{sl}\)) or only the temporal map (\(F_{tl}\)) results in lower performance than their fusion, indicating that appearance and motion cues are complementary for robust anti-UAV tracking.
5. Conclusion
In this paper, we presented a novel Dynamic Region Focusing framework for long-term infrared anti-UAV tracking. The key innovation is the STC-DRPN module, which predicts a target location probability map by fusing appearance correlation and motion prediction. This map guides anchor generation to dynamically focus on high-likelihood regions, effectively bridging the gap between local search precision and global search recall. This focused approach reduces background interference, sharpens feature discrimination against hard distractors, and maintains the ability to re-acquire lost targets. Integrated with an FPN-enhanced Siamese backbone, our tracker achieves state-of-the-art performance on the demanding Anti-UAV benchmark. The method provides a robust, efficient, and practical solution for a critical component in modern anti-UAV defense systems.
Limitations and Future Work: The current method relies on a Kalman filter for motion prediction, which may struggle with highly non-linear UAV dynamics. Future work could explore more advanced motion models or learning-based predictors. Additionally, the performance on extremely small targets (e.g., less than 10 pixels) can be limited by the feature stride. Investigating high-resolution feature representation or dedicated small-target detection heads could further improve capabilities for long-range anti-UAV scenarios.
