The proliferation and misuse of unmanned aerial vehicles (UAVs) have escalated security threats, making the development of robust Anti-Drone technologies a critical imperative. Among various detection and countermeasure systems, infrared (IR) imaging-based tracking offers significant advantages such as compact size, operational stealth, and all-weather, day-and-night capability. Consequently, infrared-based Anti-Drone tracking has become a focal point in this domain. However, tracking small, feature-sparse IR drone targets amidst complex backgrounds, thermal cross-over, fast motion, and frequent disappearance-reappearance scenarios remains a formidable challenge.
To enhance the accuracy and stability of infrared Anti-Drone tracking in complex environments, this paper proposes a long-term infrared object tracking algorithm based on dynamic region focusing. Our method builds on the global Siamese tracking paradigm and incorporates spatio-temporal constraints to guide the network’s focus. The core innovation is the Spatio-Temporal joint Constraints based Dynamic Region Proposal Network (STC-DRPN), which dynamically concentrates the search area around potential target locations, merging the anti-interference capability of local search with the re-detection capability of global search.
Introduction
The rapid advancement of UAV technology has facilitated countless applications but simultaneously introduced severe risks to public safety, privacy, and critical infrastructure. The unauthorized operation of drones near airports, government facilities, or large public gatherings necessitates effective Anti-Drone systems for detection, tracking, and neutralization. Infrared-based tracking is a cornerstone technology in such systems. Unlike visible light sensors, IR cameras capture thermal radiation, making them effective in low-light conditions, against cluttered backgrounds, and for distinguishing targets based on heat signatures. However, tracking drones in IR video sequences presents unique difficulties: (1) IR drones are typically small, lack color and rich texture information, and their heat signature can be weak or similar to background thermal clutter; (2) drones are highly agile, capable of fast and unpredictable movements, and can easily move out of the frame or be temporarily occluded; (3) environmental thermal cross-over from buildings, vehicles, or other heat sources, together with similar flying objects such as birds, produces potent distractors.
Recent years have witnessed the dominance of Siamese network-based trackers in the visual object tracking community due to their favorable balance between accuracy and speed. Pioneering works like SiamFC formulated tracking as a similarity matching problem. Subsequent improvements like SiamRPN introduced region proposal networks for precise localization, and SiamRPN++ enabled the use of deeper backbone networks. However, most of these are short-term trackers operating under a local smoothness assumption, restricting the search to a small neighborhood around the previous target position. This makes them prone to failure when the drone undergoes fast motion or disappears and reappears.
Long-term trackers, designed to handle target loss and re-identification, are more suitable for Anti-Drone tasks. Existing long-term trackers can be categorized into combination methods (e.g., TLD, SPLT) that switch between a local tracker and a global detector, and global tracking methods (e.g., GlobalTrack, SiamRCNN) that search for the target over the entire image. While global trackers offer a simpler, end-to-end architecture, their performance can degrade because the global search introduces excessive background interference, which dilutes feature discriminability, especially against the challenging distractors common in IR scenes.
To address these limitations, we propose a novel framework that adopts a from-global-to-local search strategy. We first construct a Siamese backbone with a Feature Pyramid Network (FPN) to extract robust multi-scale features for small IR drones. Then, we introduce the STC-DRPN. This sub-network leverages both the target’s appearance template and its predicted motion (via a Kalman filter) to generate a spatio-temporal probability map over the whole image. This map predicts where the drone is likely to be. Instead of searching uniformly everywhere, the network uses this map to focus its prior anchor boxes (the fundamental units for prediction) onto high-probability candidate regions. This dynamic focusing mechanism significantly reduces the number of negative background samples during feature learning and inference, thereby enhancing the model’s ability to distinguish the drone from difficult distractors. The final tracking result is obtained by classifying and regressing these focused anchor boxes.

Proposed Algorithm
The overall framework of our dynamic region focusing based Anti-Drone tracker is illustrated in Figure 1. It consists of two main components: a Siamese feature extraction network and the STC-DRPN sub-network.
Siamese Feature Extraction Network
Given the small size and lack of distinctive features of IR drones, single-scale features often fail to encapsulate both semantic and localization information adequately. We construct a Siamese backbone with a Feature Pyramid Network to harness multi-scale feature fusion.
Let $I_z$ denote the template image (usually the first frame) and $I_s$ denote the search image (subsequent frames). A shared backbone network $\phi(\cdot)$ extracts initial features: $\mathbf{z} = \phi(I_z)$ and $\mathbf{s} = \phi(I_s)$. A shared FPN $\varphi(\cdot)$ then processes these features to produce multi-scale template features $\{\mathbf{Z}_i\} = \varphi(\mathbf{z})$ and search features $\{\mathbf{S}_i\} = \varphi(\mathbf{s})$ for $i \in \{2,3,4\}$ (corresponding to P2, P3, P4 levels with strides of 4, 8, 16 pixels respectively, chosen for their higher resolution suitable for small targets). Finally, given the initial target bounding box $\mathbf{B}_1$, we employ ROIAlign to extract multi-scale template target features $\{\mathbf{F}_{t,i}\} = \text{ROIAlign}(\mathbf{Z}_i, \mathbf{B}_1)$.
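To make the stride choice concrete, the following minimal sketch computes the feature-map size at each pyramid level and projects a target box onto it. The 640×512 input size is the one given later under Implementation Details; the drone box and the helper names `feature_map_size` and `box_on_level` are hypothetical, and the sketch assumes the image size divides evenly by the stride.

```python
# Pyramid levels and strides as stated in the text (P2, P3, P4).
STRIDES = {"P2": 4, "P3": 8, "P4": 16}

def feature_map_size(width, height, stride):
    """Return (W, H) of the feature map, assuming sizes divide evenly."""
    return width // stride, height // stride

def box_on_level(box, stride):
    """Scale an image-space (x1, y1, x2, y2) box to feature-map coordinates."""
    return tuple(v / stride for v in box)

# Hypothetical 20x16-pixel drone box in a 640x512 frame.
drone_box = (100, 80, 120, 96)
for name, stride in STRIDES.items():
    x1, y1, x2, y2 = box_on_level(drone_box, stride)
    print(name, feature_map_size(640, 512, stride), "cells:", x2 - x1, "x", y2 - y1)
```

Under these assumptions the box spans 5×4 cells at P2 but only 1.25×1 cells at P4, which illustrates why the higher-resolution levels matter for small targets.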
Spatio-Temporal Joint Constraints based Dynamic Region Proposal Network (STC-DRPN)
The STC-DRPN is the core of our Anti-Drone tracker. It dynamically adjusts the search region by predicting where the target is likely to be, based on appearance and motion, rather than using a fixed local window or the entire image uniformly.
1. Target Location Prediction
This module predicts a probability map $\mathbf{F}_l \in \mathbb{R}^{H \times W \times 1}$ indicating the likelihood of the drone’s presence at every spatial location in the search image. The prediction is a fusion of spatial and temporal cues.
Spatial Location Prediction: This cue relies on appearance similarity. We use a depth-wise cross-correlation to match the template features $\mathbf{F}_t$ with the search features $\mathbf{S}$:
$$\mathbf{F}_m = \mathbf{F}_t \star \mathbf{S}$$
where $\star$ denotes the depth-wise cross-correlation operation. A $1\times1$ convolution $f_c$ followed by a sigmoid activation $\sigma$ generates the spatial location map:
$$\mathbf{F}_{sl} = \sigma(f_c(\mathbf{F}_m))$$
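The two equations above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the paper's implementation: features are nested lists of shape C×H×W, the correlation is "valid" (no padding, so the output is smaller than the search feature), and the $1\times1$ convolution is reduced to one weight per channel plus a bias. The function names and the toy values are hypothetical.

```python
import math

def depthwise_xcorr(search, template):
    """Depth-wise cross-correlation: channel c of the search feature is
    correlated only with channel c of the template (no channel mixing)."""
    out = []
    for s_ch, t_ch in zip(search, template):
        H, W = len(s_ch), len(s_ch[0])
        h, w = len(t_ch), len(t_ch[0])
        out.append([[sum(s_ch[y + dy][x + dx] * t_ch[dy][dx]
                         for dy in range(h) for dx in range(w))
                     for x in range(W - w + 1)]
                    for y in range(H - h + 1)])
    return out

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def spatial_location_map(f_m, weights, bias=0.0):
    """1x1 convolution across channels (weights + bias), then a sigmoid."""
    rows, cols = len(f_m[0]), len(f_m[0][0])
    return [[sigmoid(bias + sum(wc * f_m[c][y][x] for c, wc in enumerate(weights)))
             for x in range(cols)]
            for y in range(rows)]

# Toy single-channel example: a 2x2 warm blob in a 4x4 search feature.
search = [[[0, 0, 0, 0],
           [0, 0, 1, 1],
           [0, 0, 1, 1],
           [0, 0, 0, 0]]]
template = [[[1, 1],
             [1, 1]]]
f_m = depthwise_xcorr(search, template)
f_sl = spatial_location_map(f_m, weights=[1.0], bias=-2.0)
print(f_m[0])   # response peaks where the template overlaps the blob
print(f_sl)
```

In practice this would be computed with a tensor library and padded so that the map matches the search feature resolution; the loop form is used here only to make the per-channel matching explicit.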
Temporal Location Prediction: This cue leverages motion continuity. We employ a Kalman filter to estimate the drone’s state $\mathbf{X} = [x_c, y_c, \dot{x}_c, \dot{y}_c]^T$. The state prediction $\hat{\mathbf{X}}_t$ and its associated error covariance $\mathbf{P}_t$ are obtained from the standard Kalman predict equations:
$$\hat{\mathbf{X}}_t = \mathbf{A} \mathbf{X}_{t-1}$$
$$\mathbf{P}_t = \mathbf{A} \mathbf{P}_{t-1} \mathbf{A}^T + \mathbf{Q}$$
where $\mathbf{A}$ is the state transition matrix (constant velocity model), and $\mathbf{Q}$ is the process noise covariance. The predicted position $(x_p, y_p)$ is Gaussianized onto a map $\mathbf{F}_{tl}$, where the value at pixel $\mathbf{m}_i$ is:
$$f_{tl}^i = \exp\left(-\alpha \|\mathbf{m}_i - [x_p, y_p]^T\|^2 / p_t\right)$$
Here, $\alpha$ is a scaling factor (set to 5), and $p_t$ is related to the position prediction variance. To handle abrupt motion, we propose a pseudo-innovation based maneuver detection. The pseudo-innovation $\mathbf{n}_t$ is the difference between the Kalman prediction and the peak location from $\mathbf{F}_{sl}$. If the normalized innovation squared $\epsilon_t = \mathbf{n}_t^T \mathbf{L}_t^{-1} \mathbf{n}_t$ (where $\mathbf{L}_t$ is the innovation covariance) exceeds a threshold $\epsilon_{max}$ (from a $\chi^2$ distribution), we judge the target motion to be erratic and set $\mathbf{F}_{tl}$ to a uniform map (all ones), effectively disabling the temporal constraint for that frame.
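The temporal branch can be sketched as below. Several details are assumptions made only for illustration, since the text does not fix them: the time step is 1 frame, $p_t$ is taken as the sum of the two predicted position variances, the measurement noise $\mathbf{R}$ is supplied by the caller, and $\epsilon_{max} = 5.991$ is the 95% quantile of a $\chi^2$ distribution with 2 degrees of freedom. All function names are hypothetical.

```python
import math

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(r) for r in zip(*a)]

def kalman_predict(x, P, Q, dt=1.0):
    """Constant-velocity predict step for the state [xc, yc, vx, vy]."""
    A = [[1, 0, dt, 0],
         [0, 1, 0, dt],
         [0, 0, 1, 0],
         [0, 0, 0, 1]]
    x_pred = [sum(A[i][k] * x[k] for k in range(4)) for i in range(4)]
    APAt = matmul(matmul(A, P), transpose(A))
    P_pred = [[APAt[i][j] + Q[i][j] for j in range(4)] for i in range(4)]
    return x_pred, P_pred

def is_maneuver(x_pred, P_pred, peak_xy, R, eps_max=5.991):
    """Normalized innovation squared test on the pseudo-innovation."""
    n0, n1 = peak_xy[0] - x_pred[0], peak_xy[1] - x_pred[1]
    # Innovation covariance L = H P H^T + R, with H selecting the position block.
    a, b = P_pred[0][0] + R[0][0], P_pred[0][1] + R[0][1]
    c, d = P_pred[1][0] + R[1][0], P_pred[1][1] + R[1][1]
    det = a * d - b * c
    eps = (d * n0 * n0 - (b + c) * n0 * n1 + a * n1 * n1) / det  # n^T L^-1 n
    return eps > eps_max, eps

def temporal_map(x_pred, P_pred, peak_xy, R, width, height, alpha=5.0):
    """Gaussian map around the predicted position, or all ones on a maneuver."""
    maneuver, _ = is_maneuver(x_pred, P_pred, peak_xy, R)
    if maneuver:
        return [[1.0] * width for _ in range(height)]
    p_t = P_pred[0][0] + P_pred[1][1]  # assumption: trace of the position block
    xp, yp = x_pred[0], x_pred[1]
    return [[math.exp(-alpha * ((x - xp) ** 2 + (y - yp) ** 2) / p_t)
             for x in range(width)] for y in range(height)]

# Toy example: unit covariance, small process noise, target moving right.
P0 = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
Q0 = [[0.01 if i == j else 0.0 for j in range(4)] for i in range(4)]
R0 = [[1.0, 0.0], [0.0, 1.0]]
x_pred, P_pred = kalman_predict([10.0, 10.0, 2.0, 1.0], P0, Q0)
print(x_pred, is_maneuver(x_pred, P_pred, (13.0, 11.0), R0))
```

With these toy values, a spatial peak one pixel away from the prediction keeps the Gaussian constraint active, whereas a peak 18 pixels away exceeds the threshold and the map falls back to all ones.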
Spatio-Temporal Fusion: The final target location probability map is obtained via element-wise multiplication (Hadamard product):
$$\mathbf{F}_l = \mathbf{F}_{sl} \odot \mathbf{F}_{tl}$$
This map selectively highlights regions that are both visually similar to the template and consistent with the expected motion. A threshold is applied to $\mathbf{F}_l$ to obtain a binary mask defining the dynamic search region. Anchor points are placed only within this high-probability region.
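A minimal sketch of the fusion and anchor selection step follows. The threshold value of 0.5 is an assumption for illustration (the text does not state it), and `fuse_and_select` is a hypothetical name.

```python
def fuse_and_select(f_sl, f_tl, threshold=0.5):
    """Hadamard product of the two maps, then keep locations above threshold."""
    f_l = [[s * t for s, t in zip(row_s, row_t)]
           for row_s, row_t in zip(f_sl, f_tl)]
    anchors = [(x, y) for y, row in enumerate(f_l)
               for x, v in enumerate(row) if v > threshold]
    return f_l, anchors

# Toy 2x2 maps: location (1, 0) looks similar but contradicts the motion
# model, and location (0, 1) fits the motion model but does not look similar.
f_sl = [[0.9, 0.8],
        [0.1, 0.7]]
f_tl = [[1.0, 0.2],
        [1.0, 0.9]]
f_l, anchors = fuse_and_select(f_sl, f_tl)
print(anchors)
```

Only locations supported by both cues survive, which is how the product suppresses look-alike distractors far from the predicted trajectory.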
2. Anchor Shape Prediction
At each selected anchor point, we predict an optimal anchor box shape (width $w$, height $h$) rather than using pre-defined, fixed aspect ratios. This improves adaptation to the varying size and aspect ratio of drones. The shape is predicted from the feature $\mathbf{F}_m$ through a small sub-network. Following Guided Anchoring, we predict transformed parameters $(dw, dh)$ for stability:
$$w = s \cdot \tau \cdot e^{dw}, \quad h = s \cdot \tau \cdot e^{dh}$$
where $s$ is the stride of the feature map and $\tau$ is a scale factor (set to 8).
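As a small worked example of this decoding, the sketch below maps raw network outputs $(dw, dh)$ to anchor sizes in pixels. The function name `decode_anchor_shape` and the sample inputs are hypothetical; $\tau = 8$ follows the text.

```python
import math

def decode_anchor_shape(dw, dh, stride, tau=8.0):
    """w = s * tau * exp(dw), h = s * tau * exp(dh)."""
    return stride * tau * math.exp(dw), stride * tau * math.exp(dh)

# Zero outputs give the base size s * tau; negative outputs shrink the anchor.
print(decode_anchor_shape(0.0, 0.0, stride=4))            # base anchor at P2
print(decode_anchor_shape(math.log(0.5), 0.0, stride=4))  # half-width anchor
```

The exponential keeps widths and heights positive and lets a bounded output range cover a wide span of sizes, which is the stability argument behind predicting $(dw, dh)$ rather than $(w, h)$ directly.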
3. Feature Adaptive Transformation
Since anchor boxes now have varying shapes, a standard convolutional feature extraction over a fixed grid may not align well. We employ a feature adaptation module $\mathcal{G}$ based on deformable convolution. For an anchor at location $i$ with predicted shape $(w_i, h_i)$, the module generates an offset field $\mathbf{f}_{off}^i$ and uses it to deform the sampling grid of a convolution applied to the local feature $\mathbf{f}_m^i$, producing a shape-adaptive feature $\mathbf{f}_a^i$:
$$\mathbf{f}_a^i = \mathcal{G}(\mathbf{f}_m^i, w_i, h_i)$$
These adaptive features $\{\mathbf{f}_a^i\}$ are then fed into the final RPN heads (classification and regression). Crucially, we use masked convolutions that operate only on features corresponding to the focused anchor points, ignoring irrelevant background areas. This further reduces computational cost and negative sample interference.
Loss Function
The entire network is trained end-to-end with a multi-task loss:
$$\mathcal{L}_{total} = \mathcal{L}_{loc} + \lambda_1 \mathcal{L}_{shape} + \lambda_2 (\mathcal{L}_{cls} + \mathcal{L}_{reg})$$
where $\lambda_1=1$ and $\lambda_2=0.1$ are balancing weights.
- $\mathcal{L}_{loc}$: A Focal Loss applied to the spatial location prediction $\mathbf{F}_{sl}$ to handle foreground-background pixel imbalance.
$$\mathcal{L}_{loc} = -\frac{1}{N_p} \sum_i [\beta (1-p_i)^\gamma p_i^* \log(p_i) + (1-\beta) (p_i)^\gamma (1-p_i^*) \log(1-p_i)]$$
- $\mathcal{L}_{shape}$: A Bounded IoU Loss for anchor shape prediction, encouraging predicted $(w,h)$ to match the ground-truth $(w^*, h^*)$ of the assigned anchor.
$$\mathcal{L}_{shape} = \mathcal{L}_1\left(1 - \frac{\min(w, w^*)}{\max(w, w^*)}\right) + \mathcal{L}_1\left(1 - \frac{\min(h, h^*)}{\max(h, h^*)}\right)$$
where $\mathcal{L}_1$ is the Smooth L1 loss.
- $\mathcal{L}_{cls}$: Binary cross-entropy loss for anchor box classification.
- $\mathcal{L}_{reg}$: Smooth L1 loss for bounding box regression.
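The location and shape terms can be sketched in plain Python as below. The values $\beta = 0.25$ and $\gamma = 2$ are commonly used Focal Loss defaults and are assumptions here, as is reading $N_p$ as the number of positive pixels and using a Smooth L1 transition point of 1; the text does not specify any of these. Function names are hypothetical.

```python
import math

def focal_loss(probs, labels, beta=0.25, gamma=2.0, eps=1e-12):
    """L_loc over flattened pixel probabilities p_i with binary labels p_i*."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += beta * (1.0 - p) ** gamma * y * math.log(p)
        total += (1.0 - beta) * p ** gamma * (1 - y) * math.log(1.0 - p)
    n_pos = max(1, sum(1 for y in labels if y == 1))  # assumed meaning of N_p
    return -total / n_pos

def smooth_l1(x, delta=1.0):
    ax = abs(x)
    return 0.5 * ax * ax / delta if ax < delta else ax - 0.5 * delta

def bounded_iou_shape_loss(w, h, w_gt, h_gt):
    """L_shape: Smooth L1 on 1 - min/max for width and for height."""
    return (smooth_l1(1.0 - min(w, w_gt) / max(w, w_gt))
            + smooth_l1(1.0 - min(h, h_gt) / max(h, h_gt)))

# A confident correct pixel is penalized far less than a confident wrong one.
print(focal_loss([0.9], [1]), focal_loss([0.1], [1]))
# A perfect shape has zero loss; a 20% width mismatch gives 0.5 * 0.2**2.
print(bounded_iou_shape_loss(10, 10, 10, 10), bounded_iou_shape_loss(8, 10, 10, 10))
```

The $(1-p_i)^\gamma$ factor down-weights easy pixels, which is what lets the few drone pixels dominate the gradient despite the large background area.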
Tracking Pipeline
The complete tracking workflow for an incoming image sequence $\{I_1, I_2, …, I_L\}$ with initial box $\mathbf{B}_1$ is summarized in the algorithm below.
1: Extract multi-scale template target features {F_t} from I_1 using B_1.
2: for frame f = 2 to L do
3: Extract multi-scale search features {S} from I_f.
4: Compute spatial map F_sl via depth-wise cross-correlation of {F_t} and {S}.
5: Predict temporal map F_tl using Kalman filter and maneuver detection.
6: Fuse: F_l = F_sl ⊙ F_tl.
7: Place anchor points only where F_l > threshold.
8: Predict shape (w_i, h_i) for each anchor point i.
9: Generate adaptive feature f_a^i for each anchor using G.
10: Classify and regress each anchor based on f_a^i.
11: Select the candidate with highest score after NMS as result B_f.
12: Update Kalman filter state with B_f.
13: end for
Experiments and Analysis
Dataset and Evaluation Metrics
We evaluate our method on the challenging Anti-UAV dataset, a large-scale benchmark specifically for vision-based UAV tracking. It contains 318 high-quality IR video sequences (approx. 1000 frames each) captured in diverse real-world dynamic environments, featuring challenges like thermal crossover, occlusion, fast motion, scale variation, and similar distractors.
We use the One-Pass Evaluation (OPE) protocol. Key metrics include:
- Precision: Percentage of frames where the center location error (CLE) between predicted and ground-truth box is below a threshold (20 pixels). Plot shows precision vs. threshold.
- Success Rate: Percentage of frames where the intersection-over-union (IoU) between predicted and ground-truth box is above a threshold. Plot shows success rate vs. threshold. The Area Under Curve (AUC) is reported.
- Average Accuracy (Avg. Acc): A metric for long-term tracking defined in the Anti-UAV benchmark, considering both tracking accuracy and the ability to report “target lost”.
- FPS: Frames processed per second, indicating speed.
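The Precision and Success Rate metrics can be sketched as follows. This is an illustrative computation, not the benchmark toolkit: boxes are assumed to be (x, y, w, h) with a top-left origin, and the AUC is approximated as the mean success rate over 21 evenly spaced IoU thresholds in [0, 1], a common convention that the text does not confirm. The benchmark-specific Average Accuracy is not reproduced here.

```python
import math

def iou(a, b):
    """IoU of two (x, y, w, h) boxes, (x, y) being the top-left corner."""
    ix = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def center_error(a, b):
    """Euclidean distance between box centers, in pixels."""
    return math.hypot((a[0] + a[2] / 2) - (b[0] + b[2] / 2),
                      (a[1] + a[3] / 2) - (b[1] + b[3] / 2))

def precision(preds, gts, threshold=20.0):
    """Fraction of frames with center location error below the threshold."""
    return sum(center_error(p, g) < threshold for p, g in zip(preds, gts)) / len(gts)

def success_auc(preds, gts, num_thresholds=21):
    """Mean success rate over evenly spaced IoU thresholds in [0, 1]."""
    ious = [iou(p, g) for p, g in zip(preds, gts)]
    thresholds = [i / (num_thresholds - 1) for i in range(num_thresholds)]
    rates = [sum(v > t for v in ious) / len(ious) for t in thresholds]
    return sum(rates) / len(rates)

# Two frames: one perfect prediction, one complete miss.
gts = [(0, 0, 10, 10), (0, 0, 10, 10)]
preds = [(0, 0, 10, 10), (100, 100, 10, 10)]
print(precision(preds, gts), success_auc(preds, gts))
```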
Implementation Details
Our model uses ResNet-50 with FPN as the backbone, initialized with COCO pre-trained weights. It is first pre-trained on the LSOTB-TIR dataset and then fine-tuned on the Anti-UAV training set. We use SGD optimizer with momentum 0.95 and weight decay 1e-5. Training is conducted for 32 epochs with an initial learning rate of 1e-3, reduced by a factor of 10 at epochs 16 and 28. Input images are resized to 640×512. Testing is performed on a system with an Intel i9-9940X CPU and NVIDIA RTX 2080Ti GPUs.
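The step learning-rate schedule described above can be written as a short function. The sketch assumes zero-based epoch indexing with the drop taking effect at the start of epochs 16 and 28; the text does not say which indexing convention is used, and `learning_rate` is a hypothetical name.

```python
from bisect import bisect_right

def learning_rate(epoch, base_lr=1e-3, milestones=(16, 28), gamma=0.1):
    """Base rate multiplied by gamma once for every milestone already reached."""
    return base_lr * gamma ** bisect_right(milestones, epoch)

for epoch in (0, 15, 16, 27, 28, 31):
    print(epoch, learning_rate(epoch))
```

Under this assumption the rate is 1e-3 for epochs 0 to 15, 1e-4 for epochs 16 to 27, and 1e-5 for the remaining four epochs.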
Quantitative Comparison and Analysis
We compare our proposed tracker against nine state-of-the-art (SOTA) trackers, including GlobalTrack, OSTrack, STARK, SiamRPN++LT, ATOM, and others. All competitors are retrained/fine-tuned on the Anti-UAV training set for a fair comparison.
Overall Performance: The precision and success plots on the Anti-UAV test set are shown in Figure 2. Our algorithm achieves the highest precision and success rate across most thresholds. The quantitative results are summarized in Table 1.
| Tracker | Precision↑ | Success Rate (AUC)↑ | Average Accuracy↑ | FPS↑ |
|---|---|---|---|---|
| DSST | 0.490 | 0.349 | 0.354 | 31.2 |
| SiamFC | 0.510 | 0.369 | 0.375 | 60.2 |
| ECO | 0.618 | 0.437 | 0.444 | 7.5 |
| ATOM | 0.711 | 0.484 | 0.490 | 28.7 |
| SiamRPN++LT | 0.756 | 0.501 | 0.507 | 26.3 |
| STARK | 0.843 | 0.588 | 0.607 | 33.5 |
| CSWinTT | 0.858 | 0.614 | 0.623 | 8.5 |
| OSTrack | 0.871 | 0.638 | 0.647 | 22.6 |
| GlobalTrack | 0.889 | 0.639 | 0.648 | 10.2 |
| Ours | 0.895 | 0.649 | 0.656 | 18.5 |
Our tracker achieves the best Precision (89.5%), Success Rate (64.9%), and Average Accuracy (65.6%), while maintaining a real-time speed of 18.5 FPS. Compared with the strong baseline GlobalTrack, it gains 0.6, 1.0, and 0.8 percentage points on these three metrics respectively while running about 1.8 times as fast (18.5 vs. 10.2 FPS), demonstrating the effectiveness of our dynamic region focusing mechanism for Anti-Drone tracking.
Attribute-Based Performance: We further analyze performance under specific challenge attributes defined in the Anti-UAV benchmark: Out-of-View (OV), Occlusion (OC), Fast Motion (FM), Scale Variation (SV), Thermal Crossover (TC), and Low Resolution (LR). The success plots for these attributes are shown in Figure 3. Our method consistently ranks first or among the top performers across all challenges, particularly excelling in SV, TC, and LR scenarios. This validates the robustness of our FPN-based multi-scale features and the STC-DRPN’s ability to suppress background and thermal distractors.
Ablation Studies
To validate the contribution of each component, we conduct ablation experiments on the Anti-UAV test set. The results are shown in Table 2 and Table 3.
| Variant | FPN | STC-DRPN | QG-RPN | Precision↑ | Success Rate↑ | Avg. Acc↑ |
|---|---|---|---|---|---|---|
| 1 | | | √ | 0.854 | 0.612 | 0.617 |
| 2 | | √ | | 0.877 | 0.632 | 0.638 |
| 3 | √ | | √ | 0.871 | 0.627 | 0.632 |
| 4 (Ours) | √ | √ | | 0.895 | 0.649 | 0.656 |
Table 2 shows that replacing the standard region proposal network (QG-RPN) with our STC-DRPN brings gains of 2.0 to 2.3 percentage points across the three metrics (Variant 2 vs. Variant 1). Adding the FPN also provides a consistent boost, confirming the benefit of multi-scale feature fusion for small IR drone targets.
| Spatial | Temporal | Precision↑ | Success Rate↑ | Avg. Acc↑ |
|---|---|---|---|---|
| √ | | 0.875 | 0.632 | 0.638 |
| | √ | 0.713 | 0.474 | 0.485 |
| √ | √ | 0.895 | 0.649 | 0.656 |
Table 3 demonstrates the importance of fusing both spatial (appearance) and temporal (motion) cues. Using only temporal prediction performs poorly, as it’s essentially a local method. Using only spatial prediction is better but still inferior to the joint spatio-temporal fusion, which achieves the best performance by synergistically combining global re-detection capability with motion-guided local focusing.
Conclusion
In this paper, we have presented a novel dynamic region focusing based infrared long-term object tracking algorithm specifically designed for challenging Anti-Drone applications. The core of our method is the Spatio-Temporal joint Constraints based Dynamic Region Proposal Network (STC-DRPN), which intelligently integrates target appearance and motion information to predict a focused search area. This mechanism effectively reduces background clutter interference inherent in global search while retaining the crucial ability to re-detect lost targets. Coupled with a Feature Pyramid Network for robust multi-scale feature extraction, our tracker demonstrates superior performance on the demanding Anti-UAV benchmark. It achieves state-of-the-art results in terms of precision (89.5%), success rate (64.9%), and average accuracy (65.6%) at a real-time speed of 18.5 FPS. The algorithm shows remarkable robustness in handling fast motion, thermal crossover, occlusion, and similar distractor challenges commonly encountered in real-world Anti-Drone scenarios.
One limitation is the performance on extremely small drones (e.g., less than 7×7 pixels), as the feature stride in our FPN may cause the target signal to be diluted. Future work will focus on designing resolution-preserving feature networks and incorporating small-target enhancement modules to further push the boundaries of IR-based Anti-Drone tracking in all operational conditions.
