In recent years, the widespread use of unmanned aerial vehicles (UAVs), particularly in China, has posed significant challenges to public and aerial security. Infrared imaging technology enables long-distance UAV monitoring under both day and night conditions, making infrared-based counter-UAV systems a critical area of focus. Target tracking of UAVs in infrared imagery is a key perceptual technology for such systems, requiring accurate and effective tracking of drone targets. However, in practical scenarios, factors such as complex backgrounds, target deformation, and camera motion present substantial obstacles to robust infrared UAV target tracking. Existing methods, especially those based on Siamese networks, have made strides in balancing accuracy and efficiency, but they often overlook the effective utilization of low-level features, leading to the loss of spatial details and semantic information. This limitation severely impacts tracking performance, especially for small or dim UAV targets in cluttered environments.
To address these issues, we propose a novel infrared UAV target tracking method that integrates spatial and semantic perception. Our approach leverages a spatial-semantic combined attention mechanism to enhance feature representation, ensuring that both contextual spatial information and critical semantic cues are preserved during multi-scale feature fusion. This is particularly vital for tracking UAV targets in diverse operational scenarios, where targets may be small, fast-moving, or embedded in complex backgrounds such as urban skies or natural landscapes. The core of our method lies in two innovative modules: the Spatial-Semantic Combine Attention Module (SCAM) and the Dual-branch global Feature interaction Module (DFM). These components work synergistically to improve the network's ability to locate and track infrared UAV targets with high precision.

The tracking framework is built upon a Siamese network architecture, utilizing a shared backbone network—ResNet-50—for feature extraction from template and search images. The template image represents the reference of the target UAV, while the search image is the current frame in which the target must be located. After extracting multi-scale features, we incorporate a Feature Pyramid Network (FPN) enhanced with our proposed SCAM to fuse features across different levels effectively. Subsequently, the DFM facilitates global interaction between template and search branch features, and a detection head produces the final tracking results. This design aims to overcome the common pitfalls of information loss in traditional FPNs, where channel reduction operations often discard valuable spatial and semantic details.
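The data flow described above can be sketched as follows. This is a shapes-only illustration with stub functions standing in for the real backbone, SCAM-enhanced FPN, DFM, and detection head; the feature dimensions and crop sizes are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

# Shapes-only sketch of the Siamese tracking pipeline (stubs, not the real model).

def backbone(img):            # shared ResNet-50 stand-in: multi-scale feature maps
    c, h, w = img.shape
    return [np.zeros((256, h // s, w // s)) for s in (8, 16, 32)]

def fpn_with_scam(feats):     # FPN fusion enhanced by SCAM (stand-in)
    return feats              # same number of pyramid levels after fusion

def dfm(z_feats, x_feats):    # global template/search feature interaction (stand-in)
    return x_feats

def head(feats):              # anchor-free head: one prediction grid per level
    return [(f.shape[1], f.shape[2]) for f in feats]

template = np.zeros((3, 128, 128))   # reference crop of the target UAV
search   = np.zeros((3, 256, 256))   # current frame region to be searched

z = backbone(template)
x = backbone(search)
out = head(dfm(fpn_with_scam(z), fpn_with_scam(x)))
print(out)   # [(32, 32), (16, 16), (8, 8)]
```

Each pyramid level of the search branch yields a prediction grid whose stride matches the corresponding backbone stage.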
The Spatial-Semantic Combine Attention Module (SCAM) is central to our method. It consists of two sub-modules: the Spatial Multi-scale Attention module (SMA) and the Global-local Channel Semantic Attention module (GCSA). The SMA captures long-range dependencies across spatial dimensions by employing axial positional embedding and multi-branch grouped feature extraction. Specifically, given an input feature map $X \in \mathbb{R}^{C \times H \times W}$, we compute one-dimensional encodings along the height and width dimensions via global average pooling:
$$z_h = \text{GAP}_h(X) \in \mathbb{R}^{C \times H}, \quad z_w = \text{GAP}_w(X) \in \mathbb{R}^{C \times W}$$
These encodings are concatenated to form an intermediate feature $f = [z_h, z_w]$, where $[\cdot, \cdot]$ denotes concatenation along the spatial dimension, resulting in $f \in \mathbb{R}^{C \times (H+W)}$. This feature is then split into $K$ groups (with $K=4$ in our experiments) to enable multi-scale processing. Each group undergoes 1D depthwise convolutions with kernel sizes varying to capture different scales:
$$g_i = \text{DWConv1d}_{k_i}(f_i), \quad i=1,2,\ldots,K$$
where $k_i = 2i + 1$, i.e., kernels of sizes 3, 5, 7, and 9. The outputs are then normalized using group normalization and activated via a sigmoid function to generate spatial attention maps, which are applied to the input features to emphasize important spatial contexts. This process enhances the model's ability to perceive global contextual information, which is crucial for tracking UAV targets across varying scales and distances.
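The SMA computation can be sketched in numpy as below. This is a simplified stand-in, not the paper's implementation: the axial positional embedding is omitted, the depthwise kernels are random placeholders, and group normalization is approximated by a single global normalization.

```python
import numpy as np

rng = np.random.default_rng(0)

def sma(X, K=4):
    """Simplified sketch of the Spatial Multi-scale Attention (SMA) module."""
    C, H, W = X.shape
    z_h = X.mean(axis=2)                     # GAP over width  -> (C, H)
    z_w = X.mean(axis=1)                     # GAP over height -> (C, W)
    f = np.concatenate([z_h, z_w], axis=1)   # f = [z_h, z_w]  -> (C, H + W)

    groups = np.split(f, K, axis=0)          # K channel groups of C // K each
    outs = []
    for i, g in enumerate(groups):
        k = 2 * (i + 1) + 1                  # kernel sizes 3, 5, 7, 9
        w = rng.standard_normal(k) / k       # placeholder depthwise kernel
        outs.append(np.stack([np.convolve(row, w, mode="same") for row in g]))
    a = np.concatenate(outs, axis=0)         # (C, H + W)

    # normalize + sigmoid -> attention, then split back into axial maps
    a = (a - a.mean()) / (a.std() + 1e-5)
    a = 1.0 / (1.0 + np.exp(-a))
    a_h, a_w = a[:, :H], a[:, H:]            # (C, H) and (C, W)
    return X * a_h[:, :, None] * a_w[:, None, :]

X = rng.standard_normal((8, 16, 16))
Y = sma(X)
print(Y.shape)   # (8, 16, 16)
```

The grouped 1D convolutions keep the cost linear in $H + W$ while still mixing context at four receptive-field scales.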
The GCSA module complements the SMA by focusing on channel-wise semantic information. It integrates both global and local channel features to allocate attention weights more accurately. For an input feature $F \in \mathbb{R}^{C \times H \times W}$, we first obtain a channel descriptor via global average pooling: $U = \text{GAP}(F) \in \mathbb{R}^{C \times 1 \times 1}$. Then, we compute global and local features using a $1 \times 1$ convolution and a banded matrix $B$, respectively:
$$u_g = \text{Conv}_{1\times1}(U), \quad u_l = \sum_{i=1}^{K} U \odot b_i$$
where $B = [b_1, b_2, \ldots, b_K]$ with $K=5$ defining the number of adjacent channels for local interaction. These features are further processed through covariance calculations and row convolutions to generate refined channel attention maps. The final output is obtained by multiplying the input feature with the combined attention map, effectively suppressing irrelevant background noise and highlighting semantic features of the UAV target. This dual-branch design ensures that both broad and fine-grained channel information are considered, improving the discriminative power of the features for infrared UAV tracking.
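A minimal numpy sketch of the GCSA's dual-branch channel attention follows. The covariance and row-convolution refinement described above is simplified to a direct sum of the two branches, and the projection weights are random stand-ins rather than learned parameters.

```python
import numpy as np

rng = np.random.default_rng(1)

def gcsa(F, K=5):
    """Simplified sketch of the Global-local Channel Semantic Attention (GCSA)."""
    C, H, W = F.shape
    U = F.mean(axis=(1, 2))                 # channel descriptor via GAP -> (C,)

    Wg = rng.standard_normal((C, C)) / C    # 1x1 conv == dense map over channels
    u_g = Wg @ U                            # global branch -> (C,)

    # local branch: banded matrix B mixes each channel with its K neighbours
    B = np.zeros((C, C))
    for i in range(C):
        lo, hi = max(0, i - K // 2), min(C, i + K // 2 + 1)
        B[i, lo:hi] = 1.0 / (hi - lo)
    u_l = B @ U                             # local interaction -> (C,)

    a = 1.0 / (1.0 + np.exp(-(u_g + u_l)))  # combined channel attention weights
    return F * a[:, None, None]             # reweight input channels

F = rng.standard_normal((16, 8, 8))
out = gcsa(F)
print(out.shape)   # (16, 8, 8)
```

The banded matrix restricts each channel's local interaction to its $K$ nearest neighbours, while the dense branch captures global cross-channel dependencies.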
The Dual-branch global Feature interaction Module (DFM) is designed to enhance the interaction between template and search branch features. Traditional methods rely on local correlation operations, which may lead to blurred target boundaries and reduced localization accuracy. Our DFM employs global cross-attention, treating template features as queries to attend to search features. Given multi-scale fused features $F_z^{(k)}$ and $F_x^{(k)}$ from the template and search branches, respectively, we first apply layer normalization:
$$I_z^{(k)} = \text{LN}(F_z^{(k)}), \quad I_x^{(k)} = \text{LN}(F_x^{(k)})$$
These are then projected into query ($Q$), key ($K$), and value ($V$) vectors using linear layers. To maintain computational efficiency, we adopt linear attention, which reduces the quadratic complexity of standard self-attention. The interaction is computed as:
$$\tilde{I}_x^{(k)} = \text{sim}(Q)\left(\text{sim}(K)^{\top} V\right)$$
where $\text{sim}(\cdot)$ denotes cosine-similarity normalization of the query and key vectors. A residual connection and feed-forward network are applied to obtain the final interacted feature $Y_x^{(k)}$. This global interaction allows the network to leverage comprehensive target information from the template, improving tracking robustness, especially when the UAV target undergoes rapid movement or partial occlusion.
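The linear cross-attention at the heart of the DFM can be sketched as below. This is a hedged simplification: the projection matrices are random stand-ins, and the layer normalization, residual connection, and feed-forward network are omitted.

```python
import numpy as np

rng = np.random.default_rng(2)

def l2norm(x, axis=-1):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + 1e-6)

def linear_cross_attention(I_z, I_x):
    """Sketch of the DFM's linear cross-attention between branches.
    Cost is O(N * d^2) rather than the O(N^2 * d) of softmax attention."""
    d = I_z.shape[1]
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    Q = l2norm(I_z @ Wq)        # template tokens as queries (cosine-normalized)
    K = l2norm(I_x @ Wk)        # search tokens as keys
    V = I_x @ Wv                # search tokens as values
    # linear attention: associate (K^T V) first -> a d x d matrix, never N x N
    return Q @ (K.T @ V)

I_z = rng.standard_normal((64, 32))    # 64 template tokens, dim 32
I_x = rng.standard_normal((256, 32))   # 256 search tokens, dim 32
out = linear_cross_attention(I_z, I_x)
print(out.shape)   # (64, 32)
```

Computing $K^{\top} V$ first avoids ever materializing an $N \times N$ attention matrix, which is what keeps the interaction affordable at tracking frame rates.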
For detection, we use an anchor-free head that predicts target classification, confidence, and bounding box regression. Given a coordinate $(c_x, c_y)$ in the feature map and predicted offsets $(t_x, t_y, t_w, t_h)$, the bounding box is calculated as:
$$
\begin{aligned}
b_x &= (2 \times \sigma(t_x) - 0.5) + c_x \\
b_y &= (2 \times \sigma(t_y) - 0.5) + c_y \\
b_w &= e^{t_w} \\
b_h &= e^{t_h}
\end{aligned}
$$
where $\sigma(\cdot)$ is the sigmoid function. The loss function combines focal loss for classification and objectness, and Generalized IoU (GIoU) loss for regression:
$$
\mathcal{L} = \frac{1}{N_{\text{pos}}} \left\{ \sum_j \mathcal{L}_{\text{cls}}(s_j, [\text{gtl}_j \geq 1]) + \sum_j [\text{gtl}_j \geq 1] \mathcal{L}_{\text{reg}}(t_j, \text{gtb}_j) + \sum_j \mathcal{L}_{\text{obj}}(o_j, [\text{gtl}_j \geq 1]) \right\}
$$
Here, $N_{\text{pos}}$ is the number of positive samples, $s_j$ is the classification score, $t_j$ is the predicted offset, $o_j$ is the object confidence score, and $[\cdot]$ is an indicator function.
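The bounding-box decoding above can be verified with a few lines of code. This is a direct transcription of the equations, not the full head: the loss terms are omitted, and the grid coordinates and offsets below are illustrative values.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def decode_box(cx, cy, t):
    """Decode predicted offsets (t_x, t_y, t_w, t_h) at grid cell (cx, cy)
    into box centre (b_x, b_y) and size (b_w, b_h)."""
    tx, ty, tw, th = t
    bx = (2.0 * sigmoid(tx) - 0.5) + cx   # centre offset lies in (-0.5, 1.5)
    by = (2.0 * sigmoid(ty) - 0.5) + cy
    bw = np.exp(tw)                       # size in units of the grid stride
    bh = np.exp(th)
    return bx, by, bw, bh

# zero offsets place the centre half a cell past (cx, cy) with unit size
print(decode_box(3, 4, (0.0, 0.0, 0.0, 0.0)))   # (3.5, 4.5, 1.0, 1.0)
```

Bounding the centre offset to $(-0.5, 1.5)$ lets adjacent cells compete for the same target while keeping the regression well-conditioned.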
We conducted extensive experiments on the Anti-UAV dataset, a large-scale benchmark for infrared UAV tracking. The dataset includes various challenging scenarios, such as buildings and cloudy skies, simulating the real-world conditions in which UAVs operate. We compared our method with several state-of-the-art trackers, including Siamese-based and Transformer-based approaches. The evaluation metrics include Success Rate (AUC), Precision, and Status Accuracy (SA), which measures the tracker's ability to perceive UAV states over long sequences. The results are summarized in the table below:
| Tracker | Avg. Status Accuracy | Success Rate | Precision | FPS |
|---|---|---|---|---|
| SiamCAR | 0.250 | 0.236 | 0.289 | 55.7 |
| Ocean | 0.248 | 0.235 | 0.291 | 43.1 |
| OSTrack | 0.352 | 0.334 | 0.423 | 46.4 |
| GRM | 0.366 | 0.344 | 0.429 | 13.5 |
| AiATrack | 0.481 | 0.459 | 0.584 | 39.2 |
| GlobalTrack | 0.553 | 0.532 | 0.711 | 9.7 |
| SiamYOLO | 0.617 | 0.589 | 0.789 | 37.1 |
| Unicorn | 0.637 | 0.621 | 0.801 | 29.2 |
| EANTrack | 0.698 | 0.677 | 0.868 | 43.6 |
| LGTrack | 0.725 | 0.696 | 0.914 | 25.0 |
| Our Method | 0.769 | 0.743 | 0.935 | 34.8 |
As shown, our method achieves the highest scores in all key metrics, with an average status accuracy of 0.769, surpassing the second-best tracker by 4.4 percentage points. This demonstrates its superior capability in tracking infrared UAV targets, including those in challenging environments. The inference speed of 34.8 FPS meets real-time requirements, making it suitable for practical applications.
To further validate the effectiveness of individual components, we performed ablation studies. The table below compares different attention modules when integrated into the FPN:
| Attention Module | Params (M) | FLOPs (G) | Avg. Status Accuracy | Success Rate | Precision |
|---|---|---|---|---|---|
| SE | 2.524 | 0.105 | 0.643 | 0.632 | 0.813 |
| CBAM | 2.525 | 0.106 | 0.684 | 0.677 | 0.835 |
| CPCA | 1.846 | 0.859 | 0.698 | 0.680 | 0.841 |
| SCAM (Ours) | 1.460 | 0.657 | 0.723 | 0.703 | 0.876 |
Our SCAM yields the best performance with the fewest parameters among the compared modules, and lower FLOPs than CPCA, indicating its efficiency. We also ablated the design choices within SCAM. For the SMA, using multi-scale kernels (3,5,7,9) achieved the best results, as shown below:
| Kernel Sizes | Avg. Status Accuracy | Success Rate | Precision |
|---|---|---|---|
| (3,3,3,3) | 0.665 | 0.636 | 0.827 |
| (5,5,5,5) | 0.662 | 0.633 | 0.826 |
| (7,7,7,7) | 0.654 | 0.627 | 0.813 |
| (3,5,7,9) | 0.677 | 0.653 | 0.849 |
| (3,7,11,15) | 0.657 | 0.632 | 0.817 |
For the GCSA, the dual-branch structure with both global and local attention proved most effective:
| Branch Type | Global | Local | Avg. Status Accuracy | Success Rate | Precision |
|---|---|---|---|---|---|
| Single | Yes | No | 0.629 | 0.617 | 0.808 |
| Single | No | Yes | 0.633 | 0.626 | 0.812 |
| Dual | Yes | No | 0.641 | 0.630 | 0.822 |
| Dual | No | Yes | 0.646 | 0.635 | 0.827 |
| Dual | Yes | Yes | 0.659 | 0.647 | 0.838 |
Additionally, the DFM significantly outperforms simple fusion strategies like addition or concatenation:
| Fusion Module | Avg. Status Accuracy | Success Rate | Precision |
|---|---|---|---|
| Add | 0.626 | 0.618 | 0.795 |
| Concat | 0.630 | 0.623 | 0.803 |
| DFM (Ours) | 0.664 | 0.650 | 0.851 |
Qualitative analysis on challenging sequences further confirms the robustness of our method. For instance, in scenarios where the UAV moves out of view or is occluded by similar objects such as birds, our tracker maintains accurate localization and quickly reacquires the target upon reappearance. The SCAM's ability to capture long-range spatial dependencies and the DFM's global feature interaction are key to handling such difficulties, which are common in real-world UAV monitoring. We also tested our method on a custom infrared UAV dataset with diverse backgrounds, such as buildings and trees, and observed consistent performance, demonstrating its generalization capability.
In conclusion, we have presented a spatial-semantic combined perception method for infrared UAV target tracking. By integrating the Spatial-Semantic Combine Attention Module and the Dual-branch global Feature interaction Module, our approach effectively mitigates information loss in multi-scale feature fusion, enhancing the tracking of UAV targets in complex environments. The SCAM captures both spatial long-range dependencies and channel-wise semantic information, while the DFM enables comprehensive feature interaction between template and search branches. Experimental results on benchmark datasets show that our method achieves state-of-the-art performance in terms of accuracy, success rate, and precision, with real-time inference speed. This work contributes to the advancement of infrared-based counter-UAV systems, offering a robust solution for tracking UAV targets under challenging conditions. Future research may focus on optimizing the model for deployment on embedded or mobile platforms to further improve real-time applicability in field operations.
