A Spatial-Semantic Combined Perception Network for Robust Infrared UAV Target Tracking

In recent years, the proliferation of unmanned aerial vehicles (UAVs) has introduced significant challenges to airspace security and public safety. Effective countermeasures require reliable detection and tracking systems. Thermal infrared imaging offers a critical advantage for persistent surveillance: it operates effectively in both day and night conditions and provides strong contrast for UAV targets against varied backgrounds. Consequently, infrared-based target tracking has become a cornerstone technology in anti-UAV systems. However, tracking a UAV in infrared imagery remains a formidable task due to challenges such as low signal-to-noise ratio, small target size, complex backgrounds with thermal clutter, and frequent occlusion or out-of-view scenarios.

Modern tracking paradigms, particularly those based on Siamese networks, have shown promising performance by balancing accuracy and efficiency. These frameworks typically employ a deep backbone network to extract multi-scale features from a template (initial target) and a search region. A critical step involves fusing these multi-scale features to build a robust representation that captures both fine-grained spatial details from shallow layers and high-level semantic information from deep layers. The Feature Pyramid Network (FPN) is a common architecture for this purpose. However, a significant limitation exists: the standard channel reduction operation (e.g., a 1×1 convolution) used to align feature dimensions during fusion leads to substantial information loss. This loss encompasses crucial spatial contextual information and discriminative channel-wise semantic cues. For a small, fast-moving UAV target, this loss directly translates to degraded localization accuracy, increased susceptibility to similar distractors, and ultimately, tracking failure.

To tackle this fundamental issue, we propose a novel Spatial-Semantic Combined Perception Network (SSCPNet) for infrared UAV tracking. Our core innovation lies in redesigning the feature fusion and interaction process to maximally preserve and enhance both spatial and semantic information. The main contributions of this work are threefold. First, we introduce a Spatial-Semantic Combined Attention Module (SSCAM) that integrates a Spatial Multi-scale Attention (SMA) mechanism and a Global-Local Channel Semantic Attention (GCSA) mechanism. This module replaces the standard channel reduction, actively strengthening the network’s focus on long-range spatial dependencies and critical semantic features. Second, we design a Dual-branch Global Feature Interaction Module (DGFM) that facilitates comprehensive information exchange between the template and search branch features using efficient attention, leading to more accurate target state estimation. Third, extensive experiments on the Anti-UAV infrared dataset demonstrate that our method achieves state-of-the-art performance. Comprehensive ablation studies validate the effectiveness of each proposed component.

Proposed Methodology

The overall architecture of our SSCPNet is based on a Siamese framework, comprising four key components: a shared backbone network for feature extraction, a multi-scale feature fusion neck with our proposed SSCAM, a Dual-branch Global Feature Interaction Module (DGFM), and an anchor-free prediction head. The template image $Z$ and the search image $X$ are fed into a weight-shared ResNet-50 backbone. Let $F_z^{(k)}$ and $F_x^{(k)}$ denote the feature maps extracted from the $k$-th stage ($k \in \{3,4,5\}$) for the template and search branch, respectively. These multi-scale features are then processed by our enhanced feature pyramid.

Spatial-Semantic Combined Attention Module (SSCAM)

The standard FPN fusion operation can be formulated as:
$$P_{k} = \text{Conv}_{1\times1}(C_{k}) + \text{UpSample}(P_{k+1})$$
where $C_k$ is the feature from the backbone’s $k$-th stage, and $P_{k}$ is the fused feature. The $1\times1$ convolution $\text{Conv}_{1\times1}$ reduces channels but indiscriminately squeezes spatial and channel information. Our SSCAM redefines this process:
$$P_{k} = \text{SSCAM}(C_{k}) + \text{UpSample}(P_{k+1})$$
where $\text{SSCAM}(\cdot)$ applies joint spatial and channel attention to preserve information.
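The top-down fusion order implied by the two equations above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `fuse_pyramid`, `sscam`, and `upsample` are hypothetical names, and the attention and upsampling operators are passed in as callables so only the fusion recurrence is fixed.

```python
def fuse_pyramid(features, sscam, upsample):
    """Top-down FPN fusion with SSCAM replacing the 1x1 lateral conv.

    features: backbone outputs [C3, C4, C5], ordered shallow to deep.
    sscam(f) and upsample(f) are caller-supplied callables, so this
    sketch only fixes the fusion order: the deepest level is
    P5 = SSCAM(C5), then P_k = SSCAM(C_k) + UpSample(P_{k+1}).
    """
    P = [None] * len(features)
    P[-1] = sscam(features[-1])  # deepest level: no top-down term
    for k in range(len(features) - 2, -1, -1):
        P[k] = sscam(features[k]) + upsample(P[k + 1])
    return P
```

With scalar stand-ins (`sscam` doubling its input, `upsample` the identity), `fuse_pyramid([1, 2, 3], ...)` yields `[12, 10, 6]`, matching the recurrence level by level.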

Spatial Multi-scale Attention (SMA): To capture the long-range spatial context crucial for locating a small UAV, we design SMA. For an input feature $X \in \mathbb{R}^{C\times H\times W}$, we first generate direction-aware descriptors via global average pooling along each spatial axis:
$$z_c^h(i) = \frac{1}{W}\sum_{0\le j < W}X_c(i,j) \quad \text{and} \quad z_c^w(j) = \frac{1}{H}\sum_{0\le i < H}X_c(i,j)$$
These vectors encode global spatial information. They are concatenated and split into $K$ groups ($K=4$). Each group is processed by a depth-wise 1D convolution with a different kernel size to capture multi-scale dependencies. The kernel sizes are $\{3,5,7,9\}$. The process for the $i$-th group is:
$$g_i = \text{DWConv1d}_{k_i}(f_i), \quad i=1,…,K$$
where $k_i = 2i+1$. The processed features are then normalized within groups and used to generate spatial attention maps $A_h$ and $A_w$ via a sigmoid function. The final output is an element-wise multiplication of the input with the broadcasted attention maps:
$$Y_c(i,j) = X_c(i,j) \times A_c^h(i) \times A_c^w(j)$$
This allows the network to focus on relevant spatial regions across long distances, which is vital when tracking a distant UAV.
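The directional pooling and reweighting steps can be sketched as below. This is a simplified numerical illustration under our own assumptions: the grouped multi-kernel DWConv1d stage is replaced by an identity mapping, and `sma_directional_attention` is a hypothetical name, so only the pooling and the broadcasted element-wise reweighting $Y_c(i,j) = X_c(i,j)\,A_c^h(i)\,A_c^w(j)$ are shown.

```python
import numpy as np

def sma_directional_attention(X):
    """Minimal sketch of the SMA directional-pooling step.

    X: feature map of shape (C, H, W). Pools along width and height to
    obtain direction-aware descriptors z_h (C, H) and z_w (C, W), maps
    them to sigmoid attention, and reweights X as
    Y[c, i, j] = X[c, i, j] * A_h[c, i] * A_w[c, j].
    The grouped multi-kernel DWConv1d stage of the paper is omitted
    here; the pooled descriptors feed the sigmoid directly.
    """
    z_h = X.mean(axis=2)              # pool over width  -> (C, H)
    z_w = X.mean(axis=1)              # pool over height -> (C, W)
    A_h = 1.0 / (1.0 + np.exp(-z_h))  # sigmoid attention along H
    A_w = 1.0 / (1.0 + np.exp(-z_w))  # sigmoid attention along W
    return X * A_h[:, :, None] * A_w[:, None, :]
```

Note that a zero feature map stays zero (attention merely rescales the input), and the output shape always matches the input.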

Global-Local Channel Semantic Attention (GCSA): To complement spatial attention and emphasize semantically important features (such as the distinct thermal signature of a UAV against sky or buildings), we propose GCSA. Standard channel attention uses global average pooling (GAP) only. GCSA instead employs a dual-path design. For an input $F$, we compute both global and local channel statistics. The global descriptor $u_g$ is obtained via GAP followed by a linear layer. The local descriptor $u_l$ is obtained by aggregating features within a local window (size 5) along the channel dimension.
$$u_g = W_1(\text{GAP}(F)), \quad u_l = \sum_{i=1}^{K} F \odot b_i$$
where $b_i$ is a banded matrix for local aggregation. We then model the interaction between global and local contexts to refine both:
$$m_l = u_l \otimes u_g^T, \quad m_g = u_g \otimes u_l^T$$
$$u_l^* = u_l \odot \sigma(\text{Conv}_{row}(m_l)), \quad u_g^* = u_g \odot \sigma(\text{Conv}_{row}(m_g))$$
The final channel attention weight is the sum of the refined descriptors, applied to the input feature:
$$F_{\text{output}} = F \otimes \sigma(u_l^* + u_g^*)$$
This mechanism ensures that the network prioritizes channels containing discriminative semantic information about the target UAV while suppressing irrelevant background channels.
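The two descriptor paths can be illustrated numerically as follows. This is a sketch under our own assumptions: the linear projection $W_1$ is omitted, the banded-matrix aggregation is realized as a channel-axis sliding-window mean (one plausible instantiation, not necessarily the paper's), and `gcsa_descriptors` is a hypothetical name.

```python
import numpy as np

def gcsa_descriptors(F, window=5):
    """Global and local channel descriptors for GCSA (sketch).

    F: feature map of shape (C, H, W). u_g is the plain GAP vector
    (the linear layer W1 is omitted); u_l averages each channel's GAP
    value with its `window`-sized neighbourhood along the channel
    axis, one simple realisation of banded-matrix aggregation.
    """
    C = F.shape[0]
    gap = F.mean(axis=(1, 2))  # per-channel global statistic, shape (C,)
    u_g = gap
    half = window // 2
    # Sliding-window mean over channels, clipped at the boundaries.
    u_l = np.array([gap[max(0, c - half):c + half + 1].mean() for c in range(C)])
    return u_g, u_l
```

On a constant feature map both descriptors coincide; they differ only where neighbouring channels carry different statistics, which is exactly the local context the global path cannot see.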

Dual-branch Global Feature Interaction Module (DGFM)

After obtaining enhanced multi-scale features $P_z^{(k)}$ and $P_x^{(k)}$ for template and search branches, we aim for deeper interaction beyond simple cross-correlation. We propose DGFM, which uses the template features as a query to attend to the search region features globally. To maintain efficiency, we employ linear attention.

First, features are flattened, layer-normalized, and projected:
$$I_z = \text{LN}(\text{Flatten}(P_z)), \quad I_x = \text{LN}(\text{Flatten}(P_x))$$
$$Q = W_q I_z, \quad K = W_k I_x, \quad V = W_v I_x$$
The linear attention is computed as:
$$\tilde{I}_x = Q \cdot \frac{\text{sim}(K^T V)}{\text{dim}}$$
where $\text{sim}(\cdot)$ is a cosine-similarity kernel and $\text{dim}$ is a normalizing constant. The output is then processed by a Feed-Forward Network (FFN) with residual connections:
$$Y_x = \text{FFN}(\text{LN}(I_z + \tilde{I}_x)) + I_z + \tilde{I}_x$$
This module allows the search-branch features to be explicitly modulated by the most relevant template information, improving discrimination of the target UAV from distractors.
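The linear-attention step above can be sketched as follows. This is an illustrative reading, not the paper's exact implementation: we realize the cosine-similarity kernel by L2-normalizing $Q$ and $K$, compute $K^\top V$ first so the cost is linear in the token count, and divide by the number of key tokens as the normalizing constant; `linear_cross_attention` and the identity projection matrices in the test are our own assumptions.

```python
import numpy as np

def linear_cross_attention(Iz, Ix, Wq, Wk, Wv):
    """Linear cross-attention sketch for the DGFM.

    Iz: template tokens, shape (Nz, C); Ix: search tokens, shape (Nx, C).
    Cosine-style kernel: L2-normalise Q and K, aggregate keys and
    values into a (C, C) summary, then let every query read from it.
    Cost is O(N * C^2) instead of O(N^2 * C) for softmax attention.
    """
    Q = Iz @ Wq
    K = Ix @ Wk
    V = Ix @ Wv
    Q = Q / (np.linalg.norm(Q, axis=-1, keepdims=True) + 1e-6)
    K = K / (np.linalg.norm(K, axis=-1, keepdims=True) + 1e-6)
    context = K.T @ V                 # (C, C): global key-value summary
    return Q @ context / K.shape[0]   # (Nz, C), normalised by token count
```

Because $K^\top V$ is computed once, the attended output for any number of query tokens follows from a single matrix product, which is what keeps the interaction real-time.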

Prediction Head and Loss Function

We adopt an anchor-free prediction head for its simplicity and effectiveness on small targets. For each location on the output feature map, the head predicts a classification score $s$, an objectness score $o$, and bounding box offsets $(t_x, t_y, t_w, t_h)$. The final box $(b_x, b_y, b_w, b_h)$ is decoded as:
$$b_x = (2\sigma(t_x)-0.5)+c_x, \quad b_w = p_w e^{t_w}$$
$$b_y = (2\sigma(t_y)-0.5)+c_y, \quad b_h = p_h e^{t_h}$$
where $(c_x, c_y)$ is the grid center, $(p_w, p_h)$ is a base size, and $\sigma$ is the sigmoid function.
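The decoding equations above translate directly into code. The following is a minimal sketch (`decode_box` is a hypothetical helper name; the offset scaling follows the formulas given, with the center term $2\sigma(t)-0.5$ allowing the box center to move slightly beyond its grid cell):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def decode_box(t, grid_center, base_size):
    """Decode offsets (tx, ty, tw, th) into a box (bx, by, bw, bh).

    grid_center = (cx, cy) is the cell center; base_size = (pw, ph)
    is the per-location prior size. Follows the decoding equations:
    b = (2*sigmoid(t) - 0.5) + c for the center, p * exp(t) for size.
    """
    tx, ty, tw, th = t
    cx, cy = grid_center
    pw, ph = base_size
    bx = (2.0 * sigmoid(tx) - 0.5) + cx
    by = (2.0 * sigmoid(ty) - 0.5) + cy
    bw = pw * math.exp(tw)
    bh = ph * math.exp(th)
    return bx, by, bw, bh
```

For zero offsets, $\sigma(0)=0.5$, so the center lands half a cell past $(c_x, c_y)$ and the box keeps its base size: `decode_box((0, 0, 0, 0), (3, 4), (8, 8))` gives `(3.5, 4.5, 8.0, 8.0)`.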

The total loss $L$ is a weighted sum:
$$L = \frac{1}{N_{\text{pos}}}\sum_j \left[ L_{\text{cls}}(s_j, g_j) + \lambda_1 \mathbb{1}_{g_j \ge 1} L_{\text{reg}}(t_j, b_j^*) + \lambda_2 L_{\text{obj}}(o_j, g_j) \right]$$
where $L_{\text{cls}}$ and $L_{\text{obj}}$ are focal losses, $L_{\text{reg}}$ is the GIoU loss, $g_j$ is the ground-truth label, $b_j^*$ is the target box, and $\lambda_1, \lambda_2$ are balancing weights.
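The regression term $L_{\text{reg}}$ is the GIoU loss, which augments $1-\mathrm{IoU}$ with a penalty for the empty area of the smallest enclosing box. A self-contained sketch (`giou_loss` is a hypothetical helper name; boxes are assumed to be valid $(x_1, y_1, x_2, y_2)$ corners):

```python
def giou_loss(box_a, box_b):
    """GIoU loss between two boxes given as (x1, y1, x2, y2).

    L = 1 - GIoU, with GIoU = IoU - |C \\ (A U B)| / |C|, where C is
    the smallest axis-aligned box enclosing both A and B.
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Intersection rectangle (clamped to zero when boxes are disjoint).
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    iou = inter / union
    # Smallest enclosing box C.
    cx1, cy1 = min(ax1, bx1), min(ay1, by1)
    cx2, cy2 = max(ax2, bx2), max(ay2, by2)
    area_c = (cx2 - cx1) * (cy2 - cy1)
    giou = iou - (area_c - union) / area_c
    return 1.0 - giou
```

Identical boxes give a loss of 0; touching but non-overlapping boxes give 1; boxes separated by empty space push the loss above 1, which is what lets GIoU provide a gradient even at zero overlap.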

Experiments and Analysis

We evaluate our method on the public Anti-UAV infrared dataset, which contains challenging sequences of UAV targets. We follow the standard protocol, using non-overlapping sequences for training and testing. All experiments are conducted on an NVIDIA GeForce RTX 3090 GPU.

Quantitative Results

We compare SSCPNet against several state-of-the-art trackers, including SiamCAR, Ocean, OSTrack, AiATrack, GlobalTrack, and Unicorn. The evaluation metrics are Success Rate (AUC), Precision, and State Accuracy (SA). The results are summarized below.

Tracker SA Success Precision FPS
SiamCAR 0.250 0.236 0.289 55.7
Ocean 0.248 0.235 0.291 43.1
OSTrack 0.352 0.334 0.423 46.4
AiATrack 0.481 0.459 0.584 39.2
GlobalTrack 0.553 0.532 0.711 9.7
Unicorn 0.637 0.621 0.801 29.2
SSCPNet (Ours) 0.769 0.743 0.935 34.8

Our method achieves the best performance across all three accuracy metrics, with a significant lead in State Accuracy (0.769), which is critical for evaluating long-term tracking robustness. The speed of 34.8 FPS meets the requirement for real-time UAV tracking.

Ablation Studies

We conduct thorough ablation experiments to validate the contribution of each component. The baseline is a Siamese tracker with a standard FPN and anchor-free head.

Effectiveness of SSCAM: We first replace the SSCAM with other attention modules. The results show the superiority of our combined design.

Attention Module SA Success Precision Params (M)
SE (Channel Only) 0.643 0.632 0.813 2.52
CBAM (Serial) 0.684 0.677 0.835 2.53
Baseline + SMA Only 0.677 0.653 0.849 1.41
Baseline + GCSA Only 0.659 0.647 0.838 1.44
Baseline + SSCAM (Full) 0.723 0.703 0.876 1.46

The table confirms that both spatial (SMA) and semantic (GCSA) components contribute positively, and their combination yields the best performance with a modest parameter increase.

Analysis of SMA Kernel Sizes: We investigate the choice of multi-scale kernels in SMA. A set of moderately increasing kernel sizes performs best for tracking a UAV whose apparent size varies.

Kernel Size Set SA Success
(3,3,3,3) 0.665 0.636
(7,7,7,7) 0.654 0.627
(3,5,7,9) 0.677 0.653
(3,7,11,15) 0.657 0.632

Effectiveness of DGFM: We compare our DGFM with simple feature fusion strategies like addition and concatenation.

Fusion Method SA Success Precision
Addition 0.626 0.618 0.795
Concatenation 0.630 0.623 0.803
DGFM (Ours) 0.664 0.650 0.851

The proposed global attention-based interaction provides a clear advantage, enabling more informed feature integration for localizing the UAV target.

Qualitative Evaluation and Generalization

Qualitative results on challenging sequences demonstrate the robustness of SSCPNet. In scenarios where the UAV temporarily leaves the field of view, our tracker, empowered by robust feature representations, can quickly re-acquire the target upon its return. When facing similar thermal distractors (e.g., birds), the enhanced semantic discrimination from GCSA helps maintain a stable track. In complex backgrounds with buildings or heavy clouds, the long-range spatial dependencies captured by SMA aid in distinguishing the target from cluttered thermal noise.

Furthermore, tests on a separate, internally collected dataset of infrared UAV sequences confirm the strong generalization capability of our method. The consistent performance across different datasets underscores the effectiveness of the proposed spatial-semantic perception principles.

Conclusion

In this paper, we address the critical problem of spatial and semantic information loss in multi-scale feature fusion for infrared UAV target tracking. The proposed Spatial-Semantic Combined Perception Network (SSCPNet) introduces two novel modules: the SSCAM and the DGFM. The SSCAM jointly applies multi-scale spatial attention and global-local channel attention to preserve and enhance discriminative features for small, fast-moving UAV targets. The DGFM enables deep, global interaction between template and search features, leading to more accurate state estimation. Extensive experiments on the Anti-UAV benchmark show that our method sets a new state-of-the-art, achieving a significant improvement in tracking accuracy while maintaining real-time inference speed. This work provides a robust and effective solution for infrared-based counter-UAV systems, enhancing aerial security monitoring capabilities. Future work will focus on further optimizing the model for deployment on edge computing platforms to widen its applicability in field operations.
