Lightweight Infrared UAV Target Tracking Algorithm Based on Sparse Attention

With the rapid advancement of drone technology, Unmanned Aerial Vehicles (UAVs) have become integral to fields such as aerial photography, surveillance, and data collection. However, the proliferation of consumer drones poses significant threats to public safety and critical infrastructure. In thermal infrared imagery, drones often appear as small, low-resolution objects embedded in complex, noisy backgrounds, which makes reliable detection and tracking particularly challenging. Traditional sensing approaches, including visible-light video, acoustic sensing, and radar- and radio-frequency-based detection, suffer from limited detection range and high cost, especially when dealing with micro-UAVs in urban environments. In contrast, thermal infrared imaging offers all-weather operation at low cost, making it a promising solution for UAV detection. Despite progress in deep learning-based object tracking, many state-of-the-art models incur high computational costs, which hinders deployment on resource-constrained edge devices. This paper addresses these issues by proposing a lightweight enhanced SiamDT network, termed SiamDT-NG, which incorporates Ghost convolution and Native Sparse Attention (NSA) mechanisms to improve tracking accuracy while reducing computational overhead.

Recent years have seen significant breakthroughs in object tracking leveraging deep learning. For instance, Zhao et al. introduced an Anti-decay Long Short-Term Memory (AD-LSTM) network that mitigates feature degradation in long-term tracking by using dual-cell units. Wang et al. proposed a Target-Aware Attention (TANet) framework that combines local and global search with adversarial learning to generate spatiotemporally consistent attention maps. Shen et al. applied Transformer architectures to object tracking, developing NTTrRT and NTTrCT variants that use multi-head attention to capture non-stationary motion features. While these methods enhance accuracy, they often come with increased computational complexity. Siamese network-based trackers, such as SiamFC, SiamRPN, and SiamDT, strike a balance between accuracy and efficiency. SiamDT, specifically designed for infrared UAV tracking, incorporates a Dual-Semantic Region Proposal Network (DS-RPN) and a Versatile Region Convolutional Neural Network (VR-CNN) to handle small targets and background clutter. However, SiamDT’s high computational demands limit its practicality for edge deployment. Our work builds on SiamDT by integrating lightweight components to optimize performance for infrared UAV tracking.

The proposed SiamDT-NG model enhances the original SiamDT architecture through two key modifications: Ghost convolution and NSA. Ghost convolution reduces computational costs by generating intrinsic feature maps with a few conventional convolutions and then deriving additional “ghost” feature maps via linear transformations. This approach significantly cuts down parameters and floating-point operations (FLOPs) without compromising performance. For an input of size $C \times H \times W$ and output of size $N \times H' \times W'$, the computational ratio $r_s$ between conventional and Ghost convolution is given by:

$$ r_s = \frac{N \cdot H' \cdot W' \cdot C \cdot k \cdot k}{\frac{N}{s} \cdot H' \cdot W' \cdot C \cdot k \cdot k + (s-1) \cdot \frac{N}{s} \cdot H' \cdot W' \cdot d \cdot d} = \frac{s \cdot C \cdot k \cdot k}{C \cdot k \cdot k + (s-1) \cdot d \cdot d} $$

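where $k$ is the kernel size of the conventional convolution, $d$ is the kernel size of the linear transformation, and $s$ is the number of feature maps produced per intrinsic map (one identity mapping plus $s-1$ linear transformations). Similarly, the parameter ratio $r_c$ is: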

$$ r_c = \frac{N \cdot C \cdot k \cdot k}{\frac{N}{s} \cdot C \cdot k \cdot k + (s-1) \cdot \frac{N}{s} \cdot d \cdot d} = \frac{s \cdot C \cdot k \cdot k}{C \cdot k \cdot k + (s-1) \cdot d \cdot d} $$
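To make the two ratios concrete, the sketch below counts the weights of a conventional convolution and of a Ghost convolution directly and compares the result with the closed form. The function names and the example layer sizes (256 channels, 3×3 kernels, $s = 2$) are illustrative assumptions, not values reported for SiamDT-NG. Because the FLOPs count is the weight count multiplied by $H' \cdot W'$, the same number serves as both $r_s$ and $r_c$.

```python
def conv_params(C, N, k):
    """Weights of a conventional k x k convolution from C to N channels."""
    return N * C * k * k


def ghost_params(C, N, k, d, s):
    """Weights of a Ghost convolution producing the same N output channels."""
    if N % s != 0:
        raise ValueError("N must be divisible by s")
    intrinsic = N // s                      # maps made by the ordinary convolution
    primary = intrinsic * C * k * k         # cost of the intrinsic maps
    cheap = (s - 1) * intrinsic * d * d     # d x d linear transformations
    return primary + cheap


def ghost_ratio(C, N, k, d, s):
    """Direct count: conventional cost divided by Ghost cost."""
    return conv_params(C, N, k) / ghost_params(C, N, k, d, s)


def closed_form(C, k, d, s):
    """Simplified ratio s*C*k^2 / (C*k^2 + (s-1)*d^2)."""
    return s * C * k * k / (C * k * k + (s - 1) * d * d)


# Hypothetical layer: 256 -> 256 channels, 3x3 kernels, s = 2.
ratio = ghost_ratio(C=256, N=256, k=3, d=3, s=2)
print(round(ratio, 3), round(closed_form(C=256, k=3, d=3, s=2), 3))
```

For these sizes the ratio comes out just under 2, which illustrates the usual rule of thumb that the saving approaches $s$ when $d$ is close to $k$ and $s$ is much smaller than $C$.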

The NSA mechanism improves attention efficiency by employing a dynamic hierarchical sparse strategy. It involves coarse-grained compression, fine-grained selection, a sliding window, and gated fusion. In coarse-grained compression, the sequence is divided into blocks, and key (K) and value (V) vectors are compressed to represent global information:

$$ K_{\text{cmp}}^{(t)} = \left\{ \phi\left(k_{i \cdot d + 1 \,:\, i \cdot d + l}\right) \;\middle|\; 0 \leq i \leq \left\lfloor \frac{t - l}{d} \right\rfloor \right\} $$

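where $\phi(\cdot)$ is a compression operator, $l$ is the block length, and $d$ is the compression stride between adjacent blocks. Fine-grained selection then scores each compressed block against the query, and the keys and values of the highest-scoring blocks are retained at full resolution: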

$$ p_{\text{slc}} = \text{softmax}\left( \frac{Q K_{\text{cmp}}^T}{\sqrt{d_k}} \right) $$
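The compression and selection steps can be sketched for a single query as follows. This is a minimal illustration under stated assumptions, not the NSA implementation: mean pooling stands in for the learnable operator $\phi$, and the helper names (`compress_blocks`, `selection_scores`, `top_blocks`) and the toy keys are hypothetical.

```python
import math


def compress_blocks(keys, l, d):
    """Mean-pool blocks of length l taken every d positions.

    Mean pooling is an assumed stand-in for the compression operator phi.
    """
    blocks = []
    for start in range(0, len(keys) - l + 1, d):
        block = keys[start:start + l]
        dim = len(block[0])
        blocks.append([sum(v[j] for v in block) / l for j in range(dim)])
    return blocks


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def selection_scores(q, k_cmp):
    """softmax(q . K_cmp^T / sqrt(d_k)) for one query vector q."""
    d_k = len(q)
    logits = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in k_cmp]
    return softmax(logits)


def top_blocks(scores, n):
    """Indices of the n highest-scoring blocks, in ascending order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n])


# Toy sequence of 8 two-dimensional keys; blocks start at positions 0, 2, 4.
keys = [[float(i), 1.0] for i in range(8)]
k_cmp = compress_blocks(keys, l=4, d=2)
p_slc = selection_scores([1.0, 0.0], k_cmp)
chosen = top_blocks(p_slc, n=2)
```

Only the blocks listed in `chosen` would have their original keys and values passed to the selection branch, which is where the sparsity comes from.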

The sliding window mechanism handles local context by defining windowed K and V as:

$$ K_{\text{win}}^{(t)} = k_{t-w:t}, \quad V_{\text{win}}^{(t)} = v_{t-w:t} $$

where $w$ is the window size. Finally, the outputs are fused via a gating mechanism:

$$ \text{Attn}(Q, K, V) = \sum_{c \in \{\text{cmp,slc,win}\}} g_c \cdot \text{Attn}_c(Q, K_c, V_c) $$

where $g_c$ are learnable gating scores. These components are integrated into the DS-RPN and VR-CNN modules of SiamDT, resulting in a lightweight yet accurate tracker for infrared UAV targets.
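The sliding-window branch and the gated fusion can be illustrated with a small single-query sketch. Several points are assumptions made for the example: the window slice is treated as inclusive of position $t$, the compressed and selected branch outputs are placeholder vectors, and the gates are fixed constants rather than values produced by a learned gating network.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(q, keys, values):
    """Scaled dot-product attention for a single query vector."""
    d_k = len(q)
    logits = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
    weights = softmax(logits)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]


def window(seq, t, w):
    """Items at positions t-w .. t (inclusive), clipped at the sequence start."""
    return seq[max(0, t - w): t + 1]


def gated_fusion(branch_outputs, gates):
    """Sum of branch outputs, each weighted by its gate g_c."""
    dim = len(next(iter(branch_outputs.values())))
    return [sum(gates[c] * branch_outputs[c][j] for c in branch_outputs)
            for j in range(dim)]


keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
values = [[1.0], [2.0], [3.0], [4.0]]
q = [1.0, 0.0]
t, w = 3, 1

out_win = attend(q, window(keys, t, w), window(values, t, w))
outputs = {"cmp": [2.5], "slc": [3.0], "win": out_win}   # cmp/slc are placeholders
gates = {"cmp": 0.2, "slc": 0.3, "win": 0.5}             # assumed fixed gates
fused = gated_fusion(outputs, gates)
```

Each branch attends over far fewer keys than the full sequence, so the fused output keeps local detail (window), salient blocks (selection), and a coarse global summary (compression) at reduced cost.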

Experiments were conducted on the Anti-UAV410 dataset, which includes 410 infrared videos split into training (200 videos), validation (90 videos), and test (120 videos) sets. Each video averages 1069 frames, and the videos cover diverse scenarios such as day and night conditions and seasonal variations. The hardware platform consisted of an Intel Core i7-9700K CPU and an NVIDIA GeForce RTX 3080 Ti GPU, and the software environment comprised Ubuntu 20.01, CUDA 11.1, Python 3.7, and PyTorch 1.9.1. Training used an SGD optimizer with an initial learning rate of 0.01, a batch size of 4, and 20 epochs. Evaluation followed the One-Pass Evaluation (OPE) protocol, using success rate, precision, and State Accuracy (SA) metrics. Success rate is based on Intersection over Union (IoU):

$$ \text{IoU} = \frac{\left| B_{\text{gt}} \cap B_{\text{pred}} \right|}{\left| B_{\text{gt}} \cup B_{\text{pred}} \right|} $$

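where $B_{\text{gt}}$ is the ground-truth bounding box, $B_{\text{pred}}$ is the predicted box, and $|\cdot|$ denotes area. Precision is the fraction of frames in which the Euclidean distance $\rho$ between the two box centers falls below a threshold (20 pixels in this work):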

$$ \rho = \sqrt{(x_{\text{gt}} - x_{\text{pred}})^2 + (y_{\text{gt}} - y_{\text{pred}})^2} $$
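A minimal sketch of these two measurements is given below. It assumes axis-aligned boxes in `(x, y, w, h)` format with `(x, y)` the top-left corner; the function names and the example boxes are illustrative and do not come from the Anti-UAV410 toolkit.

```python
import math


def iou(box_a, box_b):
    """Intersection over Union of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def center_error(box_a, box_b):
    """Euclidean distance rho between the centers of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    return math.hypot((ax + aw / 2) - (bx + bw / 2),
                      (ay + ah / 2) - (by + bh / 2))


def precision_at(gt_boxes, pred_boxes, threshold=20.0):
    """Fraction of frames whose center error is within the pixel threshold."""
    errors = [center_error(g, p) for g, p in zip(gt_boxes, pred_boxes)]
    return sum(e <= threshold for e in errors) / len(errors)


gt = [(0.0, 0.0, 10.0, 10.0), (0.0, 0.0, 10.0, 10.0)]
pred = [(5.0, 0.0, 10.0, 10.0), (30.0, 0.0, 10.0, 10.0)]

overlap = iou(gt[0], pred[0])        # 50 / 150
rho = center_error(gt[0], pred[0])   # centers are 5 pixels apart
prec = precision_at(gt, pred)        # one of the two frames is within 20 px
```

The success-rate AUC is then obtained by sweeping an IoU threshold from 0 to 1 and averaging the fraction of frames whose IoU exceeds each threshold.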

SA combines localization and state prediction:

$$ \text{SA} = \frac{1}{T} \sum_{t} \left[ \text{IoU}_t \cdot \delta(v_t > 0) + (1 - \delta(v_t > 0)) \cdot p_t \right] $$

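where $T$ is the number of frames, $v_t$ is the ground-truth visibility flag, $\delta$ is an indicator function, and $p_t$ scores the predicted state when the target is absent (1 if the tracker correctly reports that no target is present, 0 otherwise).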
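The SA formula can be sketched in a few lines. The function name and the four-frame example are hypothetical, and $p_t$ is taken here to be 1 when the tracker correctly reports an absent target and 0 otherwise, which is an assumed reading of the formula above.

```python
def state_accuracy(ious, visible, p):
    """Average of IoU on visible frames and p_t on frames without a target.

    ious[t]    : IoU between prediction and ground truth at frame t
    visible[t] : True when the ground truth marks the target as present
    p[t]       : state score used when the target is absent (assumed 0 or 1)
    """
    if not (len(ious) == len(visible) == len(p)):
        raise ValueError("all inputs must have one entry per frame")
    if not ious:
        raise ValueError("at least one frame is required")
    total = 0.0
    for iou_t, v_t, p_t in zip(ious, visible, p):
        total += iou_t if v_t else p_t
    return total / len(ious)


# Two visible frames, then two frames where the target has left the view.
# The tracker reports "absent" correctly only in the third frame.
sa = state_accuracy(
    ious=[0.8, 0.6, 0.0, 0.0],
    visible=[True, True, False, False],
    p=[0, 0, 1, 0],
)  # (0.8 + 0.6 + 1 + 0) / 4
```

A tracker that keeps emitting boxes after the target disappears is therefore penalized even if its localization on visible frames is good.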

Comparative experiments with SiamFC, SiamCAR, SwinTrack-Base, MixFormerV2-B, and SiamDT show that SiamDT-NG achieves the best accuracy among the compared trackers. Under OPE, SiamDT-NG attains a success rate AUC of 66.60% and a precision of 90.20% at a 20-pixel threshold, outperforming SiamDT by 6.20 percentage points in both metrics. SA reaches 67.93%, which is 6.38 percentage points above SiamDT's 61.55%, while FLOPs fall from 120.53 G to 62.65 G (about 48%) and parameters from 46.63 M to 38.50 M (about 17%). The complexity and SA results are summarized in Table 1.

Table 1: Comparative Results of Different Trackers on Anti-UAV410 Dataset
Tracker	FLOPs (G)	Parameters (M)	SA (%)
SiamFC	3.19	2.34	47.32
SiamCAR	54.83	28.56	46.93
SwinTrack-Base	139.40	90.89	55.74
MixFormerV2-B	253.86	58.80	59.65
SiamDT	120.53	46.63	61.55
SiamDT-NG	62.65	38.50	67.93

Ablation studies confirm the contribution of each component. As shown in Table 2, adding NSA alone raises SA from 61.55% to 67.50% at a small cost (FLOPs rise by 0.08 G and parameters by 0.68 M), whereas Ghost convolution alone cuts FLOPs to 62.57 G and parameters to 37.83 M and still improves SA to 62.61%. Combining both modules yields the best results, with an SA of 67.93%, 62.65 G FLOPs, and 38.50 M parameters.

Table 2: Ablation Study Results on SiamDT-NG Components
Experiment	NSA	GhostConv	FLOPs (G)	Parameters (M)	SA (%)
1 (Baseline)	×	×	120.53	46.63	61.55
2	√	×	120.61	47.31	67.50
3	×	√	62.57	37.83	62.61
4 (SiamDT-NG)	√	√	62.65	38.50	67.93
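The relative changes implied by Table 2 can be recomputed with a short script. The numbers are copied from the table; the dictionary layout and the helper name `relative_change` are illustrative choices.

```python
baseline = {"flops": 120.53, "params": 46.63, "sa": 61.55}

variants = {
    "NSA only": {"flops": 120.61, "params": 47.31, "sa": 67.50},
    "GhostConv only": {"flops": 62.57, "params": 37.83, "sa": 62.61},
    "SiamDT-NG": {"flops": 62.65, "params": 38.50, "sa": 67.93},
}


def relative_change(new, old):
    """Percentage change of new relative to old (negative means a reduction)."""
    return 100.0 * (new - old) / old


summary = {
    name: {
        "flops_pct": relative_change(v["flops"], baseline["flops"]),
        "params_pct": relative_change(v["params"], baseline["params"]),
        "sa_gain": v["sa"] - baseline["sa"],   # percentage points
    }
    for name, v in variants.items()
}

for name, row in summary.items():
    print(f"{name:15s} FLOPs {row['flops_pct']:+6.2f}%  "
          f"params {row['params_pct']:+6.2f}%  SA {row['sa_gain']:+5.2f} pts")
```

Read this way, the table shows that almost all of the SA gain comes from NSA, while almost all of the FLOPs and parameter savings come from Ghost convolution, so the two modules are largely complementary.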

Visualization results across urban environments, suburban areas, and high-clutter backgrounds demonstrate SiamDT-NG's robustness in tracking small UAV targets under challenging conditions. For example, in scenes with occlusions or dynamic backgrounds, the model maintains accurate bounding boxes where other trackers fail. Together with the reduced computational cost, this indicates the model's potential for real-world anti-UAV applications, where detection and tracking must be both efficient and reliable.

In conclusion, this paper presents SiamDT-NG, a lightweight infrared UAV target tracking algorithm that integrates Ghost convolution and NSA mechanisms. The model balances accuracy and computational efficiency, making it suitable for deployment on edge devices. Experimental results on the Anti-UAV410 dataset show a higher SA together with fewer FLOPs and parameters than the SiamDT baseline. However, limitations remain: the computational cost is still moderate relative to ultra-lightweight trackers, and performance may fluctuate in extreme scenarios such as severe occlusion. Future work will focus on further optimizing the network architecture and improving robustness across diverse environmental conditions, with the aim of supporting safer and more efficient airspace management.