A UAV Object Tracking Algorithm Based on Frequency-Domain Feature and Transformer

With the rapid advancement of UAV (Unmanned Aerial Vehicle) technology, applications in surveillance, reconnaissance, and delivery have proliferated. A core enabling technology for these applications is robust visual object tracking, which allows a drone to autonomously follow a target of interest. However, UAV-based tracking presents unique and severe challenges not commonly found in ground-based scenarios. The dynamic motion of the drone itself, coupled with complex backgrounds like urban landscapes or dense foliage, often leads to frequent target occlusion, deformation, significant scale variation, and drastic changes in viewing perspective. These factors can easily cause traditional tracking algorithms to fail or drift, highlighting the critical need for stable and efficient tracking algorithms specifically designed for the UAV domain.

In recent years, deep learning models have revolutionized computer vision, including object tracking. Among them, Transformer architectures, with their powerful self-attention mechanisms, have shown remarkable success in capturing long-range dependencies and global context. Trackers like OSTrack, MixFormer, and SwinTrack leverage Transformers to model the intricate relationship between a target template and a search region, achieving state-of-the-art performance. However, these methods primarily focus on global feature interaction through attention, often lacking explicit modeling of local structural details such as edges and textures. This limitation can impair performance when the UAV-captured target undergoes partial occlusion, non-rigid deformation, or when its appearance details are crucial for discrimination in cluttered environments. Furthermore, while online template update strategies are employed to adapt to appearance changes, they often rely on complex confidence estimation modules, increasing computational cost and model complexity, which is a concern for real-time drone operations.

To address the limitations of global attention and to enhance the model’s ability to handle the specific challenges of UAV tracking, we integrate insights from frequency-domain analysis. The frequency domain provides a complementary perspective to the spatial domain, where high-frequency components often correspond to edges and fine textures, while low-frequency components represent smoother areas and global structure. By operating in the frequency domain, we can more directly manipulate and enhance these informative features. Inspired by this, we propose a novel UAV object tracking algorithm that synergistically combines a distilled Transformer backbone with an Adaptive Frequency-Domain Perception Network. The core innovation lies in a multi-view invariant feature learning strategy driven by mutual information maximization, enabling robust tracking under perspective changes commonly encountered by a moving drone.

The proposed framework consists of several key components. First, we utilize a lightweight Transformer encoder, distilled from a larger model, to efficiently extract rich global spatial features from the template, search region, and an additional “learned image.” This learned image is initialized from the search region and serves as a dynamic representation that captures the correlation between the target and its context, providing a more robust template for tracking. Second, the extracted feature sequences are fed into our novel Adaptive Frequency-Domain Perception Network. This module applies a 2D Fast Fourier Transform (FFT) to project the features into the frequency domain. Here, an adaptive learnable filter, initialized and refined using both the template and learned image features, performs element-wise weighting to amplify target-related frequencies (e.g., edge components) and suppress background noise. An inverse FFT then projects the enhanced features back to the spatial domain for subsequent processing.
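The FFT, element-wise weighting, and inverse FFT steps described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the paper's implementation: the function name `frequency_filter`, the `(C, H, W)` feature layout, and the use of a plain weight array in place of the adaptive filter (which the paper initializes and refines from template and learned image features) are all hypothetical.

```python
import numpy as np

def frequency_filter(features, weights):
    """Weight a real feature map element-wise in the 2D frequency domain.

    features: real array of shape (C, H, W).
    weights:  array of shape (C, H, W // 2 + 1), a stand-in for the
              learnable filter (hypothetical; the paper's filter is adaptive).
    """
    h, w = features.shape[-2:]
    # Real-input 2D FFT over the two spatial axes: spatial -> frequency.
    spectrum = np.fft.rfft2(features, axes=(-2, -1))
    # Element-wise weighting: amplify or suppress individual frequencies.
    filtered = spectrum * weights
    # Inverse FFT back to a real spatial map of the original size.
    return np.fft.irfft2(filtered, s=(h, w), axes=(-2, -1))

rng = np.random.default_rng(0)
feats = rng.standard_normal((4, 16, 16))

# An all-ones filter is the identity: the features come back unchanged.
identity = np.ones((4, 16, 9))
restored = frequency_filter(feats, identity)

# Illustrative high-pass filter: zeroing the DC bin removes each channel's mean.
high_pass = identity.copy()
high_pass[:, 0, 0] = 0.0
edges_only = frequency_filter(feats, high_pass)
```

In a trainable version the weight array would be a parameter updated by backpropagation; the fixed arrays here only show how a filter reshapes the spectrum.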

The third and pivotal component is the multi-view invariant feature learning strategy. A fundamental challenge for a UAV is maintaining a consistent target representation despite significant viewpoint changes. We formulate this as a mutual information maximization problem between the template feature set and the corresponding target area within the search feature set. By maximizing their mutual information, we encourage the network to learn features that are invariant to the viewpoint changes induced by the drone’s movement. We employ the MINE (Mutual Information Neural Estimation) estimator to estimate this mutual information and derive from it a loss term, denoted as $S_{vir}$, which is added to the overall training objective. Because this term is used only during training, it improves robustness without adding any overhead at inference.
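MINE estimates mutual information through the Donsker-Varadhan lower bound, $I(X;Z) \ge \mathbb{E}_{p(x,z)}[T] - \log \mathbb{E}_{p(x)p(z)}[e^{T}]$, where $T$ is a critic function. The sketch below evaluates that bound on synthetic data. It is illustrative only: the bilinear critic, the fixed matrix `W`, the toy Gaussian data, and the sign convention for the loss are assumptions, whereas in MINE the critic is a neural network trained jointly with the feature extractor.

```python
import numpy as np

def dv_bound(scores_joint, scores_marginal):
    """Donsker-Varadhan lower bound on mutual information from critic scores."""
    scores_joint = np.asarray(scores_joint, dtype=float)
    scores_marginal = np.asarray(scores_marginal, dtype=float)
    # Numerically stable log-mean-exp over the marginal scores.
    m = scores_marginal.max()
    log_mean_exp = m + np.log(np.mean(np.exp(scores_marginal - m)))
    return scores_joint.mean() - log_mean_exp

def bilinear_critic(x, z, W):
    """Hypothetical critic T(x, z) = x^T W z, evaluated row by row."""
    return np.einsum("nd,de,ne->n", x, W, z)

rng = np.random.default_rng(0)
n, d = 4096, 8
x = rng.standard_normal((n, d))             # stand-in for template features
z = x + 0.5 * rng.standard_normal((n, d))   # stand-in for matching search features
z_shuffled = z[rng.permutation(n)]          # shuffling breaks the pairing

W = 0.1 * np.eye(d)                         # fixed critic weights (not trained here)
mi_lower = dv_bound(bilinear_critic(x, z, W),
                    bilinear_critic(x, z_shuffled, W))

# Assumed sign convention: minimizing the negative bound maximizes the estimate.
loss_vir = -mi_lower
```

Paired samples score higher than shuffled ones under this critic, so the bound is positive; training the critic would tighten it toward the true mutual information.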

The overall training loss combines the standard tracking losses with our novel mutual information loss:

$$ L_{total} = \lambda_{L1}L_1 + \lambda_{GIoU}L_{GIoU} + \lambda_{cls}L_{cls} + \lambda_{vir}S_{vir} $$

where $L_1$ and $L_{GIoU}$ are bounding box regression losses, $L_{cls}$ is the Gaussian-weighted focal classification loss, $S_{vir}$ is the mutual information loss described above, and the $\lambda$ terms are balancing weights. During inference, the final target bounding box is predicted based on the output features from the network head.
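As a rough illustration of how the four terms combine, the sketch below computes the two box regression terms for a single prediction and forms the weighted sum. The `(x1, y1, x2, y2)` box format, the function names, and the weight values are placeholders chosen for this example and are not taken from the paper; the classification and mutual information terms are passed in as precomputed scalars.

```python
def giou(box_a, box_b):
    """Generalized IoU of two boxes in (x1, y1, x2, y2) format."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    # Area of the smallest axis-aligned box enclosing both inputs.
    hull = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    return inter / union - (hull - union) / hull

def total_loss(pred_box, gt_box, cls_loss, vir_loss, weights):
    """Weighted sum of L1, GIoU, classification, and MI loss terms."""
    l1 = sum(abs(p - g) for p, g in zip(pred_box, gt_box)) / 4.0
    l_giou = 1.0 - giou(pred_box, gt_box)
    return (weights["l1"] * l1 + weights["giou"] * l_giou
            + weights["cls"] * cls_loss + weights["vir"] * vir_loss)

# Placeholder weights for illustration only (not reported in the source).
weights = {"l1": 5.0, "giou": 2.0, "cls": 1.0, "vir": 1.0}
loss = total_loss((0.1, 0.1, 0.5, 0.5), (0.0, 0.0, 0.5, 0.5),
                  cls_loss=0.3, vir_loss=-0.2, weights=weights)
```

The GIoU term stays informative even when the boxes do not overlap, which is why it is commonly paired with the L1 term.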

We conduct extensive experiments on major UAV and general tracking benchmarks. The proposed method is evaluated on the UAV123, OTB100, and LaSOT datasets. Performance is measured using standard metrics: Precision (P), Success Rate (SR), and Average Overlap (AO). We also report frames per second (FPS) to assess practical usability on drone platforms.
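For readers unfamiliar with these metrics, the following sketch shows how they are conventionally computed from per-frame results. The function names and the toy values are hypothetical, and the 20-pixel precision threshold is the common benchmark convention rather than a setting confirmed by this paper.

```python
import numpy as np

def average_overlap(ious):
    """AO: mean IoU between predicted and ground-truth boxes over all frames."""
    return float(np.mean(ious))

def success_rate(ious, threshold=0.5):
    """SR at a threshold: fraction of frames whose IoU exceeds the threshold."""
    return float(np.mean(np.asarray(ious) > threshold))

def precision(center_errors, threshold=20.0):
    """Precision: fraction of frames whose center error is within the threshold (pixels)."""
    return float(np.mean(np.asarray(center_errors) <= threshold))

# Toy per-frame results for a four-frame sequence (made-up values).
ious = [0.75, 0.5, 0.25, 1.0]
center_errors = [5.0, 25.0, 10.0, 40.0]

ao = average_overlap(ious)          # 0.625
sr_50 = success_rate(ious)          # 0.5 (two of four frames exceed 0.5)
prec_20 = precision(center_errors)  # 0.5 (two of four frames within 20 px)
```

Benchmark toolkits typically average these values over all sequences, and success is often summarized as the area under the curve across thresholds rather than at a single value.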

An ablation study on GOT-10K validates the contribution of each component, as summarized in the table below:

Exp.	Baseline	Freq. Net	MI Loss	Learn. Image	AO (%)	SR_0.5 (%)	Params (M)	FPS
1	Yes	No	No	No	71.7	80.2	20	165
2	Yes	Yes	No	No	73.3	83.4	26	152
3	Yes	Yes	Yes	No	74.5	84.8	27	151
4	No	Yes	Yes	Yes	74.6	84.2	24	156

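The results show that introducing the frequency-domain network (Exp. 2) raises AO from 71.7% to 73.3% and SR_0.5 from 80.2% to 83.4% over the baseline, at the cost of 6M additional parameters and a drop from 165 to 152 FPS. Adding the mutual information loss for multi-view invariance (Exp. 3) provides a further gain of 1.2 points in AO and 1.4 points in SR_0.5 with almost no change in speed. Finally, replacing the complex online update with the simpler learned image mechanism (Exp. 4) keeps AO essentially unchanged (74.6%) with a slight drop in SR_0.5 (84.2%), while reducing parameters from 27M to 24M and raising speed from 151 to 156 FPS, a trade-off well suited to UAV applications.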

We compare our method with several state-of-the-art trackers, including those designed for UAVs. The comparative results on multiple benchmarks are shown below:

Tracker	UAV123 (Prec.)	UAV123 (Succ.)	LaSOT (Prec.)	LaSOT (Succ.)	FPS
Ours	90.2	68.8	66.5	64.0	156
SpectralTracker	78.2	67.4	69.1	65.7	52
SMAT	81.8	64.6	64.6	61.7	124
TCTrack	80.8	60.6	–	–	139
AutoTrack	68.9	47.2	–	–	58
FEAT-L	86.6	65.8	60.9	57.9	80

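The proposed algorithm achieves leading performance on UAV123 and competitive performance on LaSOT. On the challenging UAV123 benchmark, which contains numerous sequences with fast motion and viewpoint changes, our method reaches a precision of 90.2, which is 3.6 points above the next best tracker (FEAT-L, 86.6), and also obtains the highest success score (68.8). On LaSOT it ranks second to SpectralTracker in both precision (66.5 vs. 69.1) and success (64.0 vs. 65.7), while running three times faster (156 vs. 52 FPS). Its frame rate, the highest among the compared trackers, indicates its suitability for real-time deployment on UAV platforms.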

In conclusion, this work presents a robust and efficient object tracking algorithm tailored for the demanding environment of UAV operations. The main contributions are threefold: 1) A novel feature extraction network that integrates a distilled Transformer for global context with an Adaptive Frequency-Domain Perception Network to capture crucial local detail features, enhancing robustness to occlusion and deformation. 2) A multi-view invariant feature learning strategy based on mutual information maximization, which guides the network to learn stable representations against perspective changes caused by drone movement. 3) A streamlined template update mechanism via a learned image, which reduces complexity while preserving adaptability. Extensive experiments confirm that the proposed algorithm achieves strong tracking accuracy and robustness for UAV applications while maintaining high inference speed, offering a practical solution for real-world drone tracking tasks.