Advancing Visual Anti-UAV Tracking with Attention and Refinement

The proliferation of Unmanned Aerial Vehicles (UAVs), while offering immense benefits, has introduced significant security risks to civil aviation, critical infrastructure, and public safety. Effective countermeasures, termed anti-UAV systems, are therefore of paramount importance. Among various detection modalities—including audio, radar, and radio frequency—vision-based anti-UAV tracking presents a promising, widely deployable solution. However, it is a task fraught with severe challenges. UAVs are typically small, fast-moving targets that operate against cluttered, dynamic backgrounds like skylines, forests, and urban areas. They frequently undergo drastic scale changes, experience motion blur, become partially or fully occluded, and can even fly out of the camera’s field of view. Furthermore, the requirement for persistent, long-term tracking adds another layer of complexity, demanding algorithms that can recover from failures and drifts.

To address these formidable challenges in visual anti-UAV operations, we propose a robust tracking framework. Our work builds upon the Siamese network paradigm, known for its good balance between accuracy and speed. We introduce two key algorithmic enhancements: a Global Attention Mechanism (GAM) to fortify feature representation against complex backgrounds, and an Alpha-Refine (AR) module for precise bounding box estimation. The core algorithm integrating these components is named SiamGR. Recognizing the limitations of pure trackers in long-term scenarios with target disappearance, we further develop a variant, SiamGR-FR, which strategically fuses a general object detector (Faster R-CNN) to enable target re-localization and tracker resetting. Comprehensive evaluations on the specialized DUT Anti-UAV dataset demonstrate that our methods, particularly SiamGR-FR, establish new state-of-the-art performance, offering a potent reference solution for practical video-based anti-UAV applications.

The Challenges and the Benchmark for Anti-UAV Tracking

Developing robust vision-based anti-UAV systems requires confronting a specific set of environmental and target-based difficulties not always prominent in generic tracking benchmarks. The primary hurdles include:

Small Target Size & Low Resolution: Drones often occupy a minuscule number of pixels, especially at long ranges, providing limited discriminative features.
Complex Background Clutter (Background Transformation): A UAV may fly from a clear sky into a forested area or behind architectural structures, requiring the tracker to distinguish it from highly similar background patches.
Severe Occlusion: Temporary occlusion by trees, buildings, or other objects is common, during which the tracker must maintain the target’s state.
Abrupt Motion & Motion Blur: Drones can change direction and speed quickly, leading to motion blur in video frames that obscures target appearance.
Out-of-View & Long-Term Tracking: A critical anti-UAV scenario involves the target leaving the camera’s frame and potentially re-entering later. A tracker must either gracefully handle this or be re-initialized upon re-appearance.
Scale Variation: As the drone approaches or recedes, its apparent size changes dynamically.

To foster research in this domain, the DUT Anti-UAV dataset was introduced. It is a large-scale, visible-light dataset specifically curated for anti-UAV tracking and detection. It contains 20 challenging video sequences (24,804 frames in total) captured in diverse low-altitude environments, annotated with challenge attributes such as those listed above. This dataset serves as the standard benchmark for evaluating an algorithm’s resilience in real-world anti-UAV conditions.

Methodology: The SiamGR and SiamGR-FR Framework

Our approach is grounded in the anchor-free Siamese tracker SiamCAR, chosen for its simplicity and effectiveness. We sequentially augment this baseline with modules designed to tackle specific anti-UAV weaknesses.

Backbone Enhancement with Global Attention Mechanism (GAM)

To improve feature discriminability in cluttered anti-UAV scenes, we integrate the Global Attention Mechanism (GAM) into the backbone network (ResNet-50). GAM aims to limit information loss and amplify global cross-dimension interactions. It applies a channel attention sub-module followed by a spatial attention sub-module.

Given an input feature map $F_1 \in \mathbb{R}^{C \times H \times W}$, the channel attention module $M_c$ first applies a 3D permutation to preserve information across dimensions, followed by a multi-layer perceptron (MLP) with a reduction ratio $r$ to model channel-wise dependencies:

$$F_2 = M_c(F_1) \otimes F_1$$

Subsequently, the spatial attention module $M_s$ focuses on spatial information using two convolutional layers for fusion, deliberately avoiding pooling operations to retain details:

$$F_3 = M_s(F_2) \otimes F_2$$

Here, $\otimes$ denotes element-wise multiplication. By embedding GAM in the deeper layers of the backbone, we enhance the model’s capacity to focus on the drone target while suppressing irrelevant background noise, a crucial capability for anti-UAV tracking.
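The channel-attention step $F_2 = M_c(F_1) \otimes F_1$ can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the ReLU and sigmoid activations, the bias-free layers, and the names `channel_attention`, `gate`, `w1`, and `w2` are assumptions made for illustration, and the spatial step $M_s$ (which would apply the same gating pattern with convolutions) is omitted.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def linear(vec, weight):
    # weight has one row per output unit; biases are omitted for brevity.
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]


def channel_attention(feature, w1, w2):
    """Map feature[c][y][x] to an attention tensor of the same shape.

    Assumed structure: at every spatial position, the C-dimensional channel
    vector goes through a two-layer MLP (C -> C/r -> C) and a sigmoid.
    """
    C, H, W = len(feature), len(feature[0]), len(feature[0][0])
    attention = [[[0.0] * W for _ in range(H)] for _ in range(C)]
    for y in range(H):
        for x in range(W):
            # "Permutation": read the channel vector at position (y, x).
            vec = [feature[c][y][x] for c in range(C)]
            hidden = [max(0.0, h) for h in linear(vec, w1)]  # C -> C/r, ReLU
            out = linear(hidden, w2)                         # C/r -> C
            for c in range(C):
                attention[c][y][x] = sigmoid(out[c])
    return attention


def gate(attention, feature):
    """Element-wise product, i.e. the operator written as ⊗ in the text."""
    return [
        [[a * f for a, f in zip(a_row, f_row)] for a_row, f_row in zip(a_ch, f_ch)]
        for a_ch, f_ch in zip(attention, feature)
    ]


random.seed(0)
C, H, W, r = 4, 2, 3, 2  # toy sizes; r is the reduction ratio
F1 = [[[random.uniform(-1, 1) for _ in range(W)] for _ in range(H)] for _ in range(C)]
w1 = [[random.uniform(-1, 1) for _ in range(C)] for _ in range(C // r)]
w2 = [[random.uniform(-1, 1) for _ in range(C // r)] for _ in range(C)]
F2 = gate(channel_attention(F1, w1, w2), F1)
```

Because the attention values lie strictly between 0 and 1, the gate can only attenuate a feature, never amplify it or flip its sign; the output keeps the input shape $C \times H \times W$.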

Precise Localization with Alpha-Refine (AR) Module

Accurate bounding box estimation is vital, especially for small UAV targets. The standard SiamCAR head may produce coarse predictions. We append an Alpha-Refine (AR) module to refine these initial results. AR operates on a smaller search region (2x the target size instead of the common 4x), which reduces background distraction and computation.

Its core innovation is pixel-wise correlation. Let $K \in \mathbb{R}^{C \times H_0 \times W_0}$ be the template feature and $S \in \mathbb{R}^{C \times H \times W}$ be the search feature. $K$ is decomposed into $H_0W_0$ tiny kernels $K_j \in \mathbb{R}^{C \times 1 \times 1}$. A correlation map $C \in \mathbb{R}^{H_0W_0 \times H \times W}$ is computed as:

$$C = \{ C_j | C_j = K_j * S \}_{j \in \{1,2,\dots,H_0 \times W_0\}}$$

where $*$ denotes naive correlation. This ensures each slice of the correlation map encodes information for a specific local region of the target, preserving fine-grained spatial details that are lost in standard depth-wise correlation.
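The pixel-wise correlation defined above can be sketched in a few lines of pure Python. This is an illustrative version on nested lists, not the authors' code; the function name `pixelwise_correlation` and the toy tensors are assumptions. Each template position yields one $C \times 1 \times 1$ kernel, and correlating it with the search feature reduces to a per-pixel dot product over channels.

```python
def pixelwise_correlation(template, search):
    """template[c][i][j] (C x H0 x W0), search[c][y][x] (C x H x W)
    -> corr[k][y][x] with k running over the H0*W0 template positions."""
    C = len(template)
    H0, W0 = len(template[0]), len(template[0][0])
    H, W = len(search[0]), len(search[0][0])
    corr = []
    for i in range(H0):
        for j in range(W0):
            # K_j: the channel vector at one template position (C x 1 x 1).
            kernel = [template[c][i][j] for c in range(C)]
            # A 1x1 kernel correlated with S is a dot product at every pixel.
            corr.append([
                [sum(kernel[c] * search[c][y][x] for c in range(C)) for x in range(W)]
                for y in range(H)
            ])
    return corr


# Toy example: C=2, template 1x2, search 2x2.
template = [[[1.0, 0.0]], [[0.0, 1.0]]]
search = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]
corr = pixelwise_correlation(template, search)
```

In this toy case the first kernel is $(1, 0)$ and the second is $(0, 1)$, so the two output slices reproduce the first and second search channels respectively, and the output has $H_0 W_0 = 2$ slices of size $H \times W$.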

The AR prediction head is a lightweight “corner head.” It consists of several stacked convolutional layers that output two heatmaps predicting the top-left and bottom-right corners of the target, leading to more accurate box coordinates for the often compact anti-UAV target.
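To show how two corner heatmaps translate into a box, the sketch below decodes them with a hard argmax. This decoding rule is an assumption chosen for simplicity (a trained head may use a differentiable variant instead), and `argmax_2d`, `decode_box`, and the `stride` parameter are hypothetical names.

```python
def argmax_2d(heatmap):
    """Return (x, y) of the largest value in heatmap[y][x]."""
    best_x, best_y = 0, 0
    for y, row in enumerate(heatmap):
        for x, value in enumerate(row):
            if value > heatmap[best_y][best_x]:
                best_x, best_y = x, y
    return best_x, best_y


def decode_box(top_left_map, bottom_right_map, stride=1.0):
    """Convert two corner heatmaps into (x1, y1, x2, y2) in search-region pixels.

    stride is the assumed ratio between heatmap cells and image pixels.
    """
    x1, y1 = argmax_2d(top_left_map)
    x2, y2 = argmax_2d(bottom_right_map)
    return (x1 * stride, y1 * stride, x2 * stride, y2 * stride)


top_left = [
    [0.1, 0.0, 0.0, 0.0],
    [0.0, 0.9, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
bottom_right = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.2],
    [0.0, 0.0, 0.8, 0.0],
]
box = decode_box(top_left, bottom_right, stride=8.0)
```

Here the top-left peak sits at cell (1, 1) and the bottom-right peak at cell (2, 3), so with a stride of 8 the decoded box is (8.0, 8.0, 16.0, 24.0).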

Fusion with Detection for Long-Term Anti-UAV Tracking: SiamGR-FR

For the critical long-term anti-UAV scenarios involving out-of-view events or prolonged occlusion, a tracker-only strategy is prone to irreversible drift. Our variant, SiamGR-FR, introduces a detection-assisted recovery mechanism. It employs a “tracking-by-detection” strategy in which a Faster R-CNN detector, specifically trained on drone data, acts as a sentinel.

The algorithm workflow is as follows: The SiamGR tracker processes every frame. Periodically (e.g., every 60 frames), the Faster R-CNN detector performs a global scan of the current frame, producing a set of candidate bounding boxes $\{d_{bboxes}\}$ with confidence scores $\{d_{scores}\}$. If the maximum detection score $d_{score}^{max}$ exceeds a pre-defined threshold $T_s$, the corresponding detection box is used to update the tracker’s output and can reset the tracker’s internal state. Otherwise, the tracker’s prediction is trusted. This fusion allows SiamGR-FR to re-acquire a lost anti-UAV target, dramatically improving robustness in long-term sequences.
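The decision rule in this workflow can be written as a small function. The sketch below is a hedged illustration: `fuse_step` is a hypothetical name, the detections are passed in as plain `(box, score)` pairs rather than coming from a real Faster R-CNN, and the default `score_threshold=0.9` is a placeholder, since the text does not state the value of $T_s$.

```python
def fuse_step(frame_idx, tracker_box, detections, interval=60, score_threshold=0.9):
    """Decide the output box for one frame.

    tracker_box: box predicted by the tracker for this frame.
    detections:  list of (box, score) pairs from the detector.
    Returns (output_box, reset_tracker).
    """
    is_detection_frame = frame_idx > 0 and frame_idx % interval == 0
    if not is_detection_frame or not detections:
        return tracker_box, False  # trust the tracker

    best_box, best_score = max(detections, key=lambda d: d[1])
    if best_score > score_threshold:
        return best_box, True      # overwrite output and reset tracker state
    return tracker_box, False      # detector not confident enough


tracker_box = (100, 100, 120, 115)
detected_box = (300, 40, 322, 58)

off_cycle = fuse_step(30, tracker_box, [(detected_box, 0.99)])
confident = fuse_step(60, tracker_box, [(detected_box, 0.99), ((5, 5, 9, 9), 0.5)])
uncertain = fuse_step(60, tracker_box, [(detected_box, 0.5)])
```

In a full pipeline the detector would only be invoked on detection frames, so the cost of the global scan is paid once per interval rather than on every frame.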

Experimental Evaluation and Analysis

We conduct extensive experiments on the DUT Anti-UAV dataset under the One-Pass Evaluation (OPE) protocol. Performance is measured by Success Rate (the area under the success plot, which gives the fraction of frames whose overlap ratio with the ground truth exceeds a threshold as that threshold varies from 0 to 1) and Precision Rate (the percentage of frames where the center location error is below a threshold, typically 20 pixels).
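As a concrete reading of these two metrics, the following sketch computes them for a list of predicted and ground-truth boxes. It is a simplified illustration rather than the benchmark's official toolkit: the 21 evenly spaced overlap thresholds, the strict `>` comparison, and averaging the curve to approximate the area under it are assumptions, as are the function names.

```python
import math


def iou(a, b):
    """Intersection over union for boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def center_error(a, b):
    """Euclidean distance between box centers, in pixels."""
    return math.hypot((a[0] + a[2]) / 2 - (b[0] + b[2]) / 2,
                      (a[1] + a[3]) / 2 - (b[1] + b[3]) / 2)


def success_rate(preds, gts, num_thresholds=21):
    """Approximate area under the success plot by averaging the curve."""
    overlaps = [iou(p, g) for p, g in zip(preds, gts)]
    thresholds = [i / (num_thresholds - 1) for i in range(num_thresholds)]
    curve = [sum(o > t for o in overlaps) / len(overlaps) for t in thresholds]
    return sum(curve) / len(curve)


def precision_rate(preds, gts, pixel_threshold=20.0):
    """Fraction of frames whose center error is within pixel_threshold."""
    errors = [center_error(p, g) for p, g in zip(preds, gts)]
    return sum(e <= pixel_threshold for e in errors) / len(errors)


gts = [(0, 0, 10, 10), (0, 0, 10, 10)]
preds = [(0, 0, 10, 10), (100, 100, 110, 110)]  # one perfect frame, one miss
sr = success_rate(preds, gts)
pr = precision_rate(preds, gts)
```

With one perfect frame and one complete miss, the precision rate is 0.5 and the success rate is 10/21 (about 0.476), because the perfect frame passes every threshold except 1.0.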

Overall Performance Comparison

We compare SiamGR and SiamGR-FR against several strong baselines, including SiamRPN++, DiMP, SiamAttn, SiamCAR, SiamAPN++, and SiamRPN++-RBO. The results unequivocally demonstrate the effectiveness of our proposed methods in the anti-UAV context.

Overall Performance on DUT Anti-UAV Dataset
Tracker	Success Rate	Precision Rate
SiamRPN	0.392	0.701
SiamAPN++	0.426	0.661
SiamCAR (Baseline)	0.562	0.808
SiamAttn	0.568	0.810
DiMP	0.577	0.830
SiamRPN++-RBO	0.589	0.828
SiamGR (Ours)	0.615	0.842
SiamGR-FR (Ours)	0.662	0.946

SiamGR alone achieves a success rate of 0.615 and a precision of 0.842, outperforming the baseline SiamCAR by 5.3 and 3.4 percentage points, respectively, and ranking second among all trackers. This validates the contribution of the GAM and AR modules. The SiamGR-FR variant sets a new state of the art, with a success rate of 0.662 and a precision rate of 0.946, surpassing the baseline by 10.0 and 13.8 percentage points. This gain, especially in precision, highlights the impact of integrating a detector for recovery in challenging anti-UAV tracking tasks.

Ablation Study

To dissect the contribution of each component, we perform an ablation study starting from the SiamCAR baseline. The results are summarized below:

Ablation Study on Component Contributions
Method	Success Rate	Precision Rate
SiamCAR	0.562	0.808
+ GAM	0.577 (+1.5%)	0.825 (+1.7%)
+ Alpha-Refine	0.592 (+3.0%)	0.818 (+1.0%)
+ Faster R-CNN (FR)	0.644 (+8.2%)	0.923 (+11.5%)
SiamGR-FR (Full)	0.662	0.946

The study confirms that GAM and Alpha-Refine each provide a consistent gain over the baseline. The fusion with the detector (FR) yields the most substantial individual improvement, underscoring its importance for handling anti-UAV failure modes. The full SiamGR-FR model, which combines all three components, achieves the best overall result (0.662 success rate, 0.946 precision rate).

Performance Under Specific Anti-UAV Challenges

A key advantage of the DUT Anti-UAV dataset is its fine-grained attribute annotations. We analyze how different algorithms perform under each specific challenge inherent to anti-UAV operations. The following table summarizes the success rates for key challenges:

Success Rate Analysis Across Different Anti-UAV Challenges
Challenge	SiamCAR	SiamGR	SiamGR-FR	Best Competitor
Background Transformation	0.579	0.630	0.652	0.606 (SiamRPN++-RBO)
Scale Variation	0.513	0.575	0.632	0.548 (SiamRPN++-RBO)
Low Resolution	0.566	0.678	0.731	0.638 (DiMP)
Out-of-View	0.400	0.481	0.613	0.524 (DiMP)
Fast Motion	0.519	0.612	0.652	0.563 (DiMP)
Small Target	0.565	0.619	0.629	0.586 (SiamRPN++-RBO)
Long-Term Tracking	0.496	0.550	0.618	0.519 (SiamRPN++-RBO)
Occlusion	0.518	0.562	0.618	0.528 (DiMP)

The analysis reveals that SiamGR consistently outperforms the baseline SiamCAR across all challenges, indicating the general effectiveness of our attention and refinement modules. Most strikingly, SiamGR-FR achieves the highest success rate in every challenge category. Its advantage is most pronounced in scenarios like “Out-of-View” and “Long-Term Tracking,” where the detector-based recovery mechanism is directly exploited. This consistent advantage under diverse anti-UAV-specific conditions supports the practical value of our proposed framework.
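The margins behind these statements can be recomputed directly from the table. The short script below only re-derives differences from the success rates reported above; the variable names are illustrative.

```python
# Success rates copied from the table: (SiamCAR, SiamGR, SiamGR-FR, best competitor).
rows = {
    "Background Transformation": (0.579, 0.630, 0.652, 0.606),
    "Scale Variation": (0.513, 0.575, 0.632, 0.548),
    "Low Resolution": (0.566, 0.678, 0.731, 0.638),
    "Out-of-View": (0.400, 0.481, 0.613, 0.524),
    "Fast Motion": (0.519, 0.612, 0.652, 0.563),
    "Small Target": (0.565, 0.619, 0.629, 0.586),
    "Long-Term Tracking": (0.496, 0.550, 0.618, 0.519),
    "Occlusion": (0.518, 0.562, 0.618, 0.528),
}

# Absolute margins of SiamGR-FR, rounded to the table's three decimals.
margin_over_baseline = {k: round(fr - base, 3) for k, (base, gr, fr, best) in rows.items()}
margin_over_best = {k: round(fr - best, 3) for k, (base, gr, fr, best) in rows.items()}

largest_vs_baseline = max(margin_over_baseline, key=margin_over_baseline.get)
largest_vs_best = max(margin_over_best, key=margin_over_best.get)
fr_wins_everywhere = all(fr > max(base, gr, best) for base, gr, fr, best in rows.values())
```

With these values, the largest margin over the SiamCAR baseline falls on Out-of-View (0.213), and the largest margin over the best competing tracker falls on Long-Term Tracking (0.099), which matches the two scenarios singled out in the text.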

Qualitative Results

Visual inspection of tracking sequences further illustrates the strengths of our approach. In sequences where the drone undergoes rapid background changes (e.g., from sky to buildings), SiamGR maintains more stable tracking than SiamCAR, thanks to the GAM-enhanced features. The bounding boxes from SiamGR are also noticeably tighter due to the AR module. In sequences where the drone flies out of the frame and later re-appears, or in long-term sequences with full occlusion, SiamCAR and eventually SiamGR lose the target and drift. In contrast, SiamGR-FR successfully re-detects the drone using the Faster R-CNN module and resets the tracker, recovering accurate tracking. These observations are consistent with the quantitative results and demonstrate robust anti-UAV tracking capability in real-world scenarios.

Conclusion and Future Work

In this work, we have addressed the critical and challenging problem of vision-based anti-UAV tracking. We proposed SiamGR, a novel tracker that strengthens feature representation in complex backgrounds via a Global Attention Mechanism and achieves precise target localization through an Alpha-Refine module. To tackle the inherent limitations of trackers in long-term scenarios, we introduced SiamGR-FR, a powerful variant that fuses a dedicated object detector for periodic target verification and recovery. Extensive experiments on the demanding DUT Anti-UAV benchmark demonstrate that our methods, particularly SiamGR-FR, set a new state-of-the-art, significantly outperforming existing trackers across a multitude of specific anti-UAV challenges such as background clutter, scale change, low resolution, out-of-view events, and long-term tracking.

This research provides a strong reference point for developing practical video-based anti-UAV systems. For future work, several promising directions exist. First, the tracking framework itself can be further improved, for instance, by exploring more advanced attention mechanisms or temporal modeling to better handle severe occlusion. Second, the fusion strategy between the tracker and detector can be refined; an adaptive mechanism that triggers detection based on tracker confidence, rather than a fixed interval, could improve efficiency. Finally, testing the proposed fusion paradigm with other state-of-the-art trackers and detectors could lead to even more robust and generalizable anti-UAV solutions.