Anti-Drone Object Tracking with Attention and Refinement Modules

In recent years, the proliferation of unmanned aerial vehicles (UAVs), commonly known as drones, has introduced significant benefits across various sectors, including logistics, surveillance, and entertainment. However, this widespread adoption also poses substantial risks to aviation safety, military security, and personal privacy. Consequently, the development of effective anti-drone technologies has become a critical research focus. Among these, video-based anti-drone systems offer a cost-effective and accessible solution for widespread deployment. This paper addresses the challenges in anti-drone object tracking, such as background clutter, inaccurate localization, targets exiting the frame, and long-term tracking, by proposing enhanced algorithms based on deep learning.

Object tracking is a fundamental task in computer vision, aiming to locate and follow a target of interest across consecutive video frames. Traditional methods often struggle with the dynamic and complex nature of anti-drone scenarios, where drones exhibit rapid motion, scale variations, occlusion, and interactions with cluttered environments like urban skylines or forested areas. Deep learning-based trackers, particularly those using Siamese networks, have shown promise due to their balance of accuracy and efficiency. In this work, we present two novel algorithms: SiamGR, which integrates a Global Attention Mechanism (GAM) and an Alpha-Refine module into the SiamCAR framework, and its variant SiamGR-FR, which further incorporates the Faster-RCNN object detector for re-localization. Our contributions aim to enhance feature representation, refine bounding box estimation, and improve robustness in long-term anti-drone tracking.

The core of our approach lies in addressing the unique demands of anti-drone operations. Drones are often small, fast-moving objects that can easily blend into backgrounds or disappear from view, making tracking exceptionally challenging. By leveraging attention mechanisms, we enhance the model’s ability to focus on relevant features amidst distractions. Additionally, refinement modules allow for precise localization, which is crucial for small targets like drones. The integration of object detection provides a fallback mechanism when tracking fails, ensuring continuous monitoring in anti-drone applications. This holistic design is evaluated on the DUT Anti-UAV dataset, a benchmark specifically designed for anti-drone tracking, where our methods demonstrate superior performance compared to state-of-the-art algorithms.

To provide context, we review related work in object tracking and anti-drone technologies. Siamese network-based trackers, such as SiamFC, SiamRPN++, and SiamBAN, have set benchmarks in generic tracking tasks. However, these methods are not optimized for anti-drone scenarios, which involve distinct challenges like low-resolution imagery, frequent occlusions, and long-term sequences. Previous anti-drone efforts have explored audio, radar, and RF detection, but video-based approaches remain underexplored. Our work bridges this gap by tailoring tracking algorithms to the specific needs of anti-drone surveillance, emphasizing adaptability and robustness in real-world environments.

Our methodology begins with an overview of the SiamGR framework. We use ResNet-50 as the backbone network for feature extraction from template and search frames. The GAM module is inserted into the backbone to amplify cross-dimensional interactions, reducing information loss and enhancing feature representation in complex anti-drone backgrounds. The Alpha-Refine module then processes the initial tracker output to produce more accurate bounding boxes via pixel-wise correlation and corner prediction heads. For SiamGR-FR, we integrate the Faster-RCNN detector, which periodically scans the frame to re-detect drones and reset the tracker, addressing issues like target loss in long-term anti-drone tracking.

Mathematically, the GAM module consists of channel and spatial attention submodules. Given an input feature map $ F_1 \in \mathbb{R}^{C \times H \times W} $, the channel attention map $ M_c $ is computed as:

$$ M_c(F_1) = \sigma(\text{MLP}(\text{Permute}(F_1))) $$

where $\sigma$ denotes the sigmoid function, MLP is a multi-layer perceptron with a reduction ratio $ r $, and Permute rearranges dimensions to preserve information. The output $ F_2 $ is:

$$ F_2 = M_c(F_1) \otimes F_1 $$

Here, $ \otimes $ represents element-wise multiplication. The spatial attention map $ M_s $ is then applied:

$$ M_s(F_2) = \sigma(\text{Conv}_{7\times7}(\text{Conv}_{7\times7}(F_2))) $$

where each $ \text{Conv}_{7\times7} $ is a convolutional layer with a $ 7 \times 7 $ kernel. The final output $ F_3 $ is:

$$ F_3 = M_s(F_2) \otimes F_2 $$

This process enhances feature discriminability, crucial for anti-drone tracking where backgrounds are often noisy.
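The two attention steps above can be sketched in plain Python on nested lists. This is a minimal illustration under stated assumptions, not the authors' implementation: biases and normalization layers are omitted, the ReLU between the two layers of each submodule is an assumption (the formulas above do not specify the nonlinearity), the weights are random, and the helper names (`channel_attention`, `spatial_attention`, `conv2d_same`, `rand_tensor`) are hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def channel_attention(F1, W1, W2):
    """F2 = M_c(F1) * F1 for F1 of shape [C][H][W].

    W1 ([C//r][C]) and W2 ([C][C//r]) are the two MLP layers (biases omitted).
    """
    C, H, W = len(F1), len(F1[0]), len(F1[0][0])
    F2 = [[[0.0] * W for _ in range(H)] for _ in range(C)]
    for y in range(H):
        for x in range(W):
            # "Permute": read the channel vector at one spatial position,
            # so the MLP mixes channels without pooling over H and W.
            v = [F1[c][y][x] for c in range(C)]
            hidden = [max(0.0, sum(w * a for w, a in zip(row, v))) for row in W1]
            for c in range(C):
                m = sigmoid(sum(w * h for w, h in zip(W2[c], hidden)))
                F2[c][y][x] = m * F1[c][y][x]
    return F2


def conv2d_same(F, K):
    """Zero-padded cross-correlation. F: [Cin][H][W], K: [Cout][Cin][k][k]."""
    H, W = len(F[0]), len(F[0][0])
    k = len(K[0][0])
    pad = k // 2
    out = []
    for kernel in K:
        plane = [[0.0] * W for _ in range(H)]
        for y in range(H):
            for x in range(W):
                total = 0.0
                for c, kc in enumerate(kernel):
                    for dy in range(k):
                        for dx in range(k):
                            yy, xx = y + dy - pad, x + dx - pad
                            if 0 <= yy < H and 0 <= xx < W:
                                total += kc[dy][dx] * F[c][yy][xx]
                plane[y][x] = total
        out.append(plane)
    return out


def spatial_attention(F2, K1, K2):
    """F3 = M_s(F2) * F2 with two stacked 7x7 convolutions (ReLU assumed)."""
    hidden = [[[max(0.0, v) for v in row] for row in p] for p in conv2d_same(F2, K1)]
    logits = conv2d_same(hidden, K2)
    return [[[sigmoid(l) * v for l, v in zip(lrow, vrow)]
             for lrow, vrow in zip(lp, vp)]
            for lp, vp in zip(logits, F2)]


def rand_tensor(rng, *shape):
    """Nested list of small random weights with the given shape."""
    if len(shape) == 1:
        return [rng.uniform(-0.1, 0.1) for _ in range(shape[0])]
    return [rand_tensor(rng, *shape[1:]) for _ in range(shape[0])]


rng = random.Random(0)
C, r, H, W = 4, 2, 5, 5  # toy sizes chosen for illustration only
F1 = rand_tensor(rng, C, H, W)
F2 = channel_attention(F1, rand_tensor(rng, C // r, C), rand_tensor(rng, C, C // r))
F3 = spatial_attention(F2, rand_tensor(rng, C // r, C, 7, 7),
                       rand_tensor(rng, C, C // r, 7, 7))
print(len(F3), len(F3[0]), len(F3[0][0]))  # same C, H, W as the input
```

Because both attention maps pass through a sigmoid, every element of $ F_3 $ is the corresponding element of $ F_1 $ scaled by a factor between 0 and 1, and the feature map keeps its shape.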

The Alpha-Refine module operates on a search region twice the target size, reducing background interference. Let $ K \in \mathbb{R}^{C \times H_0 \times W_0} $ be the template feature and $ S \in \mathbb{R}^{C \times H \times W} $ be the search feature. Pixel-wise correlation decomposes $ K $ into $ H_0 W_0 $ kernels $ K_j \in \mathbb{R}^{C \times 1 \times 1} $, producing correlation maps $ C_j $:

$$ C_j = K_j * S \quad \text{for} \quad j \in \{1, 2, \dots, H_0 \times W_0\} $$

where $ * $ denotes convolution. The aggregated correlation map $ C $ captures local spatial details, improving localization accuracy for small drone targets. The corner head predicts heatmaps for the top-left and bottom-right corners using stacked convolutional layers, enabling precise bounding box estimation in anti-drone scenarios.
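As a rough illustration of the pixel-wise correlation defined above, the following plain-Python sketch treats every spatial position of the template as a $ C \times 1 \times 1 $ kernel and correlates it with the search feature. The function name `pixelwise_correlation`, the row-major ordering of the index $ j $, and the toy values are assumptions made for the example.

```python
def pixelwise_correlation(K, S):
    """K: template feature [C][H0][W0]; S: search feature [C][H][W].

    Returns H0*W0 correlation maps C_j, each of shape [H][W].
    """
    C = len(K)
    H0, W0 = len(K[0]), len(K[0][0])
    H, W = len(S[0]), len(S[0][0])
    maps = []
    for j in range(H0 * W0):
        ky, kx = divmod(j, W0)                     # template position of K_j
        kernel = [K[c][ky][kx] for c in range(C)]  # K_j has shape C x 1 x 1
        maps.append([[sum(kernel[c] * S[c][y][x] for c in range(C))
                      for x in range(W)] for y in range(H)])
    return maps


# Toy example: C=2, template 1x2, search region 2x2.
K = [[[1.0, 2.0]], [[0.0, 1.0]]]
S = [[[1.0, 2.0], [3.0, 4.0]],
     [[10.0, 20.0], [30.0, 40.0]]]
corr = pixelwise_correlation(K, S)
print(len(corr), corr[1])
```

Each output map has the spatial size of the search feature, so stacking the $ H_0 W_0 $ maps keeps per-pixel detail that a single sliding-window correlation would blur.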

For SiamGR-FR, we combine tracking with detection. The tracker SiamGR provides bounding boxes $ \text{tbbox} $ for each frame. Every 60 frames, the Faster-RCNN detector processes the current frame, yielding detection boxes $ \text{dbboxes} $ and confidence scores $ \text{dscores} $. If the maximum score $ \text{dscore} $ exceeds a threshold $ T_s = 0.8 $, the corresponding $ \text{dbbox} $ is used as the tracking result; otherwise, $ \text{tbbox} $ is retained. This hybrid approach ensures robustness in anti-drone applications, especially when drones exit the frame or undergo long occlusions.
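The decision rule above can be sketched as follows. This is a simplified illustration: `tracker_step` and `detector` are hypothetical callables standing in for SiamGR and Faster-RCNN, the choice of which frame indices count as "every 60 frames" is an assumption, and re-initializing the tracker from an accepted detection is omitted.

```python
def select_box(tbbox, dbboxes, dscores, ts=0.8):
    """Return the top detection if its score exceeds ts, else the tracker box."""
    if not dscores:
        return tbbox
    best = max(range(len(dscores)), key=dscores.__getitem__)
    return dbboxes[best] if dscores[best] > ts else tbbox


def track_sequence(frames, tracker_step, detector, interval=60, ts=0.8):
    """Run the tracker on every frame and consult the detector periodically."""
    results = []
    for idx, frame in enumerate(frames):
        box = tracker_step(frame)
        if idx > 0 and idx % interval == 0:  # assumed trigger: frames 60, 120, ...
            dbboxes, dscores = detector(frame)
            box = select_box(box, dbboxes, dscores, ts)
        results.append(box)
    return results


# Stub tracker and detector, used only to exercise the control flow.
def stub_tracker(frame):
    return ("t", frame)


def stub_detector(frame):
    scores = [0.5, 0.9] if frame == 60 else [0.5, 0.7]
    return [("d1", frame), ("d2", frame)], scores


results = track_sequence(list(range(121)), stub_tracker, stub_detector)
print(results[59], results[60], results[120])
```

In this toy run the detection replaces the tracker output only at frame 60, where the best score is above $ T_s $; at frame 120 the best score is below the threshold and the tracker box is kept.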

We evaluate our algorithms on the DUT Anti-UAV dataset, which includes 20 video sequences with 24,804 frames, capturing various anti-drone challenges. Performance is measured using success rate and precision rate, based on intersection-over-union (IoU) and center location error (CLE), respectively. The CLE is defined as:

$$ e = \sqrt{(x_t - x_0)^2 + (y_t - y_0)^2} $$

where $ (x_t, y_t) $ is the predicted center and $ (x_0, y_0) $ is the ground-truth center. The precision rate is the proportion of frames with $ e < 20 $ pixels. The IoU is:

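$$ \text{IoU} = \frac{|A_t \cap A_{gt}|}{|A_t \cup A_{gt}|} $$

where $ A_t $ is the predicted bounding-box region, $ A_{gt} $ is the ground-truth region, and $ |\cdot| $ denotes area. The success rate is the area under the success plot, which gives the fraction of frames whose IoU exceeds a threshold as that threshold varies from 0 to 1.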
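Both metrics can be computed with a few lines of Python. The sketch below assumes boxes are given as `(x, y, w, h)` tuples with a top-left origin, and approximates the area under the success plot by averaging over 21 evenly spaced thresholds; both choices are assumptions for illustration and are not stated in the text above.

```python
import math


def center(box):
    x, y, w, h = box
    return x + w / 2.0, y + h / 2.0


def center_error(pred, gt):
    """Center location error e in pixels."""
    (xt, yt), (x0, y0) = center(pred), center(gt)
    return math.hypot(xt - x0, yt - y0)


def iou(pred, gt):
    """Intersection-over-union of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = pred
    bx, by, bw, bh = gt
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def precision_rate(preds, gts, max_error=20.0):
    """Fraction of frames whose center error is below max_error pixels."""
    hits = sum(center_error(p, g) < max_error for p, g in zip(preds, gts))
    return hits / len(preds)


def success_rate(preds, gts, steps=21):
    """Mean of the success curve over evenly spaced IoU thresholds in [0, 1]."""
    ious = [iou(p, g) for p, g in zip(preds, gts)]
    thresholds = [i / (steps - 1) for i in range(steps)]
    curve = [sum(v > t for v in ious) / len(ious) for t in thresholds]
    return sum(curve) / len(curve)


# Two toy frames: one perfect prediction, one complete miss.
gts = [(0, 0, 10, 10), (0, 0, 10, 10)]
preds = [(0, 0, 10, 10), (100, 100, 10, 10)]
print(precision_rate(preds, gts), success_rate(preds, gts))
```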

Our experimental setup uses an Intel Xeon W-2255 CPU, an NVIDIA RTX 3090 GPU, and PyTorch 1.7.1. SiamGR is trained on combined datasets (DET, VID, COCO, YouTube-BB), while Faster-RCNN in SiamGR-FR is trained on the DUT Anti-UAV detection set. We compare against six state-of-the-art trackers: SiamRPN, DiMP, SiamCAR, SiamAttn, SiamAPN++, and SiamRPN++-RBO. The results demonstrate the effectiveness of our anti-drone tailored designs.

| Algorithm | Success Rate | Precision Rate |
| --- | --- | --- |
| SiamRPN | 0.392 | 0.701 |
| DiMP | 0.577 | 0.830 |
| SiamCAR | 0.562 | 0.808 |
| SiamAttn | 0.568 | 0.810 |
| SiamAPN++ | 0.426 | 0.661 |
| SiamRPN++-RBO | 0.589 | 0.828 |
| SiamGR | 0.615 | 0.842 |
| SiamGR-FR | 0.662 | 0.946 |

As shown in the table, SiamGR achieves a success rate of 61.5% and a precision rate of 84.2%, outperforming all six baseline trackers on both metrics. SiamGR-FR further improves to 66.2% and 94.6%, highlighting the benefit of integrating detection in anti-drone tracking. These gains are attributed to the enhanced feature representation and refined localization, which are critical for handling drones in complex environments.

To dissect the contributions, we conduct ablation studies on the DUT Anti-UAV dataset. The baseline SiamCAR is augmented with the GAM, Alpha-Refine, and Faster-RCNN modules one at a time, and the full SiamGR-FR model combines all three. The results are summarized below:

| Configuration | Success Rate | Precision Rate |
| --- | --- | --- |
| SiamCAR (baseline) | 0.562 | 0.808 |
| + GAM | 0.577 | 0.825 |
| + Alpha-Refine | 0.592 | 0.818 |
| + Faster-RCNN | 0.644 | 0.923 |
| SiamGR-FR (full) | 0.662 | 0.946 |

Adding GAM alone raises the success rate by 1.5 percentage points and the precision rate by 1.7 points over the baseline, validating its role in boosting feature discrimination for anti-drone tasks. Alpha-Refine alone contributes a 3.0-point success rate gain, emphasizing precise localization. Detector integration yields the largest single-module improvement (8.2 points in success, 11.5 points in precision), underscoring its importance for long-term anti-drone tracking. The full SiamGR-FR model combines these advantages and achieves the best results on both metrics.
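The gains quoted above follow directly from the ablation table. As a quick check, the short script below recomputes each configuration's difference from the SiamCAR baseline in percentage points; the values are copied from the table, and the helper name `gain_in_points` is hypothetical.

```python
baseline = (0.562, 0.808)  # SiamCAR: (success rate, precision rate)
rows = {
    "+ GAM": (0.577, 0.825),
    "+ Alpha-Refine": (0.592, 0.818),
    "+ Faster-RCNN": (0.644, 0.923),
    "SiamGR-FR (full)": (0.662, 0.946),
}


def gain_in_points(row, ref=baseline):
    """Difference from the baseline, in percentage points, to one decimal."""
    return tuple(round((v - b) * 100, 1) for v, b in zip(row, ref))


gains = {name: gain_in_points(vals) for name, vals in rows.items()}
for name, (ds, dp) in gains.items():
    print(f"{name}: success +{ds} pts, precision +{dp} pts")
```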

We also analyze performance across specific challenges in anti-drone scenarios. The DUT Anti-UAV dataset categorizes sequences into nine attributes: background change, scale variation, low resolution, out-of-view, fast motion, small target, motion blur, long-term tracking, and occlusion. The success and precision rates for each challenge are computed, with SiamGR-FR consistently ranking first. For instance, in out-of-view situations common in anti-drone operations, SiamGR-FR maintains a success rate of 0.613, compared to 0.481 for SiamGR and lower for others, due to the detector’s re-localization capability. Similarly, for small targets, SiamGR-FR achieves a precision rate of 0.938, demonstrating its efficacy in handling diminutive drones.

Qualitative analysis on sequences v07, v11, v12, and v17 illustrates the practical benefits. In v07, where the drone exits the frame, SiamGR-FR quickly re-acquires the target, while SiamCAR drifts indefinitely. In v11, with background changes between buildings and sky, SiamGR and SiamGR-FR provide stable tracking. In v12 and v17, featuring fast motion and occlusion, SiamGR-FR persists accurately, whereas other methods falter. These visualizations confirm that our attention and refinement modules enhance robustness in dynamic anti-drone environments.

The sensitivity of SiamGR-FR to the confidence threshold $ T_s $ is also examined. Varying $ T_s $ from 0.6 to 0.9 causes less than 1% fluctuation in success rate, indicating robustness. This is vital for real-world anti-drone systems where parameter tuning may be limited. The algorithm’s efficiency is another consideration; SiamGR runs at approximately 40 FPS on our hardware, while SiamGR-FR, due to periodic detection, maintains over 30 FPS, suitable for real-time anti-drone applications.

Beyond empirical results, we delve into the theoretical implications. The GAM module’s ability to retain spatial-channel information aligns with the needs of anti-drone tracking, where drones often appear as small, textured objects against heterogeneous backgrounds. The Alpha-Refine module’s pixel-wise correlation offers a computational advantage over dense correlation, reducing memory usage while preserving accuracy—a key factor for deploying anti-drone systems on edge devices. The fusion with Faster-RCNN introduces a feedback loop that mitigates error accumulation, a common pitfall in long-term anti-drone tracking.

We further explore potential extensions. For instance, the attention mechanism could be adapted to temporal domains, leveraging motion cues specific to drones. The refinement module might incorporate shape priors to handle occlusions better. The detector could be replaced with lightweight alternatives for resource-constrained anti-drone setups. These directions promise to advance anti-drone technology further.

In conclusion, we present SiamGR and SiamGR-FR, two novel algorithms for anti-drone object tracking. By integrating a Global Attention Mechanism and an Alpha-Refine module, we enhance feature representation and localization accuracy. The fusion with Faster-RCNN addresses long-term tracking and out-of-view challenges, critical for anti-drone operations. Extensive evaluations on the DUT Anti-UAV dataset show significant improvements over existing methods, with SiamGR-FR achieving a success rate of 66.2% and a precision rate of 94.6%. Our work provides a robust foundation for video-based anti-drone systems, offering practical solutions for safety and security applications. Future research will focus on handling severe occlusions and optimizing computational efficiency for widespread anti-drone deployment.

The anti-drone domain continues to evolve, with increasing demands for reliable and accessible tracking solutions. Our contributions underscore the importance of tailoring computer vision techniques to specific challenges, such as those posed by drones. We hope this work inspires further innovation in anti-drone technology, ultimately contributing to safer skies and protected infrastructures. The integration of attention, refinement, and detection modules represents a step forward in making anti-drone systems more effective and deployable in diverse real-world scenarios.