In recent years, rapid advances in drone technology have led to the widespread adoption of Unmanned Aerial Vehicles (UAVs) in applications such as surveillance, delivery, and environmental monitoring. A critical challenge for UAVs, however, is maintaining stable self-localization where Global Navigation Satellite System (GNSS) signals are unavailable or degraded, commonly referred to as GNSS-denied environments. Traditional methods, such as Inertial Navigation Systems (INS) and Simultaneous Localization and Mapping (SLAM), have limitations, including error accumulation over time and dependence on real-time map construction. To address these issues, vision-based approaches leveraging deep learning have emerged as promising alternatives. In this paper, we propose FCN-FPI (FocalNet-Find Point Image), a method that integrates cross-view geolocation to enhance the autonomy and robustness of UAVs in GNSS-denied scenarios.

The core of our approach lies in using deep learning to perform cross-view image matching between drone-captured images and pre-existing satellite imagery. Unlike image retrieval-based methods, which require extensive databases and suffer from computational inefficiency, FCN-FPI directly predicts the UAV's position on a satellite map through an end-to-end framework. This reduces storage and inference time and improves real-time performance, which is crucial for dynamic drone operations. By leveraging a Transformer-style architecture and multi-scale feature fusion, we achieve significant gains in localization accuracy, as demonstrated through experiments on a densely sampled dataset.
Drone technology has evolved to incorporate sophisticated sensors and algorithms, enabling UAVs to navigate complex environments. In GNSS-denied environments such as urban canyons or indoor spaces, however, satellite signals are often blocked, which can cause autonomous flight to fail. Vision-based localization methods offer a viable alternative by analyzing visual data from onboard cameras. These methods fall broadly into two categories: image retrieval and direct point localization. Image retrieval techniques match drone images against a database of satellite images, but they are computationally expensive and may not cover all regions of interest. In contrast, direct point localization, as implemented in FPI-based methods, identifies the UAV's location on a satellite image by generating a heatmap that highlights the most probable positions. Our work builds on this concept by introducing enhancements in feature extraction, feature fusion, and the loss function to further boost performance.
Methodology
The proposed FCN-FPI method consists of three main components: image preprocessing, deep feature extraction and fusion, and UAV position resolution. We employ a dual-stream network architecture without weight sharing between the drone and satellite branches to account for the heterogeneous nature of the images. This design allows the model to adapt to the distinct characteristics of each view, such as resolution and perspective differences, which are common in drone technology applications.
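For illustration, a minimal PyTorch sketch of this non-weight-sharing dual-stream design is given below; `make_backbone` is a hypothetical factory standing in for any FocalNet-style multi-scale feature extractor, not the exact implementation used in our experiments.

```python
import torch.nn as nn

class DualStreamEncoder(nn.Module):
    """Dual-stream encoder without weight sharing: the UAV and satellite
    branches are independent backbones, so each can adapt to its own view
    (resolution, perspective). `make_backbone` is a hypothetical factory
    returning a FocalNet-style multi-scale feature extractor."""

    def __init__(self, make_backbone):
        super().__init__()
        self.uav_branch = make_backbone()  # separate weights
        self.sat_branch = make_backbone()  # separate weights

    def forward(self, uav_img, sat_img):
        # Each branch returns a list of multi-scale feature maps
        # (e.g., strides 8, 16, and 32, as described in the next subsection).
        return self.uav_branch(uav_img), self.sat_branch(sat_img)
```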
Feature Extraction Module
For feature extraction, we use FocalNet as the backbone network. FocalNet follows a hierarchical, Transformer-style design but replaces self-attention with focal modulation and context aggregation modules. The focal modulation module enhances local feature interactions by dynamically adjusting feature representations, while context aggregation hierarchically extracts and combines global context information, making FocalNet particularly suitable for dense prediction tasks such as geolocation. Given an input drone image \(X\) of size \(384 \times 384 \times 3\) and a satellite image \(Z\) of size \(256 \times 256 \times 3\), we extract multi-scale features from the first three stages of FocalNet. The output dimensions are as follows:
- Stage 1: \(\frac{H}{8} \times \frac{W}{8} \times 2C\)
- Stage 2: \(\frac{H}{16} \times \frac{W}{16} \times 4C\)
- Stage 3: \(\frac{H}{32} \times \frac{W}{32} \times 8C\)
where \(H\) and \(W\) are the input height and width and \(C\) is the base channel dimension of the backbone, giving output channel counts of 192, 384, and 768 for the three stages, respectively. This multi-scale approach captures both low-level spatial detail and high-level semantic information, which is essential for accurate UAV localization. The focal modulation in FocalNet can be viewed as a function that modulates feature interactions according to contextual importance; we omit the detailed equations here for brevity.
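To make the focal modulation mechanism concrete, the sketch below loosely follows the publicly available FocalNet reference implementation; the focal level, kernel sizes, and layer names are illustrative defaults rather than the exact configuration used in FCN-FPI.

```python
import torch
import torch.nn as nn

class FocalModulation(nn.Module):
    """Simplified focal modulation: a query projection is modulated by a
    hierarchy of gated depthwise-convolution contexts with growing receptive
    fields, plus a globally pooled context (context aggregation)."""

    def __init__(self, dim, focal_level=3, focal_window=3, focal_factor=2):
        super().__init__()
        self.focal_level = focal_level
        # Projects the input to a query, an initial context, and per-level gates.
        self.f = nn.Linear(dim, 2 * dim + (focal_level + 1))
        self.h = nn.Conv2d(dim, dim, kernel_size=1)  # modulator projection
        self.proj = nn.Linear(dim, dim)
        self.act = nn.GELU()
        self.focal_layers = nn.ModuleList()
        for level in range(focal_level):
            k = focal_factor * level + focal_window  # growing kernel size
            self.focal_layers.append(nn.Sequential(
                nn.Conv2d(dim, dim, k, padding=k // 2, groups=dim, bias=False),
                nn.GELU(),
            ))

    def forward(self, x):                              # x: (B, H, W, C)
        C = x.shape[-1]
        x = self.f(x).permute(0, 3, 1, 2)              # (B, 2C + L + 1, H, W)
        q, ctx, gates = torch.split(x, (C, C, self.focal_level + 1), dim=1)
        ctx_all = 0
        for level in range(self.focal_level):
            ctx = self.focal_layers[level](ctx)        # local context at this level
            ctx_all = ctx_all + ctx * gates[:, level:level + 1]
        ctx_global = self.act(ctx.mean(dim=(2, 3), keepdim=True))
        ctx_all = ctx_all + ctx_global * gates[:, self.focal_level:]
        out = q * self.h(ctx_all)                      # modulate the query
        return self.proj(out.permute(0, 2, 3, 1))      # back to (B, H, W, C)
```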
Feature Fusion Module
To effectively combine the multi-scale features, we design a Cross-Level Multi-Feature Fusion (CLMF) module. This module integrates features from different levels of the network using upsampling and lateral connections, similar to a feature pyramid network. Let \(\{X_1, X_2, X_3\}\) denote the feature maps from the drone branch and \(\{Z_1, Z_2, Z_3\}\) from the satellite branch. The fusion process involves the following steps:
- Apply a \(1 \times 1\) convolution to \(X_3\) to reduce channels from 768 to 256, producing \(Y_1\).
- Apply a \(1 \times 1\) convolution to \(X_2\) to reduce channels from 384 to 256, then fuse with upsampled \(Y_1\) (by a factor of 2) to get \(Y_2\).
- Apply a \(1 \times 1\) convolution to \(X_1\) to increase channels from 192 to 256, then fuse with upsampled \(Y_2\) to obtain \(Y_3\).
The same process is applied to the satellite features, resulting in fused features \(\{M_1, M_2, M_3\}\). The final fused features, \(Y_3\) and \(M_3\), are rich in both spatial and semantic information. We then perform a bilinear interpolation on \(Y_3\) to match the size of \(M_3\), enabling the generation of a heatmap that predicts the UAV’s position on the satellite image. This fusion mechanism enhances the model’s ability to handle scale variations and occlusions, which are common challenges in drone technology.
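A compact sketch of this top-down fusion for a single branch is shown below; the use of simple addition as the fusion operator and bilinear upsampling are assumptions about the CLMF internals rather than its exact definition.

```python
import torch.nn as nn
import torch.nn.functional as F

class CrossLevelFusion(nn.Module):
    """FPN-style top-down fusion of three backbone stages (192/384/768
    channels, strides 8/16/32) into a single 256-channel map. Addition is
    used as the fusion operator here as an illustrative choice."""

    def __init__(self, in_channels=(192, 384, 768), out_channels=256):
        super().__init__()
        self.lateral = nn.ModuleList(
            [nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels])

    def forward(self, feats):
        f1, f2, f3 = feats                       # strides 8, 16, 32
        y1 = self.lateral[2](f3)                 # deepest level
        y2 = self.lateral[1](f2) + F.interpolate(
            y1, scale_factor=2, mode="bilinear", align_corners=False)
        y3 = self.lateral[0](f1) + F.interpolate(
            y2, scale_factor=2, mode="bilinear", align_corners=False)
        return y3                                # stride-8, 256 channels

# The same module (with separate weights) is applied to the other branch; the
# two fused maps are then compared, e.g., by correlation, to produce the
# position heatmap (the comparison operator is not prescribed by this sketch).
```

One design note: fusing down to stride 8 preserves enough spatial resolution for point localization while still carrying the semantics of the deeper stages.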
Loss Function
To train the model, we introduce a Gaussian window loss function that prioritizes the center region of the target location. Traditional loss functions like instance loss or triplet loss are designed for image retrieval and are not suitable for direct point localization. The Gaussian loss assigns higher weights to pixels closer to the ground truth location, guiding the model to focus on the most critical areas. The 2D Gaussian window function is defined as:
$$S(m,n,\gamma) = \frac{1}{2\pi\gamma} \exp\left(-\frac{m^2 + n^2}{2\gamma^2}\right)$$
where \(m\) and \(n\) are the horizontal and vertical coordinates within the window, and \(\gamma\) is the standard deviation controlling the spread of the Gaussian. This function creates a weight distribution that peaks at the center, encouraging the model to predict locations near the true position. Compared to average or Hanning window losses, the Gaussian loss provides a narrower main lobe, leading to better convergence and higher accuracy in geolocation tasks for Unmanned Aerial Vehicles.
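The sketch below shows how such a Gaussian window can be turned into a training target; the peak normalization and the pairing with binary cross-entropy are illustrative assumptions, since only the window function itself is specified above.

```python
import math
import torch
import torch.nn.functional as F

def gaussian_window(height, width, center_xy, gamma=5.0):
    """2D Gaussian weight map S(m, n, gamma) centered on the ground-truth pixel."""
    ys = torch.arange(height, dtype=torch.float32)
    xs = torch.arange(width, dtype=torch.float32)
    n, m = torch.meshgrid(ys, xs, indexing="ij")   # n: row offsets, m: column offsets
    cx, cy = center_xy
    d2 = (m - cx) ** 2 + (n - cy) ** 2
    return torch.exp(-d2 / (2 * gamma ** 2)) / (2 * math.pi * gamma)

def gaussian_window_loss(heatmap_logits, center_xy, gamma=5.0):
    """BCE against the Gaussian target, so pixels close to the true position
    dominate the loss (the BCE pairing is an illustrative choice)."""
    _, _, h, w = heatmap_logits.shape              # expects (B, 1, H, W) logits
    target = gaussian_window(h, w, center_xy, gamma).to(heatmap_logits.device)
    target = target / target.max()                 # peak normalized to 1
    return F.binary_cross_entropy_with_logits(
        heatmap_logits, target.expand_as(heatmap_logits))
```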
Experimental Setup and Results
We evaluate our method on the UL14 dataset, a large-scale densely sampled dataset containing drone images captured at heights of 80m, 90m, and 100m, along with corresponding satellite images. The training set includes 6,768 drone images and an equal number of satellite images from 10 universities, while the test set has 2,331 drone images and 27,972 satellite images from 4 universities. This ratio of 1:12 in the test set allows for comprehensive evaluation across different scales and environments. The satellite images vary in ground coverage, with side lengths ranging from 180m to 463m, simulating real-world conditions in drone technology applications.
Our experiments are conducted on an NVIDIA GTX 1080 Ti GPU using Python 3.7 and PyTorch 1.10.2. The model is trained for 30 epochs with a batch size of 16 and an initial learning rate of 0.0001, which is reduced by a factor of 5 at epochs 10, 14, and 16. We use two main evaluation metrics: Meter-Level Accuracy (MA@K) and Relative Distance Score (RDS). MA@K measures the percentage of predictions within \(K\) meters of the true location, calculated as:
$$MA@K = \frac{1}{N} \sum_{i=1}^{N} I_i$$
where \(I_i = 1\) if the localization error \(e_i\) of sample \(i\) is less than \(K\) meters and 0 otherwise, and \(N\) is the total number of test samples. RDS accounts for scale variations by computing a normalized distance based on the image dimensions:
$$RDS = e^{-k \cdot \sqrt{\left(\frac{dx}{w}\right)^2 + \left(\frac{dy}{h}\right)^2}}$$
where \(dx\) and \(dy\) are the pixel distances between predicted and true positions, \(w\) and \(h\) are the width and height of the satellite image, and \(k\) is a scale factor set to 10.
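In practice the two metrics reduce to the short helpers below, assuming pixel errors have already been converted to meters using the known ground resolution of each satellite tile; the function names and interfaces are illustrative.

```python
import numpy as np

def ma_at_k(errors_m, k):
    """Meter-level accuracy MA@K: fraction of samples whose localization
    error (in meters) is below K."""
    errors_m = np.asarray(errors_m, dtype=np.float64)
    return float(np.mean(errors_m < k))

def rds(pred_xy, true_xy, img_wh, k=10.0):
    """Relative Distance Score for one sample, following the formula above."""
    dx, dy = pred_xy[0] - true_xy[0], pred_xy[1] - true_xy[1]
    w, h = img_wh
    return float(np.exp(-k * np.sqrt((dx / w) ** 2 + (dy / h) ** 2)))

# Example: an error of 5.2 m counts toward MA@20 but not MA@5, and a 10-pixel
# offset in both axes on a 256 x 256 tile gives RDS = exp(-0.55) ~= 0.58.
```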
Ablation Studies
We perform ablation studies to assess the impact of different backbone networks and loss functions. Table 1 compares various backbones using MA@5, MA@20, and RDS metrics, with all models trained under the same conditions. FocalNet consistently outperforms others, demonstrating its superiority in feature extraction for UAV localization.
| Backbone | MA@5 | MA@20 | RDS |
|---|---|---|---|
| ViT | 19.03 | 56.25 | 56.20 |
| DeiT | 18.63 | 57.67 | 57.22 |
| PVT | 14.23 | 50.88 | 52.37 |
| PCPVT | 15.79 | 60.04 | 58.01 |
| FocalNet | 22.47 | 63.85 | 63.57 |
Next, we evaluate the loss function by comparing the Hanning window loss with the Gaussian window loss, both using FocalNet as the backbone. As shown in Table 2, the Gaussian loss achieves higher accuracy, highlighting its effectiveness in focusing on the central region for improved geolocation.
| Loss Function | MA@5 | MA@20 | RDS |
|---|---|---|---|
| Hanning | 23.48 | 66.71 | 67.73 |
| Gaussian | 26.31 | 69.58 | 69.74 |
Comparison with State-of-the-Art Methods
We compare FCN-FPI with existing methods, FPI and WAMF-FPI, on the UL14 dataset. The results in Table 3 show that our method achieves the best performance across all metrics, underscoring the advantages of our feature fusion and loss design.
| Algorithm | MA@5 | MA@20 | RDS |
|---|---|---|---|
| FPI | 18.63 | 57.67 | 57.22 |
| WAMF-FPI | 24.94 | 67.28 | 65.33 |
| FCN-FPI | 26.31 | 69.58 | 69.74 |
The improvements in MA@20 (from 67.28% to 69.58%) and RDS (from 65.33% to 69.74%) demonstrate that FCN-FPI effectively enhances localization precision. Visualizations of the heatmaps confirm that the model accurately predicts UAV positions even in challenging scenarios, with errors typically within a few meters. This level of accuracy is critical for applications in autonomous navigation and mission-critical operations using Unmanned Aerial Vehicles.
Conclusion
In this paper, we present FCN-FPI, a deep learning-based method for autonomous localization of UAVs in GNSS-denied environments. By combining a Transformer-style FocalNet backbone, a cross-level multi-feature fusion module, and a Gaussian window loss function, our approach achieves state-of-the-art performance on a densely sampled benchmark. The method addresses key limitations of existing techniques, such as computational inefficiency and scale sensitivity, making it suitable for real-world drone applications. Future work will focus on extending the framework to dynamic environments and integrating additional sensor modalities for enhanced robustness. As drone technology continues to evolve, solutions like FCN-FPI will play a vital role in ensuring the reliability and autonomy of UAVs across diverse scenarios.
