Accurate and reliable positioning is paramount for Unmanned Aerial Vehicle (UAV) operations. While satellite navigation (GNSS) is the primary method, environments where GNSS signals are unavailable or unreliable (denied environments) pose significant challenges. Visual positioning techniques offer a viable alternative, often relying on matching UAV-captured imagery with georeferenced satellite imagery. However, significant differences in viewpoint, resolution, and appearance between UAV nadir images and satellite images hinder robust matching using traditional feature-based methods. This article presents a novel deep learning-based fusion geolocation method, FCN-FPI (FocalNet-Find Point Image), designed to overcome these limitations and achieve precise Unmanned Aerial Vehicle localization without GNSS.

1. Introduction
The proliferation of Unmanned Aerial Vehicles across diverse sectors – including surveillance, delivery, infrastructure inspection, disaster response, and precision agriculture – underscores the critical need for dependable positioning systems. GNSS dependence renders UAVs vulnerable in scenarios involving signal jamming, spoofing, physical obstruction (e.g., urban canyons, dense foliage, or indoor operations), or natural interference. In such GNSS-denied environments, Unmanned Aerial Vehicles risk navigation failure or mission compromise. Alternative positioning strategies include:
- Inertial Navigation Systems (INS): Provide continuous position estimates using accelerometers and gyroscopes but suffer from unbounded drift over time, requiring frequent correction.
- Simultaneous Localization and Mapping (SLAM): Effective but computationally demanding, reliant on real-time environmental mapping, and potentially limited by the absence of prior knowledge or feature-poor environments.
- Vision-Based Localization: Leverages onboard cameras. A prominent approach involves cross-view geolocation, where a UAV image is matched to a georeferenced satellite image database to determine the UAV’s global position.
Existing vision-based UAV localization methods generally fall into two categories:
- Image Retrieval: Treats localization as an image retrieval task. The UAV image is compared against a database of geo-tagged satellite images. The location associated with the most similar satellite image is assigned to the UAV. While conceptually straightforward, this method suffers from scalability issues. Covering large operational areas requires massive databases, leading to prohibitive storage and computational costs during the query phase (comparing the UAV image to every database entry). Furthermore, incomplete coverage leaves “blind spots” where localization is impossible.
- Find Point Image (FPI): Formulates localization as a direct regression problem within the satellite image space. An end-to-end network processes the UAV image and a large satellite image patch, outputting a heatmap over the satellite patch. The peak of this heatmap indicates the predicted UAV location. This method avoids exhaustive database searches and offers the potential for direct, meter-level positioning.
Despite progress, challenges persist in cross-view UAV-to-satellite image matching: significant domain gaps, handling multi-scale variations (UAV altitude, satellite image resolution), and effectively fusing multi-level features for precise heatmap generation. This work introduces FCN-FPI, a method integrating a powerful transformer-based backbone, a dedicated feature fusion module, and an optimized loss function to enhance Unmanned Aerial Vehicle geolocation accuracy and robustness in denied environments.
2. FCN-FPI Methodology Overview
The FCN-FPI framework addresses Unmanned Aerial Vehicle localization by predicting the precise coordinates of the UAV within a provided satellite image patch, based solely on the UAV’s downward-facing camera image. The architecture comprises three core stages: Image Preprocessing, Deep Feature Extraction and Fusion, and UAV Position Solving. The overall workflow is described below.
- Input: A UAV nadir image (384x384x3) and a corresponding satellite image patch (256x256x3) presumed to contain the UAV’s location.
- Output: A heatmap (256x256x1) over the satellite image patch where the maximum value indicates the predicted UAV position.
2.1 Image Preprocessing
Training and evaluation utilize the UL14 dataset, specifically designed for cross-view UAV geolocation. This dataset features dense sampling:
- Training Set: 6,768 UAV images paired 1:1 with 6,768 satellite images covering 10 distinct university campuses. UAV images are captured at precisely controlled altitudes of 80m, 90m, and 100m.
- Test Set: 2,331 UAV images paired with 27,972 satellite images (ratio 1:12) covering 4 different university campuses. Satellite patches vary significantly in scale (ground coverage per pixel), with side lengths ranging from 180m to 463m. UAV images in the test set are captured at the same altitudes (80m, 90m, 100m) as the training set. To rigorously evaluate generalization, test set UAV images are paired with multiple satellite patches of different scales and potentially slightly shifted viewpoints. Data augmentation techniques are applied during training to enhance model robustness.
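The article does not specify which augmentation techniques are applied. The snippet below is a plausible preprocessing sketch, assuming photometric jitter on the UAV query image only (geometric transforms of the satellite patch would also require shifting the ground-truth position label); all transform choices and normalization statistics are assumptions, not settings taken from the source.

```python
import torchvision.transforms as T

# Assumed preprocessing: photometric augmentation on the UAV query image only,
# so the geometric correspondence with the satellite patch label is preserved.
uav_train_transform = T.Compose([
    T.Resize((384, 384)),
    T.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),  # ImageNet stats
])

satellite_transform = T.Compose([
    T.Resize((256, 256)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
```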
2.2 Deep Feature Extraction and Fusion
FCN-FPI employs a dual-stream architecture to process UAV and satellite images independently, acknowledging the significant domain shift between these modalities. Weight sharing between streams, common in Siamese networks for same-domain tasks, is ineffective here.
- Backbone Network: FocalNet. Both streams utilize a transformer-based architecture called FocalNet as the feature extractor. FocalNet builds upon Vision Transformers (ViT) but replaces standard self-attention with a Focal Modulation module and incorporates a Context Aggregation module.
- Focal Modulation: Enhances local feature interaction and adaptability by focusing modulation on relevant contexts, allowing more flexible adjustment of feature responses than standard self-attention (a simplified sketch of this module appears after the CLMF description below).
- Context Aggregation: Hierarchically extracts and aggregates global context information using gating mechanisms, crucial for understanding the broader scene layout in dense prediction tasks like geolocation.
- Feature Extraction Levels: Only the first three stages of the FocalNet backbone are used in both streams, producing multi-scale feature maps:
  - UAV Stream: Features X1 (H/8 x W/8 x 192), X2 (H/16 x W/16 x 384), X3 (H/32 x W/32 x 768)
  - Satellite Stream: Features Z1 (H/8 x W/8 x 192), Z2 (H/16 x W/16 x 384), Z3 (H/32 x W/32 x 768)

  Using only the early stages avoids the overly compressed low-resolution features of the final stage and preserves the spatial detail needed for accurate localization.
- Cross-Level Multi-Feature Fusion (CLMF) Module: This novel module first integrates features from different levels within each stream and then performs the cross-modal fusion needed for prediction. Low-resolution deep features (X3, Z3) carry rich semantic information but lack spatial precision, while high-resolution shallow features (X1, Z1) retain fine spatial detail but lack semantic context. CLMF combines the two as follows (a minimal sketch appears after this list):
  - Intra-Stream Fusion (Pyramid Fusion): Applies a Feature Pyramid Network (FPN)-like structure within each stream. Taking the UAV stream features (X1, X2, X3) as an example:
    - X3 (H/32 x W/32 x 768) is passed through a 1×1 convolution that reduces its channels to 256, yielding Y3'.
    - X2 (H/16 x W/16 x 384) is passed through a 1×1 convolution that reduces its channels to 256, yielding Y2'.
    - Y3' is upsampled 2x and fused with Y2' (e.g., element-wise addition), giving Y2.
    - X1 (H/8 x W/8 x 192) is passed through a 1×1 convolution that increases its channels to 256, yielding Y1'.
    - Y2 is upsampled 2x and fused with Y1', giving Y1.

    This produces a set of fused UAV features {Y1, Y2, Y3} in which each level combines semantic and spatial information. The same procedure applied to the satellite stream features {Z1, Z2, Z3} yields {M1, M2, M3}. Features Y3 and M3 are particularly rich after this fusion.
  - Cross-Stream Fusion: The fused high-level features from the two streams (Y3 and M3) are combined. Y3 is bilinearly interpolated to match the spatial dimensions of M3 (H/32 x W/32), and the aligned features are then fused (e.g., via concatenation or element-wise multiplication) into a joint representation that encodes the relationship between the UAV view and the satellite view at this semantic level.
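To make the backbone description concrete, the block below is a simplified sketch of the focal modulation idea used in FocalNet (query features modulated by hierarchically aggregated, gated context). It is not the official FocalNet implementation, and the class and parameter names are illustrative.

```python
import torch
import torch.nn as nn

class FocalModulation(nn.Module):
    """Simplified focal-modulation block: the query is modulated by context
    aggregated over several depthwise-convolution levels plus a gated global
    average, instead of by pairwise self-attention."""
    def __init__(self, dim, focal_levels=3, focal_window=3):
        super().__init__()
        self.focal_levels = focal_levels
        # one projection produces the query, the initial context, and per-level gates
        self.f = nn.Linear(dim, 2 * dim + focal_levels + 1)
        self.context_convs = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(dim, dim, kernel_size=focal_window + 2 * k,
                          padding=(focal_window + 2 * k) // 2, groups=dim, bias=False),
                nn.GELU(),
            )
            for k in range(focal_levels)
        ])
        self.h = nn.Conv2d(dim, dim, kernel_size=1)   # turns context into a modulator
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                      # x: (B, H, W, C), channel-last grid
        B, H, W, C = x.shape
        q, ctx, gates = torch.split(self.f(x), [C, C, self.focal_levels + 1], dim=-1)
        ctx = ctx.permute(0, 3, 1, 2)          # (B, C, H, W)
        gates = gates.permute(0, 3, 1, 2)      # (B, levels + 1, H, W)

        ctx_all = 0
        for level, conv in enumerate(self.context_convs):
            ctx = conv(ctx)                                        # growing receptive field
            ctx_all = ctx_all + ctx * gates[:, level:level + 1]    # gated aggregation
        ctx_global = ctx.mean(dim=(2, 3), keepdim=True)            # global context level
        ctx_all = ctx_all + ctx_global * gates[:, self.focal_levels:]

        modulator = self.h(ctx_all).permute(0, 2, 3, 1)            # back to (B, H, W, C)
        return self.proj(q * modulator)                            # modulated query
```

The CLMF pipeline described above can likewise be sketched as two small modules, assuming element-wise addition for the intra-stream fusion and concatenation followed by a small convolutional head for the cross-stream fusion; the exact fusion operators in FCN-FPI may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidFusion(nn.Module):
    """Intra-stream (FPN-style) fusion: project each level to 256 channels,
    then merge top-down by upsampling and element-wise addition."""
    def __init__(self, in_channels=(192, 384, 768), out_channels=256):
        super().__init__()
        self.lateral = nn.ModuleList(
            nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels
        )

    def forward(self, x1, x2, x3):
        # x1: (B,192,H/8,W/8), x2: (B,384,H/16,W/16), x3: (B,768,H/32,W/32)
        y3 = self.lateral[2](x3)
        y2 = self.lateral[1](x2) + F.interpolate(y3, size=x2.shape[-2:], mode="bilinear", align_corners=False)
        y1 = self.lateral[0](x1) + F.interpolate(y2, size=x1.shape[-2:], mode="bilinear", align_corners=False)
        return y1, y2, y3

class CrossStreamFusion(nn.Module):
    """Cross-stream fusion: resize the UAV feature Y3 to the satellite feature
    M3's spatial size, concatenate, and mix with a small conv head."""
    def __init__(self, channels=256):
        super().__init__()
        self.mix = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, y3_uav, m3_sat):
        y3 = F.interpolate(y3_uav, size=m3_sat.shape[-2:], mode="bilinear", align_corners=False)
        return self.mix(torch.cat([y3, m3_sat], dim=1))
```

For a 384×384 UAV image and a 256×256 satellite patch, Y3 has spatial size 12×12 and M3 has 8×8, so Y3 is resized to 8×8 before concatenation.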
2.3 UAV Position Solving
The fused feature map produced by the CLMF module (from Y3 and M3) undergoes further convolutional processing and is then upsampled via bilinear interpolation to the spatial resolution of the input satellite image (256×256). The result is a heatmap H of size 256×256, where each value H(i,j) represents the model’s confidence that the Unmanned Aerial Vehicle is located at the corresponding position within the satellite patch. The final predicted UAV position (P_x, P_y) is the location of the heatmap maximum:

(P_x, P_y) = argmax_{i,j} H(i,j)
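A minimal sketch of this step, assuming the preceding convolutional head has already reduced the fused features to a single channel; function and variable names are illustrative rather than taken from the FCN-FPI code.

```python
import torch
import torch.nn.functional as F

def solve_uav_position(fused_response: torch.Tensor, out_size: int = 256):
    """Upsample a single-channel response map to the satellite-patch resolution
    and take the heatmap maximum as the predicted UAV position.

    fused_response: (B, 1, h, w) output of the conv head on the CLMF features.
    Returns the (B, out_size, out_size) heatmap and integer (P_x, P_y) per sample.
    """
    heatmap = F.interpolate(fused_response, size=(out_size, out_size),
                            mode="bilinear", align_corners=False).squeeze(1)
    flat_idx = heatmap.flatten(1).argmax(dim=1)                    # (B,)
    p_y = torch.div(flat_idx, out_size, rounding_mode="floor")     # row index
    p_x = flat_idx % out_size                                      # column index
    return heatmap, torch.stack([p_x, p_y], dim=1)
```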
2.4 Optimized Loss Function: Gaussian Window Loss
Training the FCN-FPI network requires a loss function that effectively guides the heatmap prediction. Standard losses used in image retrieval (e.g., Triplet Loss, Instance Loss) are unsuitable for this direct regression formulation. The loss must encourage the heatmap peak to align precisely with the true UAV location within the satellite image.
- Target Representation: The ground-truth label is a heatmap G of the same size as the input satellite image (256×256). A Center-R parameter defines a square region around the true location center (C_x, C_y) as positive: setting Center-R = 33 creates a 33×33 pixel square centered on (C_x, C_y). Pixels inside this square are positive samples; all others are negative.
- Weighting Strategies: Assigning equal weight within the positive region (Average Window) ignores the intuition that locations closer to the true center should matter more. The Hann Window (a cosine-based weighting) has been used previously (e.g., in FPI) to focus attention towards the center, but its relatively wide main lobe can hinder precise convergence to the exact center point in later training stages.
- Gaussian Window Loss: FCN-FPI proposes a novel Gaussian Window Loss (GWL) to overcome these limitations. The GWL assigns weights within the positive region using a 2D Gaussian function centered on (C_x, C_y):

  G(i,j) = \frac{1}{2\pi\sigma^2} \exp\left(-\frac{(i - C_x)^2 + (j - C_y)^2}{2\sigma^2}\right)

  where:
  - (i, j) are pixel coordinates within the satellite image heatmap.
  - (C_x, C_y) are the coordinates of the true UAV location center.
  - σ is the standard deviation controlling the spread of the Gaussian kernel; a smaller σ creates a narrower peak, forcing the model to predict very close to the center. The choice of σ is a hyperparameter.

  The target heatmap G is normalized so its values sum to 1, and the predicted heatmap H is normalized likewise (e.g., via a softmax over all pixels or by scaling to sum to 1). The loss is then the Kullback-Leibler (KL) divergence or the mean squared error (MSE) between the prediction H and the Gaussian-weighted target G (a minimal sketch follows Table 1):

  \mathcal{L}_{GWL} = D_{KL}(G \parallel H) \quad \text{or} \quad \mathcal{L}_{GWL} = \frac{1}{N} \sum_{i=1}^{256} \sum_{j=1}^{256} \left( G(i,j) - H(i,j) \right)^2

  where N is the total number of pixels (256 × 256 = 65,536). The Gaussian window’s narrow peak provides a stronger gradient signal towards the exact center than the Hann window, noticeably improving fine-grained localization accuracy.
Table 1: Gaussian Window Loss Parameters
Parameter | Symbol | Role | Typical Value/Consideration |
---|---|---|---|
Center Coordinates | (C_x, C_y) | True UAV location in satellite image | Ground Truth |
Standard Deviation | σ | Controls the width/spread of the Gaussian focus. | Hyperparameter (e.g., ~Center-R/3). Smaller σ = sharper focus on center. |
Positive Region Size | Center-R | Defines the square area considered “positive” around (C_x, C_y). | Hyperparameter (e.g., 33 pixels). Affects training stability. |
Loss Type | D_{KL} or MSE | Measures discrepancy between target G and prediction H. | KL Divergence often used for probability distributions. |
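A minimal sketch of the Gaussian-window target and a KL-divergence training loss, following the formulas above. The default σ = Center-R/3 and the zeroing of weights outside the positive square are assumptions (Table 1 lists σ only as a hyperparameter); the 1/(2πσ²) factor cancels once the target is normalized to sum to 1.

```python
import torch
import torch.nn.functional as F

def gaussian_window_target(center_xy, size=256, center_r=33, sigma=None):
    """Build a normalized Gaussian target heatmap centered on (C_x, C_y).

    center_xy: true UAV location in satellite-patch pixel coordinates.
    center_r:  side length of the positive square; weights outside it are zeroed.
    sigma:     Gaussian spread; center_r / 3 is one plausible default.
    """
    if sigma is None:
        sigma = center_r / 3.0
    cx, cy = center_xy
    ys, xs = torch.meshgrid(torch.arange(size, dtype=torch.float32),
                            torch.arange(size, dtype=torch.float32), indexing="ij")
    g = torch.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
    half = center_r // 2
    g = g * ((xs - cx).abs() <= half) * ((ys - cy).abs() <= half)   # positive region only
    return g / g.sum()                                              # sums to 1

def gaussian_window_loss(pred_logits, target):
    """KL divergence D_KL(G || H) between the Gaussian target and the
    softmax-normalized predicted heatmap.

    pred_logits: (B, 256, 256) raw heatmap scores from the network.
    target:      (B, 256, 256) stacked outputs of gaussian_window_target.
    """
    log_h = F.log_softmax(pred_logits.flatten(1), dim=1)
    return F.kl_div(log_h, target.flatten(1), reduction="batchmean")
```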
3. Experimental Setup and Evaluation
3.1 Dataset: UL14
The UL14 dataset is a large-scale, densely sampled benchmark designed specifically for UAV cross-view geolocation research. Its characteristics are summarized below:
Table 2: UL14 Dataset Composition
Split | Unmanned Aerial Vehicle Images | Satellite Images | Campus Locations | UAV:Satellite Ratio | UAV Altitudes | Satellite Patch Size (Pixel) | Satellite Patch Scale (Ground) |
---|---|---|---|---|---|---|---|
Training | 6,768 | 6,768 | 10 | 1:1 | 80m, 90m, 100m | 1280×1280 | Fixed per campus |
Test | 2,331 | 27,972 | 4 | 1:12 | 80m, 90m, 100m | 256×256 | 180m – 463m per side |
Key Features:
- Dense Sampling: Multiple UAV images captured per location at controlled altitudes.
- Scale Variation: Satellite patches in the test set cover significantly different ground areas, posing a major challenge.
- Real-World Scenes: Focuses on university campuses with diverse terrain and structures.
- Testing Rigor: The 1:12 ratio in the test set evaluates the model’s ability to find the Unmanned Aerial Vehicle location within a large search space defined by multiple satellite patches.
3.2 Implementation Details
- Hardware: NVIDIA GeForce GTX 1080 Ti GPU.
- Software: Python 3.7, PyTorch 1.10.2.
- Training:
- Batch Size: 16
- Epochs: 30
- Initial Learning Rate: 0.0001
- Learning Rate Schedule: Reduced by a factor of 5 at epochs 10, 14, and 16.
- Optimizer: Not explicitly specified; Adam or SGD would be standard practice (see the sketch after this list).
- Backbone: FocalNet (pretrained weights likely used).
- Loss: Gaussian Window Loss (σ and Center-R optimized).
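A minimal training-schedule sketch matching the stated settings. The optimizer choice (Adam) and the stand-in model are assumptions; only the learning rate, milestones, epoch count, and reduction factor come from the article.

```python
import torch
from torch.optim.lr_scheduler import MultiStepLR

model = torch.nn.Conv2d(3, 1, kernel_size=3, padding=1)   # stand-in for the FCN-FPI network

# Optimizer is an assumption; the article does not specify it.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# "Reduced by a factor of 5 at epochs 10, 14, and 16"  ->  gamma = 1/5 = 0.2
scheduler = MultiStepLR(optimizer, milestones=[10, 14, 16], gamma=0.2)

for epoch in range(30):          # 30 epochs, batch size 16 during training
    # ... one pass over the UL14 training split would go here ...
    scheduler.step()
```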
3.3 Evaluation Metrics
Assessing Unmanned Aerial Vehicle geolocation performance requires metrics sensitive to both absolute positioning error and robustness to scale variations inherent in satellite imagery.
- Meter-Level Accuracy (MA@K): The percentage of test samples whose predicted location lies within K meters of the true location, computed with geodesic distance from latitude and longitude (see the sketch after this list):
  - For each sample i, compute the spatial error e_i (in meters).
  - Define an indicator
    I_i = \begin{cases} 1, & \text{if } e_i \leq K \\ 0, & \text{otherwise} \end{cases}
  - Then
    MA@K = \frac{1}{N} \sum_{i=1}^{N} I_i \times 100\%

  where N is the total number of test samples. Common thresholds are K = 5 m and K = 20 m (MA@5, MA@20); higher values indicate better performance.
- Relative Distance Score (RDS): Addresses the limitation of MA@K when satellite patch scales vary dramatically. A prediction might be only a few pixels off in a high-resolution patch (small ground error) but many pixels off in a low-resolution patch (large ground error), even when the pixel error is similar relative to the patch size. RDS normalizes the error by the satellite image dimensions:

  RDS = \exp\left( -\frac{ \left( \frac{d_x}{W} \right)^2 + \left( \frac{d_y}{H} \right)^2 }{2 \alpha^2} \right)

  where:
  - d_x, d_y: pixel errors along the x-axis and y-axis between the predicted location (P_x, P_y) and the true location (C_x, C_y).
  - W, H: width and height of the satellite image patch in pixels.
  - α: a scaling factor (set to 10 in this work).

  RDS ranges from 0 to 1, with values closer to 1 indicating smaller normalized errors. It measures how well the model predicts the location relative to the satellite image context, independent of the absolute ground scale. The average RDS over the test set is reported.
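The two metrics can be computed as sketched below. The haversine approximation for geodesic distance is an assumption (any geodesic library would serve); ma_at_k and rds follow the formulas above directly.

```python
import math

def geodesic_distance_m(lat1, lon1, lat2, lon2):
    """Approximate ground distance in meters between two lat/lon points (haversine)."""
    r = 6371000.0                                   # mean Earth radius in meters
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi, dlmb = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def ma_at_k(errors_m, k):
    """MA@K: percentage of samples whose error e_i is within K meters."""
    return 100.0 * sum(e <= k for e in errors_m) / len(errors_m)

def rds(dx_px, dy_px, width_px, height_px, alpha=10.0):
    """Relative Distance Score for one sample, as defined above (0 to 1)."""
    return math.exp(-((dx_px / width_px) ** 2 + (dy_px / height_px) ** 2) / (2 * alpha ** 2))
```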
Table 3: Geolocation Evaluation Metrics
Metric | Formula | Interpretation | Strengths | Weaknesses |
---|---|---|---|---|
MA@K | (1/N) Σ_i I(e_i ≤ K) × 100% | % of UAVs located within K meters. | Intuitive, absolute error measure. Directly relevant to application. | Sensitive to satellite image scale. Performance drops naturally with larger scale patches. |
RDS | exp(−[(d_x/W)^2 + (d_y/H)^2] / (2α^2)) | Normalized localization quality score (0–1). | Scale-invariant. Measures prediction quality relative to the satellite image context. | Less intuitive direct physical meaning. Depends on hyperparameter α. |
4. Results and Analysis
4.1 Ablation Study: Backbone Network
The choice of feature extraction backbone significantly impacts Unmanned Aerial Vehicle localization performance. We compare FocalNet against prominent Vision Transformer (ViT) variants using the UL14 test set under identical settings (CLMF module, Gaussian Loss). Results demonstrate FocalNet’s superiority.
Table 4: Backbone Network Comparison (MA in %, RDS in %)
Backbone | MA@5 | MA@20 | RDS |
---|---|---|---|
ViT | 19.03 | 56.25 | 56.20 |
DeiT | 18.63 | 57.67 | 57.22 |
PVT | 14.23 | 50.88 | 52.37 |
PCPVT | 15.79 | 60.04 | 58.01 |
FocalNet | 22.47 | 63.85 | 63.57 |
Analysis: FocalNet consistently outperforms the other backbones across all metrics. The gains over ViT/DeiT (+3.44 to +3.84% in MA@5, +6.18 to +7.60% in MA@20, and +6.35 to +7.37% in RDS) highlight the effectiveness of its Focal Modulation and Context Aggregation modules for capturing the complex spatial and semantic relationships needed for precise Unmanned Aerial Vehicle localization in cross-view scenarios. The PVT variants perform worse, potentially due to architectural choices less suited to this task.
4.2 Ablation Study: Loss Function
Using the optimal FocalNet backbone, we evaluate the impact of the proposed Gaussian Window Loss (GWL) against the commonly used Hann Window Loss.
Table 5: Loss Function Ablation (MA in %, RDS in %)
Loss Function | MA@5 | MA@20 | RDS |
---|---|---|---|
Hann Window | 23.48 | 66.71 | 67.73 |
Gaussian Window (GWL) | 26.31 | 69.58 | 69.74 |
Analysis: The Gaussian Window Loss provides substantial gains: +2.83% in MA@5, +2.87% in MA@20, and +2.01% in RDS compared to the Hann Window Loss. This confirms the hypothesis that the sharper gradient provided by the Gaussian’s narrow peak around the true center location enables more precise convergence during training, leading to significantly improved localization accuracy for the Unmanned Aerial Vehicle.
4.3 Comparison with State-of-the-Art
We compare the full FCN-FPI model (FocalNet backbone + CLMF + GWL) against two representative prior methods on the UL14 test set: the foundational FPI and the recent WAMF-FPI.
Table 6: State-of-the-Art Comparison (MA in %, RDS in %)
Method | MA@5 | MA@20 | RDS |
---|---|---|---|
FPI | 18.63 | 57.67 | 57.22 |
WAMF-FPI | 24.94 | 67.28 | 65.33 |
FCN-FPI (Ours) | 26.31 | 69.58 | 69.74 |
Analysis:
- FCN-FPI significantly outperforms the baseline FPI method (+7.68% MA@5, +11.91% MA@20, +12.52% RDS), demonstrating the effectiveness of the integrated architectural improvements (FocalNet, CLMF) and the optimized loss (GWL).
- FCN-FPI also achieves clear improvements over the recent WAMF-FPI method (+1.37% MA@5, +2.30% MA@20, +4.41% RDS). This establishes FCN-FPI as a new state-of-the-art for UAV cross-view geolocation on the challenging UL14 benchmark. The improvements in RDS are particularly notable, indicating better robustness to the scale variations present in the satellite patches.
Visualization: Qualitative results show FCN-FPI heatmaps with sharper, more concentrated peaks centered near the true UAV location across diverse satellite patches, including those covering large areas or containing complex scenes, compared with the heatmaps generated by FPI or WAMF-FPI. Error vectors between predicted and true locations are visibly smaller for FCN-FPI across numerous test cases.
5. Conclusion and Engineering Value
This article presented FCN-FPI, a novel deep learning framework for precise Unmanned Aerial Vehicle localization in GNSS-denied environments. FCN-FPI directly regresses the UAV’s position within a satellite image patch based on the UAV’s nadir view. Key innovations include:
- Transformer-Based Feature Extraction: Utilizing the FocalNet backbone with Focal Modulation and Context Aggregation for effective multi-scale feature learning from both UAV and satellite imagery.
- Cross-Level Multi-Feature Fusion (CLMF): A dedicated module combining hierarchical features within each modality and performing cross-modal fusion to generate a rich joint representation linking the UAV view to the satellite context.
- Gaussian Window Loss (GWL): A novel loss function applying Gaussian weighting centered on the true UAV location, providing stronger gradients for precise localization compared to previous windowing functions like Hann.
Rigorous evaluation on the UL14 benchmark demonstrates FCN-FPI’s superiority. Ablation studies confirm the contributions of the FocalNet backbone (+3.81 to +12.97% MA@20 over the other backbones tested) and the Gaussian Window Loss (+2.87% MA@20 over the Hann Window Loss). FCN-FPI achieves state-of-the-art results, outperforming FPI by +11.91% MA@20 and WAMF-FPI by +2.30% MA@20 and +4.41% RDS.
Engineering Value for Unmanned Aerial Vehicle Operations:
The advancements embodied in FCN-FPI translate directly into tangible benefits for UAV applications operating where GNSS is unreliable or unavailable:
- Enhanced Robustness and Accuracy: Improvements in MA@20 (69.58%) and RDS (69.74%) signify a system capable of locating a UAV within approximately 20 meters nearly 70% of the time in complex, scale-variant environments, a critical threshold for many practical tasks. The scale-invariant RDS improvement highlights robustness.
- Operational Continuity: Enables Unmanned Aerial Vehicles to maintain positioning and navigation capabilities in critical denied environments (urban canyons, near sensitive infrastructure, during electronic warfare scenarios, under dense canopy).
- Reduced Infrastructure Dependence: Less reliance on pre-deployed beacons or dense communication networks for fallback positioning.
- Enabling Complex Missions: Reliable vision-based localization is foundational for autonomous missions in challenging settings:
- Search and Rescue: Locating survivors or assessing damage in disaster zones (earthquakes, floods) where GNSS or infrastructure is compromised.
- Precision Infrastructure Inspection: Autonomous close inspection of power lines, pipelines, bridges, or wind turbines, often requiring flight paths where GNSS signals are obstructed.
- Confined Space Operations: Navigation within warehouses, mines, or underground structures.
- Military and Security: Covert surveillance, reconnaissance, and patrol in contested or GPS-denied areas.
- Urban Air Mobility (UAM): Potential component for navigation in dense urban environments with complex signal multipath and obstruction.
- Scalability: The FPI paradigm avoids the database scaling issues of retrieval-based methods, making FCN-FPI potentially more scalable for large-area operations.
Table 7: FCN-FPI Engineering Impact on Unmanned Aerial Vehicle Applications
Application Domain | Challenge in Denied Environment | FCN-FPI Contribution |
---|---|---|
Disaster Response (SAR, Damage Assess.) | Collapsed buildings/jammed signals block GNSS. Need precise location for victims/assets. | Enables UAV positioning & mapping without GNSS. Accurate location reporting. |
Critical Infrastructure Inspect. (Power, Pipelines) | GNSS unreliable near tall structures/vegetation. Precise positioning needed for defect logging. | Allows autonomous flight & accurate geo-tagging of inspection findings without GNSS. |
Indoor/Confined Space Operation | GNSS completely unavailable inside structures/mines. | Provides primary positioning capability based on visual matching to prior maps/surveys. |
Military/Security (Recce, Patrol) | GNSS jamming/spoofing in contested areas. Covert ops require non-emitting navigation. | Offers passive, jamming-resistant positioning based on visual scene matching. |
Urban Air Mobility (UAM) Navigation | Severe GNSS multipath/degradation in dense cities. Need robust backup/primary nav. | Potential component of a multi-sensor navigation suite for urban drone operations. |
Future Work: Directions include exploring temporal fusion for video input, further optimizing inference speed for real-time deployment on Unmanned Aerial Vehicle platforms, integrating FCN-FPI with INS for smoother navigation, adapting to extreme weather/lighting conditions, and extending the approach to global localization without an initial position estimate. FCN-FPI represents a significant step towards achieving reliable, autonomous Unmanned Aerial Vehicle operation in the most demanding environments.