I. Introduction
The rapid proliferation of Unmanned Aerial Vehicles (UAVs) across industries such as logistics, surveillance, and disaster response underscores the critical need for reliable positioning systems. Traditional UAVs depend predominantly on Global Navigation Satellite Systems (GNSS), leaving them vulnerable in GNSS-denied environments where signals are obstructed or disrupted (e.g., urban canyons, dense forests, or adversarial jamming). In such scenarios, vision-based localization emerges as a vital alternative. However, cross-view matching, i.e., aligning a UAV’s nadir imagery with pre-existing satellite maps, faces significant hurdles due to drastic perspective shifts, resolution mismatches, and lighting variations. Existing methods, including Siamese networks for image retrieval and heatmap-based point estimation, struggle to combine robustness with real-time efficiency. To overcome these limitations, we propose FCN-FPI (FocalNet-Find Point Image), a deep learning framework that integrates multi-scale feature extraction, cross-level feature fusion, and an optimized loss design for precise UAV geolocation without GNSS.

II. Methodology
II.A. Overall Architecture
FCN-FPI employs a dual-stream pipeline processing UAV nadir images and satellite maps independently (Fig. 1). Unlike Siamese networks, our design avoids weight sharing between streams to accommodate heterogeneous sensor characteristics. The workflow comprises:
- Input Preprocessing: UAV images (384×384×3) and satellite images (256×256×3) are standardized.
- Multi-Scale Feature Extraction: A Transformer-based FocalNet backbone processes both inputs.
- Cross-Level Feature Fusion: Our CLMF module hierarchically combines shallow (high-resolution) and deep (semantic-rich) features.
- Position Estimation: A heatmap predicts the UAV’s coordinates in the satellite map.
Figure 1: FCN-FPI Pipeline
Component | Input | Output |
---|---|---|
UAV Stream | 384×384×3 image | Multi-scale features {X1, X2, X3} |
Satellite Stream | 256×256×3 image | Multi-scale features {Z1, Z2, Z3} |
CLMF Fusion | {Xi}, {Zi} | Fused features {Y1, Y2, Y3}, {M1, M2, M3} |
Geolocation Head | Y3, M3 | Heatmap → Predicted (x, y) |
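To make the data flow concrete, the following PyTorch sketch mirrors the pipeline of Fig. 1 at a structural level; the class and argument names, and the argmax read-out of the heatmap, are illustrative assumptions rather than the exact implementation.

```python
import torch
import torch.nn as nn

class FCNFPISketch(nn.Module):
    """Structural sketch of the dual-stream pipeline in Fig. 1 (names are illustrative)."""
    def __init__(self, uav_backbone, sat_backbone, uav_fusion, sat_fusion, head):
        super().__init__()
        self.uav_backbone = uav_backbone   # 384x384x3 UAV image  -> [X1, X2, X3]
        self.sat_backbone = sat_backbone   # 256x256x3 sat. image -> [Z1, Z2, Z3]
        self.uav_fusion = uav_fusion       # [Xi] -> (Y1, Y2, Y3); no weights shared with sat_fusion
        self.sat_fusion = sat_fusion       # [Zi] -> (M1, M2, M3)
        self.head = head                   # (Y3, M3) -> heatmap over the satellite map

    def forward(self, uav_img, sat_img):
        X = self.uav_backbone(uav_img)
        Z = self.sat_backbone(sat_img)
        _, _, Y3 = self.uav_fusion(X)
        _, _, M3 = self.sat_fusion(Z)
        heatmap = self.head(Y3, M3)                              # (B, 1, Hs, Ws)
        flat = heatmap.flatten(1).argmax(dim=1)                  # predicted position = heatmap argmax
        y = torch.div(flat, heatmap.shape[-1], rounding_mode='floor')
        x = flat % heatmap.shape[-1]
        return heatmap, torch.stack([x, y], dim=1)               # heatmap and (x, y) in pixels
```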
II.B. FocalNet Backbone
FocalNet replaces the self-attention of standard Vision Transformers (ViT) with focal modulation and hierarchical context aggregation, enhancing local feature refinement while retaining global context. Given an input image $I$, features are extracted at three scales:

$$\text{Stage 1:}\quad F_1 = \mathrm{FocalMod}(I) \in \mathbb{R}^{\frac{H}{8} \times \frac{W}{8} \times 192}$$
$$\text{Stage 2:}\quad F_2 = \mathrm{FocalMod}(F_1) \in \mathbb{R}^{\frac{H}{16} \times \frac{W}{16} \times 384}$$
$$\text{Stage 3:}\quad F_3 = \mathrm{FocalMod}(F_2) \in \mathbb{R}^{\frac{H}{32} \times \frac{W}{32} \times 768}$$
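To illustrate the mechanism, below is a highly simplified focal modulation block in PyTorch. It captures the core idea (a query modulated by gated, hierarchically aggregated context from depthwise convolutions plus a pooled global context); the exact layer structure, number of focal levels, and window sizes of the backbone used here are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class FocalModulation(nn.Module):
    """Simplified focal modulation block (illustrative; not the exact backbone layer).

    Each query token is modulated by context aggregated over progressively
    larger depthwise-convolution windows plus a gated global context.
    """
    def __init__(self, dim, focal_levels=2, focal_window=3):
        super().__init__()
        self.focal_levels = focal_levels
        # one linear layer yields the query, the initial context, and per-level gates
        self.f = nn.Linear(dim, 2 * dim + focal_levels + 1)
        self.h = nn.Conv2d(dim, dim, kernel_size=1)   # modulator projection
        self.proj = nn.Linear(dim, dim)               # output projection
        self.layers = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(dim, dim, kernel_size=focal_window + 2 * k,
                          padding=(focal_window + 2 * k) // 2, groups=dim),
                nn.GELU(),
            )
            for k in range(focal_levels)
        ])

    def forward(self, x):                              # x: (B, H, W, C)
        B, H, W, C = x.shape
        q, ctx, gates = torch.split(self.f(x), [C, C, self.focal_levels + 1], dim=-1)
        ctx = ctx.permute(0, 3, 1, 2)                  # (B, C, H, W)
        gates = gates.permute(0, 3, 1, 2)              # (B, levels + 1, H, W)
        ctx_all = 0.0
        for level, layer in enumerate(self.layers):    # hierarchical context aggregation
            ctx = layer(ctx)
            ctx_all = ctx_all + ctx * gates[:, level:level + 1]
        ctx_global = ctx.mean(dim=(2, 3), keepdim=True)
        ctx_all = ctx_all + ctx_global * gates[:, self.focal_levels:]
        modulator = self.h(ctx_all).permute(0, 2, 3, 1)
        return self.proj(q * modulator)                # focal modulation: query * modulator
```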
*Table 1: Backbone Comparison (MA@5 / MA@20 / RDS)*
Backbone | MA@5 (%) | MA@20 (%) | RDS (%) |
---|---|---|---|
ViT | 19.03 | 56.25 | 56.20 |
PVT | 14.23 | 50.88 | 52.37 |
FocalNet | 22.47 | 63.85 | 63.57 |
II.C. CLMF Feature Fusion Module
Shallow features (X1, Z1) retain spatial details but lack semantics, while deep features (X3, Z3) encode abstract context at low resolution. Our Cross-Level Multi-feature Fusion (CLMF) module bridges this gap via upsampling and lateral connections (spatial sizes below are for the 384×384 UAV stream):
- Project the deepest feature X3 (12×12×768 → 12×12×256) via a 1×1 convolution → Y1.
- Upsample Y1 2× → 24×24×256 and fuse it with the 1×1-projected X2 (24×24×384 → 24×24×256) → Y2.
- Upsample Y2 2× → 48×48×256 and fuse it with the 1×1-projected X1 (48×48×192 → 48×48×256) → Y3.
The same process generates satellite features {M1, M2, M3}. Final heatmap prediction uses Y3 and M3 via bilinear interpolation and convolution.
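Read as a feature-pyramid construction, the three steps above can be sketched as follows for the UAV stream; the 1×1 lateral convolutions, bilinear upsampling, and element-wise addition used for "fuse" are assumptions consistent with the stated tensor shapes, not a verbatim reproduction of CLMF.

```python
import torch.nn as nn
import torch.nn.functional as F

class CLMFSketch(nn.Module):
    """FPN-style sketch of the three fusion steps (UAV stream, 384x384 input)."""
    def __init__(self, channels=(192, 384, 768), out_dim=256):
        super().__init__()
        c1, c2, c3 = channels                                # X1, X2, X3 channel widths
        self.lat3 = nn.Conv2d(c3, out_dim, kernel_size=1)    # X3: 12x12x768 -> 12x12x256
        self.lat2 = nn.Conv2d(c2, out_dim, kernel_size=1)    # X2: 24x24x384 -> 24x24x256
        self.lat1 = nn.Conv2d(c1, out_dim, kernel_size=1)    # X1: 48x48x192 -> 48x48x256

    def forward(self, feats):
        X1, X2, X3 = feats                                   # shallow -> deep, channels-first
        Y1 = self.lat3(X3)                                                    # 12x12x256
        Y2 = self.lat2(X2) + F.interpolate(Y1, scale_factor=2,
                                           mode='bilinear', align_corners=False)  # 24x24x256
        Y3 = self.lat1(X1) + F.interpolate(Y2, scale_factor=2,
                                           mode='bilinear', align_corners=False)  # 48x48x256
        return Y1, Y2, Y3
```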
II.D. Gaussian Window Loss
Standard losses (e.g., triplet loss) are poorly suited to cross-view geolocation because they do not concentrate supervision around the true position on the map. We introduce a Gaussian Window Loss to prioritize accuracy near the ground-truth center:
- Define a square positive region (33×33 px) centered at the true UAV position.
- Assign weights using a 2D Gaussian kernel:
$$\mathcal{L}_{\mathrm{Gauss}} = -\sum_{m,n} \frac{1}{2\pi\gamma^{2}} \exp\!\left(-\frac{(m - m_c)^{2} + (n - n_c)^{2}}{2\gamma^{2}}\right) \log P(m, n)$$
where $(m_c, n_c)$ is the center, $\gamma$ controls the spread, and $P$ is the predicted heatmap. This sharpens the focus around the true position compared with flat or Hanning windows (Fig. 2).
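A minimal PyTorch sketch of this loss is shown below, assuming the heatmap has already been normalized to probabilities (e.g., by a spatial softmax); restricting the Gaussian weights to the 33×33 positive region follows the definition above, while the default γ is an assumption.

```python
import math
import torch

def gaussian_window_loss(pred_heatmap, center, window=33, gamma=None):
    """Gaussian window loss over a probability heatmap.

    pred_heatmap: (B, H, W) probabilities (e.g., after a spatial softmax).
    center:       (B, 2) ground-truth (row, col) positions on the heatmap grid.
    gamma:        Gaussian spread; the default window / 6 is an assumption.
    """
    B, H, W = pred_heatmap.shape
    gamma = gamma if gamma is not None else window / 6.0
    rows = torch.arange(H, device=pred_heatmap.device).view(1, H, 1).float()
    cols = torch.arange(W, device=pred_heatmap.device).view(1, 1, W).float()
    mc = center[:, 0].view(B, 1, 1).float()
    nc = center[:, 1].view(B, 1, 1).float()
    dist2 = (rows - mc) ** 2 + (cols - nc) ** 2
    weight = torch.exp(-dist2 / (2 * gamma ** 2)) / (2 * math.pi * gamma ** 2)
    # keep only the square positive region (window x window) around the true position
    inside = (torch.abs(rows - mc) <= window // 2) & (torch.abs(cols - nc) <= window // 2)
    weight = weight * inside.float()
    nll = -torch.log(pred_heatmap.clamp_min(1e-12))
    return (weight * nll).sum(dim=(1, 2)).mean()
```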
*Table 2: Loss Function Ablation (MA@5 / MA@20 / RDS)*
Loss | MA@5 (%) | MA@20 (%) | RDS (%) |
---|---|---|---|
Hanning Window | 23.48 | 66.71 | 67.73 |
Gaussian Window | 26.31 | 69.58 | 69.74 |
III. Experiments
III.A. Dataset & Setup
We use UL14, a dense UAV-satellite dataset:
- Training: 6,768 UAV images (captured at 80 m, 90 m, and 100 m altitudes) + 6,768 satellite images (10 universities).
- Testing: 2,331 UAV images + 27,972 satellite images (4 universities; satellite tiles cover 180–463 m per edge).
Table 3: UL14 Dataset Composition
Split | UAV Images | Satellite Images | Coverage |
---|---|---|---|
Training | 6,768 | 6,768 | 10 universities |
Testing | 2,331 | 27,972 | 4 universities |
Evaluation Metrics:
- MA@K: Percentage of predictions within K meters of the ground truth:
$$\mathrm{MA@K} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{I}\,(e_i < K), \qquad e_i = \mathrm{Haversine}\big((x_i^{\mathrm{pred}}, y_i^{\mathrm{pred}}), (x_i^{\mathrm{gt}}, y_i^{\mathrm{gt}})\big)$$
- Relative Distance Score (RDS): Scale-invariant accuracy:
$$\mathrm{RDS} = \exp\!\left(-\frac{1}{2}\left(\frac{d_x}{a \cdot \sigma}\right)^{2} - \frac{1}{2}\left(\frac{d_y}{b \cdot \sigma}\right)^{2}\right)$$
where $d_x, d_y$ are the pixel errors, $a \times b$ is the satellite image size, and $\sigma = 10$.
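For reference, both metrics can be computed as below; the Haversine helper assumes the predicted and ground-truth coordinates are latitude/longitude pairs in degrees, which is an interpretation of the formula above.

```python
import numpy as np

def haversine(lat1, lon1, lat2, lon2, radius_m=6_371_000.0):
    """Great-circle distance in meters between two points given in degrees."""
    p1, p2 = np.radians(lat1), np.radians(lat2)
    dphi = np.radians(lat2 - lat1)
    dlmb = np.radians(lon2 - lon1)
    h = np.sin(dphi / 2) ** 2 + np.cos(p1) * np.cos(p2) * np.sin(dlmb / 2) ** 2
    return 2 * radius_m * np.arcsin(np.sqrt(h))

def ma_at_k(errors_m, k):
    """MA@K: fraction of predictions whose localization error is below K meters."""
    return float((np.asarray(errors_m) < k).mean())

def rds(dx, dy, a, b, sigma=10.0):
    """Relative Distance Score for one prediction: dx, dy are pixel errors,
    a x b is the satellite image size in pixels, and sigma = 10."""
    return float(np.exp(-0.5 * (dx / (a * sigma)) ** 2 - 0.5 * (dy / (b * sigma)) ** 2))
```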
Implementation: PyTorch on an NVIDIA GTX 1080 Ti GPU; Adam optimizer (learning rate 0.0001, batch size 16).
III.B. Results
FCN-FPI outperforms state-of-the-art (SOTA) methods:
Table 4: Comparison with SOTA Models
Model | MA@5 (%) | MA@20 (%) | RDS (%) |
---|---|---|---|
FPI [19] | 18.63 | 57.67 | 57.22 |
WAMF-FPI [20] | 24.94 | 67.28 | 65.33 |
FCN-FPI (Ours) | 26.31 | 69.58 | 69.74 |
Key improvements:
- MA@20: +2.30 percentage points vs. WAMF-FPI, +11.91 vs. FPI.
- RDS: +4.41 percentage points vs. WAMF-FPI, +12.52 vs. FPI.
Visualization confirms precise heatmap concentration near ground truth (Fig. 3).
IV. Conclusion
We present FCN-FPI, a robust vision-based framework for Unmanned Aerial Vehicle geolocation in GNSS-denied settings. By combining FocalNet’s multi-scale feature extraction, CLMF’s cross-level fusion, and the Gaussian window loss, our method achieves state-of-the-art accuracy on the UL14 benchmark. This advances autonomous UAV navigation in critical applications such as urban reconnaissance and emergency response. Future work includes real-time optimization and multi-sensor fusion (e.g., with inertial data).