Hierarchical Feature Fusion for Multi-Stage Cross-View Camera Drone Localization

Accurate geolocation of camera drones remains challenging in GNSS-denied environments such as urban canyons and dense forests, where satellite signals are obstructed. This work presents a training-free framework for cross-view matching between UAV imagery and satellite databases that mimics human visual cognition, moving from holistic attributes to fine-grained details. The method removes the dependency on pair-wise training data while maintaining competitive accuracy (71.24% Recall@1 on University-1652) and cutting average processing time by about 45% relative to brute-force feature matching.

We establish a three-stage pipeline: 1) Semantic segmentation using VGG16-Unet extracts building regions from camera drone images, followed by morphological refinement; 2) RGB histogram filtering with Bhattacharyya distance rapidly eliminates dissimilar satellite candidates; 3) SuperPoint feature extraction and LightGlue matching perform precise alignment. For building segmentation, morphological operations address common artifacts:

$$ \text{Morphological Processing} = \text{FillHoles}(\text{Mask}) + \text{RemoveSmallRegions}(\text{Mask}, \tau) $$

where $\tau$ denotes the area threshold (e.g., 10,000 pixels for 1024×1024 images). Color similarity between a UAV image and each satellite candidate is quantified using normalized RGB histograms $H_1$ and $H_2$ with $n$ bins each:

$$ BC(H_1, H_2) = \sum_{i=1}^{n} \sqrt{H_1(i) \cdot H_2(i)} $$
$$ BD(H_1, H_2) = -\ln \left( BC(H_1, H_2) \right) $$
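The two formulas above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names, the flat-list histogram representation, and the `eps` clamp (which keeps the logarithm finite when the histograms do not overlap) are assumptions, and the default threshold of 0.5 mirrors the cutoff applied by the filtering stage.

```python
import math


def normalize(hist):
    """Scale a histogram so that its bins sum to 1."""
    total = sum(hist)
    if total <= 0:
        raise ValueError("histogram must have positive total mass")
    return [h / total for h in hist]


def bhattacharyya_distance(h1, h2, eps=1e-12):
    """Return BD = -ln(BC) for two histograms with the same number of bins."""
    if len(h1) != len(h2):
        raise ValueError("histograms must have the same number of bins")
    p, q = normalize(h1), normalize(h2)
    # Bhattacharyya coefficient: 1 for identical histograms, 0 for disjoint ones.
    bc = sum(math.sqrt(a * b) for a, b in zip(p, q))
    # Clamp to [eps, 1] so rounding errors and empty overlap stay well defined.
    bc = min(max(bc, eps), 1.0)
    return -math.log(bc)


def keep_candidate(h_drone, h_sat, threshold=0.5):
    """Keep a satellite candidate only if its color distance is within the cutoff."""
    return bhattacharyya_distance(h_drone, h_sat) <= threshold
```

Identical histograms give a distance of zero, and the distance grows as the color distributions diverge, so a larger threshold keeps more candidates for the feature-matching stage.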

Satellite candidates with $BD > 0.5$ are discarded before feature-point processing. SuperPoint detects keypoints in the remaining candidates, while LightGlue’s adaptive confidence classifier enables efficient matching:

$$ \text{LightGlue} = \underset{\text{Matches}}{\mathrm{argmax}} \left( \text{Sinkhorn}\left( \text{GNN}_{\text{light}}(F_{\text{drone}}, F_{\text{sat}}) \right) \right) $$

where $F_{\text{drone}}$ and $F_{\text{sat}}$ denote feature descriptors from the camera drone and satellite images, respectively. LightGlue’s pruning mechanism reduces computational load by progressively eliminating low-confidence points.
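The coarse-to-fine control flow of stages 2 and 3 can be sketched as follows. This is an illustrative outline under stated assumptions, not the authors' code: `color_distance` and `count_matches` are hypothetical callables standing in for the histogram comparison and for SuperPoint + LightGlue matching, and ranking the surviving candidates by match count is an assumption about how the final result is selected.

```python
from typing import Callable, Hashable, Iterable, List, Tuple


def coarse_to_fine_rank(
    candidate_ids: Iterable[Hashable],
    color_distance: Callable[[Hashable], float],
    count_matches: Callable[[Hashable], int],
    bd_threshold: float = 0.5,
) -> List[Tuple[Hashable, int]]:
    """Filter candidates by color distance, then rank survivors by match count."""
    # Cheap stage: drop candidates whose color distance exceeds the threshold.
    survivors = [c for c in candidate_ids if color_distance(c) <= bd_threshold]
    # Expensive stage: run the feature matcher only on the survivors.
    scored = [(c, count_matches(c)) for c in survivors]
    # Assumed ranking rule: more matched keypoints means a better candidate.
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored
```

Because the expensive matcher is called only on candidates that pass the color filter, the cost of stage 3 scales with the number of survivors rather than with the size of the satellite database.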

| Method | Recall@1 (%) | AP (%) | Training Required |
| --- | --- | --- | --- |
| Contrastive Loss | 52.39 | 57.44 | Yes |
| Triplet Loss | 55.18 | 59.97 | Yes |
| Instance Loss | 58.23 | 62.91 | Yes |
| LPN | 75.93 | 79.14 | Yes |
| Ours (Camera Drone) | 71.24 | 76.03 | No |
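For readers unfamiliar with the two retrieval metrics in the table, the sketch below shows one common way to compute them from ranked candidate lists. It uses the standard non-interpolated definition of average precision, which is an assumption here: the benchmark's official evaluation script may differ in detail. The function names and the dictionary-based data layout are likewise hypothetical.

```python
def recall_at_k(rankings, ground_truth, k=1):
    """Fraction of queries with at least one true match among the top-k results.

    rankings: dict mapping query id -> list of candidate ids, best first.
    ground_truth: dict mapping query id -> set of correct candidate ids.
    """
    hits = sum(
        1
        for query, ranking in rankings.items()
        if any(candidate in ground_truth[query] for candidate in ranking[:k])
    )
    return hits / len(rankings)


def average_precision(ranking, relevant):
    """Non-interpolated AP for one query: mean precision at each relevant hit."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, candidate in enumerate(ranking, start=1):
        if candidate in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(rankings, ground_truth):
    """Average the per-query AP over all queries."""
    aps = [average_precision(r, ground_truth[q]) for q, r in rankings.items()]
    return sum(aps) / len(aps)
```

Recall@1 rewards only the top-ranked result, whereas AP also credits correct candidates that appear lower in the list, which is why the two columns can differ by several points.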

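Evaluated on University-1652 (37,855 UAV images and 951 satellite images), our training-free approach reaches 71.24% Recall@1 and 76.03% AP. This is within five points of the trained LPN baseline (75.93% Recall@1, 79.14% AP) and ahead of the contrastive, triplet, and instance-loss baselines. The hierarchical design reduces processing time by 45% compared to full feature matching: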

| Processing Stage | Avg. Time (s) | Time Reduction |
| --- | --- | --- |
| Full Feature Matching (Brute-force) | 13.7 | 0% |
| Semantic Segmentation + Morphology | 0.9 | – |
| RGB Histogram Filtering | 1.5 | – |
| SuperPoint + LightGlue | 5.1 | – |
| Total (Camera UAV Pipeline) | 7.5 | 45.3% |
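As a check on the table, the pipeline total and the reported reduction follow directly from the per-stage averages:

$$ T_{\text{pipeline}} = 0.9 + 1.5 + 5.1 = 7.5\ \text{s}, \qquad \frac{13.7 - 7.5}{13.7} = \frac{6.2}{13.7} \approx 0.453 = 45.3\% $$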

Three advantages matter most during UAV deployment: 1) Morphological processing increases building mask coherence by 18.7%, as measured by boundary continuity metrics; 2) RGB filtering discards 30% of satellite candidates before compute-intensive matching; 3) LightGlue’s pruning reduces the number of feature points by 40–60% in simple scenarios. This efficiency enables real-time operation on embedded camera drone systems where computational resources are constrained.
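The morphological refinement credited in point 1 (hole filling followed by small-region removal) can be sketched in plain Python. This is a minimal illustration under several assumptions: the mask is a list of lists of 0/1 values, regions are 4-connected, holes are filled before small regions are removed, and all function names are hypothetical. A practical implementation would more likely rely on an image-processing library such as SciPy or OpenCV.

```python
from collections import deque

NEIGHBORS_4 = ((1, 0), (-1, 0), (0, 1), (0, -1))


def _flood(mask, seen, start, value):
    """Collect all pixels 4-connected to `start` that equal `value`."""
    h, w = len(mask), len(mask[0])
    queue = deque([start])
    seen[start[0]][start[1]] = True
    region = []
    while queue:
        r, c = queue.popleft()
        region.append((r, c))
        for dr, dc in NEIGHBORS_4:
            nr, nc = r + dr, c + dc
            if (0 <= nr < h and 0 <= nc < w
                    and not seen[nr][nc] and mask[nr][nc] == value):
                seen[nr][nc] = True
                queue.append((nr, nc))
    return region


def fill_holes(mask):
    """Set to 1 every background pixel that is not connected to the border."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            on_border = r in (0, h - 1) or c in (0, w - 1)
            if on_border and mask[r][c] == 0 and not seen[r][c]:
                _flood(mask, seen, (r, c), 0)  # marks reachable background
    # Background pixels never reached from the border are enclosed holes.
    return [[1 if mask[r][c] == 1 or not seen[r][c] else 0 for c in range(w)]
            for r in range(h)]


def remove_small_regions(mask, min_area):
    """Keep only foreground regions with at least `min_area` pixels."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    out = [[0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            if mask[r][c] == 1 and not seen[r][c]:
                region = _flood(mask, seen, (r, c), 1)
                if len(region) >= min_area:
                    for rr, cc in region:
                        out[rr][cc] = 1
    return out


def refine_mask(mask, min_area):
    """Assumed order: fill holes first, then drop regions below the threshold."""
    return remove_small_regions(fill_holes(mask), min_area)
```

Filling holes first means that a building whose interior was partly missed by the segmenter is measured at its full area before the threshold is applied, which makes it less likely to be discarded as noise.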

For practical UAV applications, our method demonstrates robustness against perspective variations exceeding 60° in pitch and yaw. When camera drones capture low-altitude imagery under tree occlusion, building segmentation maintains 89.2% accuracy versus 76.4% for whole-image approaches. The pipeline processes each UAV image in 0.4 s on average during satellite database queries, making it suitable for emergency response scenarios.

Future work will integrate lightweight transformer modules to handle extreme viewpoint differences and extend the framework to oblique aerial imagery. The current implementation provides a viable solution for camera drone localization in GNSS-compromised environments without requiring curated training data, lowering deployment barriers for urban monitoring and disaster assessment missions.