Accurate Estimation Method for UAV Drone Camera Pose Considering Surveillance Scenes

The rapid proliferation of Unmanned Aerial Vehicle (UAV) drone technology has revolutionized numerous fields, offering unprecedented capabilities for remote sensing, data acquisition, and real-time monitoring. UAV drones are equipped with high-resolution cameras and robust communication systems, enabling the rapid capture and transmission of vast amounts of visual and spatial information. This makes them invaluable tools for applications such as natural resource supervision, infrastructure inspection, agricultural monitoring, traffic management, and disaster response. However, a significant challenge persists in conventional UAV drone surveillance applications, especially during multi-scene synchronous monitoring: the spatial disconnection between live video feeds and the broader geographical context. Video feeds are often isolated points of observation, lacking integration with a comprehensive geospatial framework.

Integrating live UAV drone surveillance video with realistic 3D maps or photorealistic scenes can dramatically enhance situational awareness. This fusion provides intuitive spatial context, allowing operators to understand not just what is happening within the camera’s field of view, but also where it is happening relative to surrounding terrain, buildings, and infrastructure. The core technical challenge enabling this real-time fusion is achieving precise spatial registration between the dynamic video stream and the static geospatial data. This registration fundamentally depends on accurately determining the six-degree-of-freedom (6DoF) pose—position (X, Y, Z) and orientation (Heading, Pitch, Roll)—of the UAV drone’s camera for each video frame.

While some existing methods rely heavily on real-time Position and Orientation System (POS) data streamed directly from the UAV drone’s flight controller and onboard GNSS/IMU, this approach has notable limitations. For stationary or hovering UAV drones, the provided POS data typically refers to the antenna’s phase center, not the camera’s projection center, introducing a lever-arm offset. Furthermore, the angular orientation data can be insufficiently accurate for pixel-level registration. Other methods attempt to solve the camera pose solely through feature matching between a video frame and a reference image, but these can struggle with large viewpoint differences or repetitive textures common in aerial views. Therefore, there is a pressing need for a robust, accurate, and hardware-agnostic method to estimate UAV drone camera pose, minimizing dependency on high-quality direct sensor feeds.

This paper presents a novel, scene-adaptive method for the precise estimation of a UAV drone’s camera pose. The core innovation lies in a two-stage, coarse-to-fine estimation strategy that leverages the visual content of the surveillance scene itself. The method first performs an initial coarse alignment using traditional feature matching techniques to obtain approximate pose parameters. Subsequently, it refines this estimate by constructing a localized multi-view scene manifold and employing a highly efficient hash-based image similarity search to identify the optimal pose. Based on this methodological framework, a prototype system for fusing UAV drone surveillance video with realistic 3D maps was developed and evaluated.

Methodology for UAV Drone Camera Pose Estimation

The proposed methodology is designed to operate with minimal reliance on precise, real-time telemetry from the UAV drone. It assumes the availability of a georeferenced image map or a 3D model of the area under surveillance and a live video feed from a stationary or slow-moving UAV drone. The process involves two sequential stages: coarse pose initialization and fine pose refinement.

The initial stage addresses the problem of large initial misalignment. Given an approximate geographical location from the UAV drone’s GNSS (even if inaccurate), a corresponding view can be rendered from the geospatial database. Let $I_{map}$ be this reference image from the map/model, and $I_{video}^t$ be a keyframe extracted from the live UAV drone video at time $t$. The coarse alignment solves for an initial camera pose $P_{coarse} = [X_c, Y_c, Z_c, H_c, P_c, R_c]^T$. This is achieved through a classic feature-based pipeline:

Feature Detection & Description: Scale-Invariant Feature Transform (SIFT) is applied to both $I_{map}$ and $I_{video}^t$ to detect and describe keypoints. SIFT provides robustness to scale and rotation changes, which are common when matching an aerial map to a potentially tilted UAV drone perspective.
Feature Matching: A Fast Library for Approximate Nearest Neighbors (FLANN) matcher is used to find putative correspondences between the two sets of descriptors.
Outlier Rejection: The Random Sample Consensus (RANSAC) algorithm is employed with a fundamental-matrix or homography model to robustly identify and remove incorrect matches (outliers), yielding a set of reliable inlier point correspondences $\{(p_{map}^i, p_{video}^i) \mid i=1,\ldots,n\}$.
Perspective-n-Point (PnP) Solution: Using a minimum of 4 non-coplanar 3D-2D correspondences (where the 3D points are known from the map/model), the Efficient Perspective-n-Point (EPnP) algorithm computes $P_{coarse}$. EPnP is chosen for its efficiency and stability. It works by expressing 3D points as a weighted sum of four non-coplanar virtual control points in the world coordinate system. The projection equations for a 3D point $P_w^i$ (world coordinates) to a 2D image point $p^i$ are:
$$ w_i \begin{bmatrix} u_i \\ v_i \\ 1 \end{bmatrix} = \mathbf{K} \sum_{j=1}^{4} \alpha_{ij} \begin{bmatrix} x_j^c \\ y_j^c \\ z_j^c \end{bmatrix} $$
where $w_i$ is a projective depth, $\mathbf{K}$ is the camera intrinsic matrix, $(x_j^c, y_j^c, z_j^c)$ are the coordinates of the $j$-th virtual control point in the camera coordinate system, and $\alpha_{ij}$ are the barycentric coordinates. Solving for the camera coordinates of the control points and then aligning them to their known world coordinates yields the final rotation $\mathbf{R}$ and translation $\mathbf{t}$ of $P_{coarse}$.
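The control-point form of the projection equation can be checked numerically. The following is a minimal sketch, not the EPnP solver itself: it assumes a simple, hypothetical choice of control points (one origin plus three unit axis offsets), hypothetical intrinsics $\mathbf{K}$, and a hypothetical nadir-looking pose, and confirms that projecting through the weighted control points gives the same pixel as projecting the 3D point directly.

```python
# Sketch of the control-point projection; every numeric value is hypothetical.

def mat_vec(M, v):
    """Multiply a 3x3 matrix (list of rows) by a 3-vector."""
    return [sum(M[r][c] * v[c] for c in range(3)) for r in range(3)]

def to_camera(R, t, p_world):
    """Rigid transform from world to camera coordinates: R p + t."""
    rp = mat_vec(R, p_world)
    return [rp[i] + t[i] for i in range(3)]

def barycentric(p_world, origin):
    """Weights alpha_j of p_world for the assumed control points
    origin, origin + e_x, origin + e_y, origin + e_z (they sum to 1)."""
    d = [p_world[i] - origin[i] for i in range(3)]
    return [1.0 - sum(d)] + d

def project_via_control_points(K, ctrl_cam, alphas):
    """Evaluate w [u, v, 1]^T = K * sum_j alpha_j c_j^c and return (u, v)."""
    p_cam = [sum(a * c[i] for a, c in zip(alphas, ctrl_cam)) for i in range(3)]
    wu, wv, w = mat_vec(K, p_cam)
    return wu / w, wv / w

K = [[1000.0, 0.0, 960.0], [0.0, 1000.0, 540.0], [0.0, 0.0, 1.0]]
R = [[1.0, 0.0, 0.0], [0.0, -1.0, 0.0], [0.0, 0.0, -1.0]]  # camera looking straight down
t = [0.0, 0.0, 100.0]

origin = [10.0, 20.0, 0.0]
ctrl_world = [origin, [11.0, 20.0, 0.0], [10.0, 21.0, 0.0], [10.0, 20.0, 1.0]]
ctrl_cam = [to_camera(R, t, c) for c in ctrl_world]

p_world = [12.0, 18.0, 3.0]
alphas = barycentric(p_world, origin)
uv_ctrl = project_via_control_points(K, ctrl_cam, alphas)

# Direct pinhole projection of the same point, for comparison.
x, y, z = to_camera(R, t, p_world)
uv_direct = (K[0][0] * x / z + K[0][2], K[1][1] * y / z + K[1][2])

print(alphas, uv_ctrl, uv_direct)
```

The two projections agree because the weights sum to one and a rigid transform preserves such weighted sums. This is what allows EPnP to solve for only the four control points in camera coordinates rather than for every 3D point.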

The resulting $P_{coarse}$ provides a vital initial alignment but often retains residual error that is too large for seamless visual fusion, especially at high resolutions or when the UAV drone’s view is highly oblique.

The second stage aims to correct this residual error. The key insight is that the true camera pose $P_{true}$ lies in a local neighborhood around $P_{coarse}$ in the pose parameter space. We construct a discrete sampling of this neighborhood, a “Multi-View Proximity Set” $S$. This set is generated by systematically perturbing each of the six parameters of $P_{coarse}$ within plausible error bounds (e.g., ±5 meters in X,Y, ±2 meters in Z, ±5 degrees in angles). For each perturbed pose parameter vector $P_k \in S$, a synthetic view $I_k$ is rendered from the 3D map/model.
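One way to build such a set is a regular grid of offsets around $P_{coarse}$. The sketch below is illustrative only: the helper names, the local metric frame, and the number of steps per parameter are assumptions, and the source does not state how its sampling was actually laid out.

```python
from itertools import product

PARAMS = ("X", "Y", "Z", "H", "P", "R")

def offsets(bound, steps):
    """Evenly spaced offsets in [-bound, +bound]; one step means no perturbation."""
    if steps == 1:
        return [0.0]
    return [-bound + 2.0 * bound * i / (steps - 1) for i in range(steps)]

def proximity_set(p_coarse, bounds, steps):
    """Return every pose obtained by adding one offset per parameter to p_coarse."""
    grids = [offsets(bounds[k], steps[k]) for k in PARAMS]
    return [
        {k: p_coarse[k] + d for k, d in zip(PARAMS, combo)}
        for combo in product(*grids)
    ]

# Hypothetical coarse pose in a local metric frame (meters, degrees).
p_coarse = {"X": 0.0, "Y": 0.0, "Z": 100.0, "H": -18.8, "P": -89.9, "R": 0.0}
bounds = {"X": 5.0, "Y": 5.0, "Z": 2.0, "H": 5.0, "P": 5.0, "R": 5.0}
steps = {"X": 3, "Y": 3, "Z": 1, "H": 3, "P": 1, "R": 1}  # assumed sampling density

S = proximity_set(p_coarse, bounds, steps)
print(len(S))  # 3 * 3 * 1 * 3 * 1 * 1 = 27 candidate poses
```

The size of the set is the product of the per-parameter step counts, so it grows quickly when all six parameters are sampled. Using an odd number of steps keeps $P_{coarse}$ itself among the candidates.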

To efficiently find the pose $P_{opt} \in S$ whose synthetic view $I_{opt}$ best matches the UAV drone video frame $I_{video}^t$, we employ a hash-based similarity measure. Direct pixel-by-pixel or feature-based comparison across hundreds of images is computationally prohibitive for real-time applications. Instead, we use an Average Hash (aHash) algorithm:

Resize both $I_{video}^t$ and all $I_k$ to a small, fixed size (e.g., 8×8 pixels).
Convert the images to grayscale.
Compute the mean pixel intensity $\mu$.
Generate a 64-bit hash $H$ where each bit $b_i$ is:
$$ b_i = \begin{cases} 1 & \text{if } pixel_i \geq \mu \\ 0 & \text{if } pixel_i < \mu \end{cases} $$
This hash is a perceptual fingerprint of the image’s overall luminance structure.
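The four steps above can be sketched in a few lines of standard-library Python. The sketch assumes the input is already a grayscale image stored as a list of rows whose dimensions are multiples of 8, and it uses simple block averaging for the resize. Both are simplifications, and the bit order (row-major, first pixel as the most significant bit) is an arbitrary choice.

```python
def resize_box(gray, size=8):
    """Downsample a grayscale image (list of rows) to size x size by block averaging.
    Assumes the height and width are multiples of `size`."""
    h, w = len(gray), len(gray[0])
    bh, bw = h // size, w // size
    return [
        [
            sum(gray[r * bh + i][c * bw + j] for i in range(bh) for j in range(bw))
            / (bh * bw)
            for c in range(size)
        ]
        for r in range(size)
    ]

def average_hash(gray, size=8):
    """Return the aHash as an integer of size*size bits (row-major, first pixel = MSB)."""
    small = resize_box(gray, size)
    pixels = [p for row in small for p in row]
    mu = sum(pixels) / len(pixels)          # mean intensity
    value = 0
    for p in pixels:
        value = (value << 1) | (1 if p >= mu else 0)
    return value

# Hypothetical 16x16 image: left half dark, right half bright.
img = [[0] * 8 + [255] * 8 for _ in range(16)]
h = average_hash(img)
print(f"{h:016x}")  # 0f0f0f0f0f0f0f0f
```

Each row of the reduced image contributes one byte, so the half-dark, half-bright test image yields the repeating pattern `0f`.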

The similarity $sim$ between the video frame hash $H_{video}$ and a synthetic image hash $H_k$ is computed using the Hamming distance $D_H$, which counts the number of differing bits:
$$ sim(H_{video}, H_k) = 1 - \frac{D_H(H_{video}, H_k)}{64} $$
A similarity threshold $\tau$ (e.g., 0.85) is set to filter out poor matches. The pose corresponding to the synthetic image with the highest similarity score above $\tau$ is selected as the refined pose:
$$ P_{refined} = \arg\max_{P_k \in S} sim(H_{video}, H_k), \quad \text{subject to } sim(H_{video}, H_k) > \tau $$
This process allows rapid registration refinement by searching a pre-computed, hash-indexed set of nearby views; the attainable accuracy is bounded by the sampling density of $S$.
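The similarity measure and the thresholded selection can be sketched as follows. The hash values and pose labels are hypothetical, and returning `None` when no candidate exceeds $\tau$ is an assumption, since the text does not specify the fallback behavior.

```python
def hamming(h1, h2):
    """Number of differing bits between two integer hashes."""
    return bin(h1 ^ h2).count("1")

def similarity(h1, h2, bits=64):
    """sim = 1 - D_H / bits."""
    return 1.0 - hamming(h1, h2) / bits

def select_refined(h_video, candidates, tau=0.85):
    """candidates: list of (pose, hash) pairs.
    Return (pose, sim) for the highest similarity above tau, or None (assumed fallback)."""
    best = None
    for pose, h in candidates:
        s = similarity(h_video, h)
        if s > tau and (best is None or s > best[1]):
            best = (pose, s)
    return best

h_video = 0x0F0F0F0F0F0F0F0F
candidates = [
    ("pose_a", 0xFFFFFFFF0F0F0F0F),  # 16 differing bits -> 0.75, below tau
    ("pose_b", 0x0F0F0F0F0F0F0F00),  # 4 differing bits  -> 0.9375
    ("pose_c", 0x0F0F0F0F0F0F00FF),  # 8 differing bits  -> 0.875
]

print(select_refined(h_video, candidates))  # ('pose_b', 0.9375)
```

With 64-bit hashes the similarity can only take values in steps of 1/64, so a score of 0.9375 corresponds to exactly four differing bits. Ties between neighboring poses are therefore likely, and this sketch keeps only the first of any tied candidates.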

Experimental Analysis and System Implementation

To validate the proposed UAV drone camera pose estimation method, a prototype fusion system was developed using the Cesium JavaScript framework for 3D geospatial visualization. The system integrates the two-stage pose estimation pipeline, enabling real-time overlay of a UAV drone’s video feed onto a high-resolution 3D textured mesh model (Digital Twin) of the test area.

Experimental Setup

The test area comprised a complex of university buildings, providing rich visual texture and geometric structure. A DJI Phantom 4 RTK UAV drone was used as the aerial platform. This UAV drone is equipped with a 20-megapixel camera capable of recording 4K video at 30 fps. For the experiment, the UAV drone was commanded to hover at a fixed location, simulating a stationary surveillance point. The 3D model was constructed from prior oblique aerial photography and serves as the precise geospatial reference. The following table summarizes the core parameters of the experiment.

Table 1: Experimental Configuration Parameters
Component	Specification / Detail
UAV Drone Platform	DJI Phantom 4 RTK
Camera Resolution	3840 x 2160 pixels (4K)
Test Area	University campus building complex
Geospatial Reference	Oblique photogrammetry-derived 3D mesh model
Coarse Match Algorithm	SIFT + FLANN + RANSAC + EPnP
Refinement Set Size (|S|)	216 synthetically rendered viewpoints
Hash Algorithm	Average Hash (aHash) on 8×8 grayscale images
Similarity Threshold (τ)	0.85

Results and Quantitative Evaluation

The first stage (coarse alignment) successfully initialized the camera pose. The calculated $P_{coarse}$ brought the video frame and the 3D model into rough alignment, but clear parallax and positional errors were visible, particularly at the edges of structures and ground-level features like the pavement-stair interface. The refinement stage was then executed. The hash similarity scores between the live UAV drone video keyframe and all 216 synthetic views in set $S$ were computed. The distribution of these scores is critical for assessing the discriminative power of the method.

The highest similarity scores clustered around a specific index range, with three adjacent poses achieving the maximum score of 0.9375, well above the threshold $\tau=0.85$. The refined pose $P_{refined}$ was taken as the average of these top-performing poses. The quantitative improvement is evident in the comparison of the pose parameters before and after refinement.

Table 2: Comparison of Estimated UAV Drone Camera Pose Parameters
Pose Parameter	Coarse Pose ($P_{coarse}$)	Refined Pose ($P_{refined}$)	Change
X (Longitude, °)	103.983 772	103.983 922	+0.000 150
Y (Latitude, °)	30.769 251	30.769 301	+0.000 050
Z (Altitude, m)	508.4	508.4	0.0
Heading (Yaw, °)	-18.8	-21.8	-3.0
Pitch (°)	-89.9	-89.9	0.0
Roll (°)	0.0	0.0	0.0

The most significant correction was a 3.0° adjustment in the Heading angle, which is crucial for aligning linear features like building edges. The visual impact of this refinement is profound. In the coarse fusion, the overlaid video appears slid or sheared relative to the model. After refinement, the video texture snaps precisely onto the 3D geometry; for instance, the lines of pavement joints align seamlessly with the model’s terrain, and the video feed of building facades aligns correctly with the underlying 3D mesh.

To quantify the positional accuracy gain, eight distinct, well-defined feature points (P1 to P8) visible in both the video and the 3D model (e.g., corners of pavement tiles, bases of lampposts) were selected. The ground truth positions of these points were manually identified in the high-resolution 3D model. The projected screen coordinates of these 3D points were calculated using both $P_{coarse}$ and $P_{refined}$, and their offsets from the true video feature locations were measured in pixels and converted to meters using the ground sampling distance.
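The pixel-to-meter conversion can be illustrated with the usual nadir-view approximation of ground sampling distance. The camera and flight values below are hypothetical placeholders, not the parameters used in the experiment, and the formula assumes flat ground and a camera pointing straight down.

```python
import math

def ground_sampling_distance(height_m, focal_mm, sensor_width_mm, image_width_px):
    """Approximate GSD in meters per pixel for a nadir view over flat ground."""
    return (sensor_width_mm * height_m) / (focal_mm * image_width_px)

def offset_in_meters(du_px, dv_px, gsd):
    """Convert a 2D pixel offset into a ground distance in meters."""
    return math.hypot(du_px, dv_px) * gsd

# Hypothetical values: 100 m above ground, 8.8 mm lens, 13.2 mm sensor width, 3840 px frame.
gsd = ground_sampling_distance(100.0, 8.8, 13.2, 3840)
err = offset_in_meters(30.0, 40.0, gsd)

print(gsd, err)  # roughly 0.039 m/px and 1.95 m for a 50-pixel offset
```

Under these assumed values a 50-pixel misregistration corresponds to roughly two meters on the ground. For oblique views the GSD varies across the frame, so a single scale factor would only be an approximation.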

Table 3: Accuracy Assessment: Feature Point Offset Errors
Feature Point	Offset with Coarse Pose (m)	Offset with Refined Pose (m)	Error Reduction (%)
P1	2.34	0.22	90.6
P2	1.82	0.23	87.4
P3	1.63	0.17	89.6
P4	1.99	0.15	92.5
P5	1.68	0.15	91.1
P6	1.37	0.16	88.3
P7	1.76	0.18	89.8
P8	2.42	0.21	91.3
Average	1.88	0.18	90.4
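The error-reduction column follows from the two offset columns as $(1 - e_{refined}/e_{coarse}) \times 100$. A short check, using only the values printed in Table 3, is sketched below.

```python
coarse = [2.34, 1.82, 1.63, 1.99, 1.68, 1.37, 1.76, 2.42]   # offsets with coarse pose (m)
refined = [0.22, 0.23, 0.17, 0.15, 0.15, 0.16, 0.18, 0.21]  # offsets with refined pose (m)

def reduction_pct(before, after):
    """Relative error reduction in percent."""
    return (1.0 - after / before) * 100.0

per_point = [round(reduction_pct(c, r), 1) for c, r in zip(coarse, refined)]
mean_coarse = sum(coarse) / len(coarse)
mean_refined = sum(refined) / len(refined)
overall = reduction_pct(mean_coarse, mean_refined)

print(per_point)
print(round(mean_coarse, 2), round(mean_refined, 2), round(overall, 1))
```

The per-point values reproduce the table. The overall figure depends on rounding: computed from the unrounded means it is about 90.2%, while the tabulated 90.4% matches a computation from the rounded means 1.88 and 0.18.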

In this experiment, the results show a clear improvement. The coarse pose estimation yielded an average registration error of 1.88 meters, which is too large for detailed monitoring or annotation tasks. The proposed refinement stage reduced this error by over 90%, to an average of 0.18 meters (sub-meter, roughly two decimeters). This level of precision enables a visually seamless and spatially accurate fusion of the live UAV drone feed with the 3D environment.

Conclusion

This paper has presented and validated a novel, two-stage method for the accurate estimation of a UAV drone’s camera pose, specifically designed for enhancing the fusion of surveillance video with realistic 3D maps. The method strategically decouples the problem: first, a robust feature-based coarse alignment corrects for large initial displacements, and second, an efficient hash-based similarity search within a local pose manifold refines the alignment, reaching an average registration error of about 0.18 m in the reported experiment. The core advantage of this approach is its reduced dependency on highly accurate, real-time telemetry from the UAV drone’s flight controller, making it more versatile and applicable to a wider range of platforms and scenarios.

The experimental implementation within a Cesium-based fusion system demonstrated the method’s practical efficacy. Quantitative evaluation showed that the refinement stage consistently reduced spatial registration errors from an average of nearly two meters to below twenty centimeters—a critical improvement for applications requiring precise geo-localization of video content. The integration of the live UAV drone video into the digital twin of the environment was visually seamless, significantly enhancing spatial context and operational awareness. Future work will focus on optimizing the hash algorithm for different environmental conditions (e.g., lighting changes, seasonal variations), extending the method to dynamic UAV drone trajectories, and integrating deep learning-based features to improve the robustness and discriminative power of the coarse matching stage. This methodology provides a solid and effective technical foundation for next-generation integrated surveillance and spatial data visualization systems powered by UAV drone technology.