Online Depth Estimation and Mesh Reconstruction for UAV Imagery Integrating TSDF and Global Planar Surfaces

Abstract

We propose an online depth estimation and mesh reconstruction method for UAVs that integrates a truncated signed distance function (TSDF) with global planar priors. The method first raycasts downsampled prior depth and normal maps from the TSDF field fused from historical frames, which enables planar homography-based multi-view matching cost computation and improves accuracy in regions with large disparity variations. A cost ratio-based strategy is then applied during cost fusion to suppress mismatched costs in occluded areas, making the fused matching cost more reliable. Subsequently, a semi-global cost aggregation algorithm yields an initial downsampled depth map. To obtain high-resolution depth maps, we incrementally fuse global planar priors with the initial depth to optimize the matching cost of the original-resolution images, which improves depth accuracy in low-texture and repetitive-texture regions. The initial depth also constrains the depth search range, reducing the time spent on cost calculation and aggregation. Finally, the optimized depth map is incrementally fused into the TSDF field, from which the surface mesh is extracted. Experiments on four typical UAV imagery datasets show that the proposed method generates accurate and complete depth maps and meshes. The mean absolute error of the depth maps and meshes is at most about 0.232 m and 0.196 m, respectively, with the largest values occurring on the Mountain dataset. On that dataset, the errors are 19.2% and 19.3% lower than those of Mobile3DRecon, the more accurate of the two baselines there. The average per-frame reconstruction time is reduced by 39.9% and 32.1% compared to PatchmatchNet and Mobile3DRecon, respectively, and each frame is reconstructed in under one second. The method improves depth and mesh quality in low-texture, repetitive-texture, and large-disparity regions, and meets the requirements of online UAV imaging for applications such as natural disaster emergency response.

1. Introduction

UAVs have become essential platforms for acquiring 3D spatial information in photogrammetry and remote sensing due to their low cost and operational flexibility. They are widely used in high-precision mapping, smart city planning, and disaster emergency response. Traditional offline 3D reconstruction with commercial software can achieve centimeter-level accuracy but is time-consuming and cannot meet the real-time demands of emergency tasks. Online 3D reconstruction based on multi-view depth maps offers a promising alternative.

Most existing online methods rely on RGB-D or laser sensors, which are costly and difficult to deploy on UAVs. Gallup et al. proposed plane-sweeping stereo for real-time depth estimation, but its results are noisy and incomplete. Ondrúška et al. used multi-frame block matching followed by TSDF fusion and marching cubes, yet the depth quality is limited. Mobile3DRecon achieved real-time monocular reconstruction on mobile devices using semi-global matching and incremental TSDF fusion, but it degrades in low-texture and large-disparity areas. Scharstein et al. showed that incorporating surface orientation priors improves semi-global matching, but obtaining such priors online remains challenging. Roth et al. introduced normal optimization by iteratively fitting planes, which helps in wide-baseline scenarios but is computationally expensive. Deep learning methods such as PatchmatchNet balance efficiency and accuracy but are sensitive to training data and often struggle with large disparities and weak textures.

To address these challenges, we propose a method that leverages the TSDF field from historical frames to generate prior depth and normals via raycasting. These priors guide the computation of planar homography-based matching costs, improving accuracy in large-disparity regions. We also design an occlusion-aware cost fusion strategy to reject incorrect matches. Furthermore, global planar surfaces are incrementally fused to optimize the original-resolution matching cost, enhancing depth in low-texture areas while using the initial depth to constrain the search range for efficiency. The refined depth is then fused into the TSDF field for incremental mesh reconstruction. Experiments on four typical UAV datasets validate the effectiveness and efficiency of the proposed method.

2. Methodology

2.1 Downsampled Depth Map Computation

The overall framework consists of three main parts: downsampled depth map computation, global planar-guided depth optimization, and incremental mesh reconstruction. To achieve an online rate better than one second per frame (UAV cameras typically capture images at intervals longer than 1 s), the depth estimation and TSDF fusion modules run in parallel. For a new frame requiring depth estimation, the prior depth and normal maps are raycast from the most recently fused TSDF field using the frame’s pose. Because the two modules run in parallel, the raycast maps may be incomplete, but this avoids the waiting time that serial execution would introduce.

To compute multi-view matching costs, we adopt a cost volume computation method in depth space similar to Yang et al. The search range $ [z_{\min}, z_{\max}] $ is discretely sampled with $ L = 64 $ levels. The depth value at the $ l $-th sample is given by:

$$
z_l = \frac{z_{\min} z_{\max}}{z_{\min} + l (z_{\max} - z_{\min}) / (L-1)}, \quad l=0,1,\dots,L-1
$$
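The sampling rule above spaces the hypotheses uniformly in inverse depth, running from $ z_{\max} $ at $ l=0 $ to $ z_{\min} $ at $ l=L-1 $. A minimal sketch follows; the function name `sample_depths` and the 50 m to 150 m range are illustrative assumptions, not values from the experiments.

```python
def sample_depths(z_min, z_max, num_levels=64):
    """Return depth hypotheses z_l for l = 0 .. num_levels - 1.

    Implements z_l = z_min * z_max / (z_min + l * (z_max - z_min) / (L - 1)),
    which is uniform in inverse depth: index 0 gives z_max and the last
    index gives z_min.
    """
    if not (0.0 < z_min < z_max):
        raise ValueError("expected 0 < z_min < z_max")
    if num_levels < 2:
        raise ValueError("need at least two levels")
    step = (z_max - z_min) / (num_levels - 1)
    return [z_min * z_max / (z_min + l * step) for l in range(num_levels)]


# Hypothetical search range for a nadir flight (assumed values).
depths = sample_depths(50.0, 150.0)
inverse_steps = [1.0 / depths[l + 1] - 1.0 / depths[l] for l in range(len(depths) - 1)]
```

Because the steps are constant in $ 1/z $, the samples are denser near the camera, where a fixed depth change produces a larger pixel displacement.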

Using the camera poses estimated by ORB-SLAM3 loosely coupled with RTK, each sampled depth is projected into the neighboring reference frames to obtain the matching pixel coordinates:

$$
\begin{cases}
\mathbf{p}'_{f_j} = z_l \mathbf{h}_{f_j}^{f_i} \mathbf{p} + \Delta_{f_j}^{f_i} \\
\mathbf{h}_{f_j}^{f_i} = \mathbf{K}_{f_j} \mathbf{R}_{f_j}^T \mathbf{R}_{f_i} \mathbf{K}_{f_i}^{-1} \\
\Delta_{f_j}^{f_i} = \mathbf{K}_{f_j} \mathbf{R}_{f_j}^T (\mathbf{T}_{f_i} - \mathbf{T}_{f_j})
\end{cases}
$$
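As an illustration, the projection above can be written with plain nested lists. This sketch assumes that $ \mathbf{R} $ and $ \mathbf{T} $ are camera-to-world rotations and camera centers, that $ \mathbf{K} $ has no skew, and that the homogeneous result is divided by its third component to obtain pixel coordinates; the helper names are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def matvec(a, v):
    """Multiply a matrix by a vector."""
    return [sum(a[i][k] * v[k] for k in range(len(v))) for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


def inv_intrinsics(K):
    """Closed-form inverse of K = [[fx, 0, cx], [0, fy, cy], [0, 0, 1]] (no skew assumed)."""
    fx, fy, cx, cy = K[0][0], K[1][1], K[0][2], K[1][2]
    return [[1.0 / fx, 0.0, -cx / fx], [0.0, 1.0 / fy, -cy / fy], [0.0, 0.0, 1.0]]


def project_to_reference(p, z_l, K_i, R_i, T_i, K_j, R_j, T_j):
    """Map pixel p = (u, v) of frame f_i at depth z_l into frame f_j."""
    KRt = matmul(K_j, transpose(R_j))
    h = matmul(matmul(KRt, R_i), inv_intrinsics(K_i))      # h = K_j R_j^T R_i K_i^-1
    delta = matvec(KRt, [a - b for a, b in zip(T_i, T_j)])  # K_j R_j^T (T_i - T_j)
    q = [z_l * x + d for x, d in zip(matvec(h, [p[0], p[1], 1.0]), delta)]
    return q[0] / q[2], q[1] / q[2]


# Assumed toy setup: identical intrinsics, no rotation, 10 m baseline along x.
K = [[1000.0, 0.0, 400.0], [0.0, 1000.0, 300.0], [0.0, 0.0, 1.0]]
I3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
u_ref, v_ref = project_to_reference((400.0, 300.0), 100.0, K, I3, [0.0, 0.0, 0.0],
                                    K, I3, [10.0, 0.0, 0.0])
```

In this toy setup the principal-point pixel at 100 m depth shifts by 1000 × 10 / 100 = 100 pixels in the reference frame, which is the usual focal length × baseline / depth disparity.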

For a pixel $ \mathbf{p} $ on the current frame $ f_i $, the matching cost at sampled depth $ z_l $ is computed using a center-symmetric Census transform over an $ n \times m $ window:

$$
c_{f_j}^{f_i}(u,v,z_l) = \text{Hamming}\left( \bigotimes_s (I_{f_i}(u\pm i, v\pm j)), \bigotimes_s (I_{f_j}(u' \pm i, v' \pm j)) \right)
$$
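A center-symmetric Census descriptor compares each pixel in the window with its mirror image about the window center, and the matching cost is the Hamming distance between two such bit strings. The sketch below is one common way to implement this, assuming grayscale images stored as nested lists indexed `img[row][col]`; the 5×5 default window is an assumption, since the window size $ n \times m $ is not specified here.

```python
def cs_census(img, u, v, half_w=2, half_h=2):
    """Center-symmetric Census descriptor at column u, row v, packed into an int.

    Each bit compares a pixel with its point reflection through the center,
    so a (2*half_w+1) x (2*half_h+1) window yields half as many bits as it
    has non-center pixels.
    """
    bits = 0
    for dj in range(-half_h, half_h + 1):
        for di in range(-half_w, half_w + 1):
            # Visit each symmetric pair once: skip the center and the mirrored half.
            if (dj, di) >= (0, 0):
                continue
            a = img[v + dj][u + di]
            b = img[v - dj][u - di]
            bits = (bits << 1) | (1 if a > b else 0)
    return bits


def hamming(a, b):
    """Number of differing bits between two descriptors."""
    return bin(a ^ b).count("1")


def matching_cost(img_i, p, img_j, p_ref, half_w=2, half_h=2):
    """Census cost between pixel p in the current frame and p_ref in the reference frame."""
    return hamming(cs_census(img_i, p[0], p[1], half_w, half_h),
                   cs_census(img_j, p_ref[0], p_ref[1], half_w, half_h))


# Two tiny 3x3 test images with opposite horizontal gradients.
ramp_up = [[c for c in range(3)] for _ in range(3)]
ramp_down = [[-c for c in range(3)] for _ in range(3)]
same_cost = matching_cost(ramp_up, (1, 1), ramp_up, (1, 1), 1, 1)
flipped_cost = matching_cost(ramp_up, (1, 1), ramp_down, (1, 1), 1, 1)
```

Because only the ordering of intensities matters, the descriptor is unaffected by a brightness offset or positive gain change between the two images.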

To handle large disparity variations, we incorporate prior depth and normals from the TSDF field via raycasting. The planar prior equation for pixel $ \mathbf{p} $ is $ n_x x + n_y y + n_z z + z_l = 0 $. The homography matrix between the current frame and a reference frame is computed as:

$$
\mathbf{H}_{f_j}^{f_i}(\mathbf{p}, \mathbf{p}'_{f_j}, z_l) = \mathbf{K}_{f_j} \left( \mathbf{R}_{f_i \to f_j} - \frac{\mathbf{t}_{f_i \to f_j} \mathbf{n}_p^T}{z_l} \right) \mathbf{K}_{f_i}^{-1}
$$

The cost is then computed with image warping using this homography:

$$
c_{f_j}^{f_i}(u,v,z_l) = \text{Hamming}\left( \bigotimes_s (I_{f_i}(u\pm i, v\pm j)), \bigotimes_s (I_{f_j}[\mathbf{H}_{f_j}^{f_i}(u'\pm i, v'\pm j, 1)]) \right)
$$
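The plane-induced homography can be sketched as follows. The code assumes the plane is expressed in the current camera frame as $ \mathbf{n}^T \mathbf{X} + z_l = 0 $, matching the planar prior equation above, and that $ \mathbf{R}_{f_i \to f_j} $ and $ \mathbf{t}_{f_i \to f_j} $ map current-frame coordinates into the reference frame. The helper names and the toy camera values are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def matvec(a, v):
    return [sum(a[i][k] * v[k] for k in range(len(v))) for i in range(len(a))]


def inv_intrinsics(K):
    """Closed-form inverse of a skew-free intrinsic matrix."""
    fx, fy, cx, cy = K[0][0], K[1][1], K[0][2], K[1][2]
    return [[1.0 / fx, 0.0, -cx / fx], [0.0, 1.0 / fy, -cy / fy], [0.0, 0.0, 1.0]]


def plane_homography(K_i, K_j, R_ij, t_ij, n, z_l):
    """H = K_j (R - t n^T / z_l) K_i^-1 for the plane n^T X + z_l = 0."""
    M = [[R_ij[r][c] - t_ij[r] * n[c] / z_l for c in range(3)] for r in range(3)]
    return matmul(matmul(K_j, M), inv_intrinsics(K_i))


def warp(H, u, v):
    """Apply H to pixel (u, v) and dehomogenize."""
    x, y, w = matvec(H, [u, v, 1.0])
    return x / w, y / w


# Assumed toy setup: fronto-parallel plane at depth 100 m, written as -z + 100 = 0,
# and a reference camera displaced 10 m along +x (so t = (-10, 0, 0)).
K = [[1000.0, 0.0, 400.0], [0.0, 1000.0, 300.0], [0.0, 0.0, 1.0]]
I3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
H = plane_homography(K, K, I3, [-10.0, 0.0, 0.0], [0.0, 0.0, -1.0], 100.0)
warped_center = warp(H, 400.0, 300.0)
warped_offset = warp(H, 500.0, 300.0)
```

Warping every pixel of the Census window through the same homography lets the window follow the local surface orientation instead of assuming a fronto-parallel patch, which is the motivation for using the raycast normal.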

For pixels without raycast priors, the original cost formula is used. After obtaining the two matching costs from the previous and next neighboring frames ($ f_{i-1} $ and $ f_{i+1} $), we apply a cost ratio-based occlusion handling strategy:

$$
\begin{cases}
C(p,z_l) = w_{f_{i-1}}^{f_i} c_{f_{i-1}}^{f_i} + w_{f_{i+1}}^{f_i} c_{f_{i+1}}^{f_i} \\
\text{if } \exists c_{f_j}^{f_i} < \hat{c} \text{ and } \frac{|c_{f_{i-1}}^{f_i} - c_{f_{i+1}}^{f_i}|}{c_{f_{i-1}}^{f_i} + c_{f_{i+1}}^{f_i}} > 0.5 \\
w_{f_{i-1}}^{f_i} = 0.5 - 0.5 \times \frac{c_{f_{i-1}}^{f_i} - c_{f_{i+1}}^{f_i}}{c_{f_{i-1}}^{f_i} + c_{f_{i+1}}^{f_i}}
\end{cases}
$$

Here $ \hat{c}=5 $. This adaptive weighting shifts weight toward the frame with the lower cost, reducing the influence of views in which the pixel is occluded. The fused cost is aggregated using Hirschmüller’s semi-global optimization over 8 directions, and the optimal discrete depth index is found via Winner-Take-All (WTA) followed by parabolic interpolation:

$$
\tilde{l}_p = \arg\min_{l} \sum_{r=1}^{8} L_r(p, l)
$$
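The cost ratio-based fusion rule of this section can be sketched as below. Two details are assumptions, since the formulas do not state them: the weight of the next frame is taken as the complement $ 1 - w_{f_{i-1}}^{f_i} $, and both weights default to 0.5 when the occlusion condition is not met.

```python
def fuse_costs(c_prev, c_next, c_hat=5.0, ratio_thresh=0.5):
    """Fuse the matching costs from frames f_{i-1} and f_{i+1}.

    If at least one cost is below c_hat and the two costs differ strongly,
    the pixel is treated as occluded in the higher-cost view and that view
    is down-weighted. Otherwise the costs are averaged (assumed default).
    """
    total = c_prev + c_next
    if total == 0:
        return 0.0
    ratio = (c_prev - c_next) / total
    if min(c_prev, c_next) < c_hat and abs(ratio) > ratio_thresh:
        w_prev = 0.5 - 0.5 * ratio   # equals c_next / (c_prev + c_next)
    else:
        w_prev = 0.5
    w_next = 1.0 - w_prev            # assumed complement of w_prev
    return w_prev * c_prev + w_next * c_next


occluded_case = fuse_costs(2.0, 18.0)   # one good match, one likely occluded view
regular_case = fuse_costs(10.0, 12.0)   # similar costs, plain average
```

With costs of 2 and 18, the low-cost view receives a weight of 0.9, so the fused cost stays close to the good match instead of being pulled toward the occluded view.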

Table 1: Completeness and edge error of depth maps at different stages
Metric	Without occlusion & prior	+ occlusion optimization	Final downsampled depth
Completeness	0.8238	0.8892	0.9439
Edge error (MAE, m)	0.1730	0.1560	0.1380

As shown in Table 1, the final downsampled depth with occlusion and prior guidance reduces edge error by 20.2% compared to the baseline without optimization, and completeness improves by 14.6%.

2.2 Global Planar-Guided Depth Optimization

Although the downsampled depth map incorporates multi-view information, weak-texture and repetitive-texture regions still suffer from low accuracy. We therefore propose a global planar-guided depth optimization method. First, we extract planar information from the downsampled depth map of the current frame using Proença et al.’s method, obtaining plane equations $ (\mathbf{n}_\pi, p_\pi) $ and contour point sets $ \beta $. For each new frame, we incrementally fuse the extracted planes into a global plane database. The center distance and plane distance between a new plane and a stored plane are computed; if both are below their thresholds, the plane normal is updated via weighted averaging, and the contour points are merged by projecting the 3D points onto the plane and computing their convex hull with OpenCV’s convexHull function.
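The merge test and normal update can be sketched as follows. This is only an illustration under several assumptions: planes are stored as a unit normal, a center point, and a support weight; the center distance is the Euclidean distance between centers; the plane distance is the point-to-plane distance of the new center; and both thresholds are placeholder values. The contour merging step is omitted.

```python
import math


def normalize(v):
    """Scale a 3-vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def try_merge(global_plane, new_plane, max_center_dist=5.0, max_plane_dist=0.2):
    """Merge new_plane into global_plane if both distance tests pass.

    Each plane is a dict with 'normal' (unit vector), 'center' (3D point)
    and 'weight' (e.g. number of supporting pixels). Returns the merged
    plane, or None if the planes are judged to be different surfaces.
    """
    gc, nc = global_plane["center"], new_plane["center"]
    center_dist = math.dist(gc, nc)
    # Distance from the new center to the stored plane, measured along its normal.
    plane_dist = abs(sum(n * (a - b) for n, a, b in zip(global_plane["normal"], nc, gc)))
    if center_dist >= max_center_dist or plane_dist >= max_plane_dist:
        return None
    wg, wn = global_plane["weight"], new_plane["weight"]
    normal = normalize([wg * a + wn * b
                        for a, b in zip(global_plane["normal"], new_plane["normal"])])
    center = [(wg * a + wn * b) / (wg + wn) for a, b in zip(gc, nc)]
    return {"normal": normal, "center": center, "weight": wg + wn}


roof = {"normal": [0.0, 0.0, 1.0], "center": [0.0, 0.0, 0.0], "weight": 3.0}
nearby = {"normal": [0.0, 0.0, 1.0], "center": [1.0, 0.0, 0.05], "weight": 1.0}
distant = {"normal": [0.0, 0.0, 1.0], "center": [100.0, 0.0, 0.0], "weight": 1.0}
merged = try_merge(roof, nearby)
rejected = try_merge(roof, distant)
```

The sketch assumes both normals point to the same side of the surface; a real implementation would flip one of them first if their dot product were negative.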

The optimized matching cost is then computed as:

$$
\hat{C}(p, z_l) = C(p, z_l) \times \left(1 - e^{-(z_l - \tilde{z}_l)^2 / \lambda_z} \cdot e^{-|\mathbf{n}_\pi^T \tilde{\mathbf{n}}_p| / \lambda_n}\right)
$$

where $ \tilde{z}_l $ and $ \tilde{\mathbf{n}}_p $ are the prior depth and normal from the downsampled map, $ \mathbf{n}_\pi $ is the global plane normal, and $ \lambda_z = 0.5 $, $ \lambda_n = 0.5 $. This formulation ties the cost to both the prior depth and the global plane prior, improving smoothness and accuracy in low-texture areas.
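A direct transcription of the optimized cost is given below as a sketch. It follows the formula literally, with the normals assumed to be unit vectors; the function and argument names are illustrative.

```python
import math


def plane_guided_cost(cost, z_l, z_prior, n_plane, n_prior, lambda_z=0.5, lambda_n=0.5):
    """Scale the fused matching cost C(p, z_l) by the planar prior term.

    depth_term approaches 1 when the hypothesis z_l is close to the prior depth,
    and normal_term depends on the absolute dot product of the two normals.
    """
    depth_term = math.exp(-((z_l - z_prior) ** 2) / lambda_z)
    dot = abs(sum(a * b for a, b in zip(n_plane, n_prior)))
    normal_term = math.exp(-dot / lambda_n)
    return cost * (1.0 - depth_term * normal_term)


up = [0.0, 0.0, 1.0]
at_prior = plane_guided_cost(10.0, 80.0, 80.0, up, up)    # hypothesis equals the prior depth
far_away = plane_guided_cost(10.0, 90.0, 80.0, up, up)    # hypothesis 10 m from the prior
```

Hypotheses near the prior depth have their cost reduced, while hypotheses far from it keep essentially the original cost, so the prior biases the later aggregation without overriding strong photometric evidence.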

To reduce computation, the depth search range is constrained around the prior depth index $ \tilde{l}_p $ with a margin $ \delta l = 8 $:

$$
\begin{cases}
\hat{l}_p^{\max} = \tilde{l}_p + \delta l \\
\hat{l}_p^{\min} = \tilde{l}_p - \delta l
\end{cases}
$$

During semi-global optimization, we incorporate a disparity jump step $ \Delta L = \text{round}(\hat{l}_p – \hat{l}_q) $ to penalize large changes between adjacent pixels. The cost aggregation formula becomes:

$$
L_r(p, \hat{l}_p) = \hat{C}(p, \hat{z}_l) + \min_{\hat{l}_q \in [\hat{l}_p^{\min}, \hat{l}_p^{\max}]} \left( L_r(p-r, \hat{l}_q) + \psi(\hat{l}_p, \hat{l}_q) \right) - \min_{k \in [\hat{l}_p^{\min}, \hat{l}_p^{\max}]} L_r(p-r, k)
$$

with the smoothness term:

$$
\psi(\hat{l}_p, \hat{l}_q) =
\begin{cases}
0 & \hat{l}_p + \Delta L = \hat{l}_q \\
P_1 & |\hat{l}_p + \Delta L - \hat{l}_q| = 1 \\
P_2 & |\hat{l}_p + \Delta L - \hat{l}_q| > 1
\end{cases}
$$
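One step of the range-constrained path aggregation can be sketched as follows. Several choices here are assumptions made for illustration: the previous pixel's path costs are stored in a dict keyed by depth index, the minimization runs over whatever indices that dict contains, $ \Delta L $ is passed in as a precomputed integer, and the penalty values $ P_1 = 1 $ and $ P_2 = 8 $ are placeholders.

```python
def smoothness(l_p, l_q, delta_l, p1, p2):
    """Penalty psi(l_p, l_q) with the index offset delta_l applied to l_p."""
    diff = abs(l_p + delta_l - l_q)
    if diff == 0:
        return 0.0
    if diff == 1:
        return p1
    return p2


def aggregate_step(cost_p, prev_path, l_prior, margin=8, delta_l=0,
                   p1=1.0, p2=8.0, num_levels=64):
    """Compute L_r(p, l) for l in [l_prior - margin, l_prior + margin].

    cost_p    : sequence of optimized matching costs for pixel p, indexed by level
    prev_path : dict mapping level -> L_r(p - r, level) of the previous pixel
    """
    lo = max(0, l_prior - margin)
    hi = min(num_levels - 1, l_prior + margin)
    prev_min = min(prev_path.values())
    out = {}
    for l_p in range(lo, hi + 1):
        best = min(val + smoothness(l_p, l_q, delta_l, p1, p2)
                   for l_q, val in prev_path.items())
        # Subtracting prev_min keeps the path costs from growing without bound.
        out[l_p] = cost_p[l_p] + best - prev_min
    return out


# Toy example: both pixels favor level 30 inside the window 24..40.
cost_p = [1.0] * 64
cost_p[30] = 0.0
prev_path = {l: 5.0 for l in range(24, 41)}
prev_path[30] = 0.0
path = aggregate_step(cost_p, prev_path, l_prior=32)
```

With $ \delta l = 8 $ each pixel evaluates 17 levels instead of all 64, which is where the reduction in aggregation time comes from.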

After optimization, WTA and parabolic fitting yield the final high-resolution depth map.
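The WTA selection with parabolic refinement can be sketched as below. The three-point parabola fit is the standard sub-sample refinement used in stereo matching and is assumed here, as the exact interpolation formula is not given.

```python
def wta_subpixel(costs, first_index=0):
    """Return the fractional index of the minimum aggregated cost.

    costs       : aggregated costs over consecutive depth levels
    first_index : level index corresponding to costs[0]
    """
    best = min(range(len(costs)), key=costs.__getitem__)
    offset = 0.0
    if 0 < best < len(costs) - 1:
        c_minus, c_0, c_plus = costs[best - 1], costs[best], costs[best + 1]
        denom = c_minus - 2.0 * c_0 + c_plus
        if denom > 0:
            # Vertex of the parabola through the three samples around the minimum.
            offset = 0.5 * (c_minus - c_plus) / denom
    return first_index + best + offset


# Costs sampled from a parabola whose true minimum lies at index 3.25.
samples = [(x - 3.25) ** 2 for x in range(7)]
refined = wta_subpixel(samples)
shifted = wta_subpixel(samples, first_index=24)
```

The `first_index` argument lets the same routine work on a constrained window, where `costs[0]` corresponds to $ \hat{l}_p^{\min} $ and not to level 0.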

2.3 Incremental Mesh Reconstruction

We adopt an incremental TSDF fusion method for online mesh reconstruction. For each pixel with refined depth $ \bar{z}_l $, its 3D position $ \mathbf{P}_V $ is computed as:

$$
\mathbf{P}_V = \bar{z}_l \mathbf{R}_{f_i} \mathbf{K}_{f_i}^{-1} \mathbf{p} + \mathbf{T}_{f_i}
$$

The depth difference $ \delta z = \bar{z}_l - z_V $, where $ z_V $ is the depth of voxel $ V $ in the current frame, is compared to a truncation distance $ \tau $. If $ |\delta z| < \tau $, the local TSDF value is computed as:

$$
\text{tt}(V) = \text{clamp}\left( \frac{\delta z}{\tau}, -1, 1 \right)
$$

The global TSDF value and weight are updated incrementally:

$$
\begin{cases}
T_t(V) = \frac{T_{t-1}(V) \cdot W_{t-1}(V) + \text{tt}(V) \cdot W_t(V)}{W_{t-1}(V) + W_t(V)} \\
W_t(V) = W_{t-1}(V) + 1
\end{cases}
$$
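The per-voxel update can be sketched as a running weighted average. Following the weight rule $ W_t(V) = W_{t-1}(V) + 1 $, the sketch assumes that each new observation contributes a weight of 1; the truncation distance of 0.5 m in the example is an arbitrary placeholder.

```python
def update_voxel(tsdf_prev, weight_prev, depth_obs, voxel_depth, trunc):
    """Fuse one depth observation into a voxel and return (tsdf, weight).

    depth_obs   : refined depth at the pixel the voxel projects to
    voxel_depth : depth of the voxel in the current camera frame
    trunc       : truncation distance tau
    """
    delta = depth_obs - voxel_depth
    if abs(delta) >= trunc:
        # Outside the truncation band: leave the voxel unchanged.
        return tsdf_prev, weight_prev
    local = max(-1.0, min(1.0, delta / trunc))
    w_new = 1.0  # assumed per-observation weight
    weight = weight_prev + w_new
    tsdf = (tsdf_prev * weight_prev + local * w_new) / weight
    return tsdf, weight


# A voxel 0.2 m in front of the observed surface, then exactly on it, then far from it.
state = update_voxel(0.0, 0.0, 10.2, 10.0, 0.5)
state2 = update_voxel(state[0], state[1], 10.0, 10.0, 0.5)
state3 = update_voxel(state2[0], state2[1], 12.0, 10.0, 0.5)
```

Averaging over frames reduces the effect of per-frame depth noise, and the sign change of the fused values marks the surface that Marching Cubes later extracts.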

Finally, the Marching Cubes algorithm extracts the zero-level isosurface to produce the mesh.

3. Experiments and Analysis

3.1 Datasets

We evaluate our method on four typical UAV imagery datasets covering urban buildings, mountainous terrain, residential areas, and rural regions. The images include weak textures, repetitive patterns, and large disparity variations. Ground truth depth and mesh are generated using offline commercial software (MetaShape). For two datasets with ground control points, absolute accuracy is assessed. Table 2 summarizes the dataset statistics.

Table 2: Dataset statistics of four UAV imagery groups
Dataset	UAV model	GSD (cm)	Resolution (pixels)	Checkpoints	Images
Urban	DJI Phantom 4	1.67	4000×3000	–	85
Mountain	eBee Classic	5.40	5472×3648	9	347
Residential	eBee X	2.39	3648×5472	–	101
Rural	DJI Phantom 4	1.88	2736×1824	13	482

Our implementation is written in C++ and runs on Ubuntu 22.04, with OpenCL GPU acceleration for TSDF fusion and raycasting. The hardware is an Intel i7-13700KF CPU and an NVIDIA RTX 4080 GPU. We compare against PatchmatchNet and Mobile3DRecon, using the same TSDF mesh extraction for fairness. For PatchmatchNet, we fuse its depth output with BundleFusion’s TSDF.

3.2 Depth Estimation Evaluation

Qualitative visual comparisons show that our depth maps are closer to ground truth, especially in building rooftops and rural roofs (weak texture, repetitive pattern). Figure 5 (qualitative) indicates that our method produces sharper edges and more complete depth in large-disparity areas, while competitors exhibit blur and noise. Quantitative metrics (Table 3) after 3-sigma outlier removal confirm that our approach achieves the lowest absolute relative error (Abs Rel), mean absolute error (MAE), and root mean square error (RMSE) across all datasets.

Table 3: Quantitative depth evaluation results for four datasets
Dataset	Method	Abs Rel (%)	MAE (m)	RMSE (m)
Urban	PatchmatchNet	0.5571	0.1679	0.2426
Urban	Mobile3DRecon	0.5223	0.1583	0.2271
Urban	Ours	0.4027	0.1221	0.1695
Mountain	PatchmatchNet	0.1722	0.3516	0.4054
Mountain	Mobile3DRecon	0.1398	0.2874	0.3074
Mountain	Ours	0.1125	0.2323	0.2656
Residential	PatchmatchNet	0.3098	0.1957	0.2375
Residential	Mobile3DRecon	0.3418	0.2251	0.2658
Residential	Ours	0.2491	0.1398	0.1624
Rural	PatchmatchNet	0.1975	0.1357	0.1723
Rural	Mobile3DRecon	0.1527	0.1102	0.1548
Rural	Ours	0.1231	0.1039	0.1357

3.3 Mesh Reconstruction Evaluation

Figure 6 (qualitative) shows that our meshes are significantly better than baselines in preserving building edges and terrain details, with fewer holes and less noise. Quantitative results (Table 4) report MAE, completeness, F1-score, standard deviation, and checkpoint MAE. Our method achieves the best F1-score and MAE across all datasets.

Table 4: Quantitative mesh evaluation results
Dataset	Method	Completeness	F1-Score	MAE (m)	STD (m)	Checkpoint MAE (m)
Urban	PatchmatchNet	0.8970	0.8576	0.1611	0.1624	–
Urban	Mobile3DRecon	0.8587	0.8973	0.1380	0.1396	–
Urban	Ours	0.8694	0.9285	0.1100	0.1192	–
Mountain	PatchmatchNet	0.9549	0.8306	0.3853	0.2977	0.2455
Mountain	Mobile3DRecon	0.9190	0.8763	0.2430	0.2455	0.2397
Mountain	Ours	0.9345	0.9075	0.1961	0.2102	0.1980
Residential	PatchmatchNet	0.9565	0.8714	0.2963	0.2665	–
Residential	Mobile3DRecon	0.7742	0.8267	0.3143	0.3070	–
Residential	Ours	0.9286	0.9182	0.1578	0.1545	–
Rural	PatchmatchNet	0.9692	0.9229	0.1160	0.1116	0.0920
Rural	Mobile3DRecon	0.9517	0.9383	0.0813	0.0851	0.0785
Rural	Ours	0.9529	0.9442	0.0782	0.0808	0.0694

The error distribution maps (Figures 7 and 8) confirm that our method consistently produces lower absolute errors, with the majority of points within tight thresholds. Larger errors only appear at building edges and dense vegetation due to occlusion and view changes.

3.4 Performance Evaluation

Table 5 shows per-stage timings for a resolution of 800×600 pixels. The overall per-frame depth estimation is under 11 ms, and TSDF fusion under 5 ms, allowing parallel execution. The Marching Cubes step takes ~546 ms, but overall single-frame processing remains below 1 second. Table 6 compares total depth and mesh reconstruction timings across datasets. Our method is 1.66× faster than PatchmatchNet and 1.47× faster than Mobile3DRecon on average, while achieving better accuracy.

Table 5: Per-stage timings for the proposed method (ms)
Stage	Time (ms)
Downsampled depth computation	2.96
Planar-guided depth optimization	5.39
TSDF fusion	4.13
Single-frame mesh extraction	546.2

Table 6: Average per-frame depth estimation / mesh reconstruction time (ms)
Dataset	PatchmatchNet	Mobile3DRecon	Ours
Urban	73.0 / 857.2	12.6 / 819.3	8.35 / 546.0
Mountain	74.0 / 768.5	11.1 / 715.2	7.58 / 491.6
Residential	71.0 / 757.6	10.2 / 712.8	7.43 / 489.6
Rural	93.0 / 985.6	25.76 / 957.8	10.68 / 654.9
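The reported speedups can be checked against Table 6. The text does not state how the averages were formed; one reading that agrees with the reported figures to within rounding is to add the depth and mesh times per dataset and then average the per-dataset ratios, which is what the sketch below assumes.

```python
# Per-frame times from Table 6 as (depth_ms, mesh_ms).
TIMES = {
    "Urban": {"PatchmatchNet": (73.0, 857.2), "Mobile3DRecon": (12.6, 819.3), "Ours": (8.35, 546.0)},
    "Mountain": {"PatchmatchNet": (74.0, 768.5), "Mobile3DRecon": (11.1, 715.2), "Ours": (7.58, 491.6)},
    "Residential": {"PatchmatchNet": (71.0, 757.6), "Mobile3DRecon": (10.2, 712.8), "Ours": (7.43, 489.6)},
    "Rural": {"PatchmatchNet": (93.0, 985.6), "Mobile3DRecon": (25.76, 957.8), "Ours": (10.68, 654.9)},
}


def mean_speedup(baseline):
    """Mean over datasets of total_baseline / total_ours."""
    ratios = [sum(row[baseline]) / sum(row["Ours"]) for row in TIMES.values()]
    return sum(ratios) / len(ratios)


def mean_reduction_percent(baseline):
    """Mean over datasets of the relative time saving, in percent."""
    savings = [1.0 - sum(row["Ours"]) / sum(row[baseline]) for row in TIMES.values()]
    return 100.0 * sum(savings) / len(savings)


speedup_pmn = mean_speedup("PatchmatchNet")
speedup_m3d = mean_speedup("Mobile3DRecon")
saving_pmn = mean_reduction_percent("PatchmatchNet")
saving_m3d = mean_reduction_percent("Mobile3DRecon")
slowest_total_ms = max(sum(row["Ours"]) for row in TIMES.values())
```

Under this reading, the slowest dataset (Rural) still totals well under 1000 ms per frame for the proposed method, consistent with the stated online rate.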

4. Conclusion

We have proposed an online depth estimation and mesh reconstruction method for UAVs that integrates TSDF with global planar priors. By raycasting prior depth and normals from the historical TSDF field, we compute plane-guided multi-view matching costs that improve accuracy under large disparity variations. An occlusion-aware cost fusion strategy further rejects erroneous matches. The global planar prior is incrementally fused to optimize the matching cost at the original resolution, which improves depth quality in low-texture and repetitive-texture regions while the constrained search range reduces computational cost. Incremental TSDF fusion enables online mesh generation. Experiments on four diverse UAV datasets show that our method achieves the best depth and mesh accuracy, with MAE of at most about 0.232 m and 0.196 m, respectively. On the Mountain dataset, where these largest errors occur, they are 19.2% and 19.3% lower than those of Mobile3DRecon. The average reconstruction time is under 1 second per frame, meeting the online requirement for UAV applications. Future work will explore adaptive truncation thresholds and learning-based refinements to further improve robustness and accuracy in challenging scenes.