The proliferation of Unmanned Aerial Vehicles (UAVs) and advances in computer vision have established drone-captured imagery as a cornerstone of detailed 3D modeling. These models are pivotal across numerous fields, including digital cities, forestry management, and cultural heritage preservation, serving functions from 3D visualization and analysis to safety monitoring and decision support. Traditional image-based 3D reconstruction, rooted in multi-view stereo (MVS), typically follows a pipeline of feature extraction, structure-from-motion (SfM), and dense matching. While effective, these methods can struggle with texture-less or repetitive-texture regions, often producing models with missing details, artifacts, or holes. Recent breakthroughs in neural implicit representations, initiated by Neural Radiance Fields (NeRF), offer a paradigm shift: by representing a scene as a continuous volumetric function approximated by a multilayer perceptron (MLP), NeRF synthesizes highly realistic novel views. However, standard NeRF models are computationally intensive and fundamentally designed for view synthesis, making the direct extraction of accurate, high-fidelity 3D surface models challenging.
To address surface reconstruction explicitly, subsequent research has focused on neural implicit surface representations, primarily through occupancy fields and Signed Distance Functions (SDF). SDF-based methods have demonstrated superior capability for high-precision surface reconstruction. A significant milestone in this domain is Bakedangelo, a modular framework built within SDFStudio that synthesizes strengths from several models like VolSDF, BakedSDF, and Neuralangelo. It employs hash encoding for efficient feature representation and a “baking” strategy, achieving state-of-the-art mesh reconstruction from multi-view images. Despite its strengths, we observe that the standard Bakedangelo model does not fully exploit geometric consistency across different drone viewpoints and overlooks the local characteristics inherent in its ray-sampling strategy. When applied to complex UAV drone scenes—characterized by varying scales, occlusions, and repetitive structures—this can lead to an insufficient recovery of fine surface details.

This paper proposes an enhanced Bakedangelo method specifically tailored for high-precision 3D reconstruction from UAV drone imagery. Our core contributions are threefold: First, we integrate a multi-view geometric consistency loss to enforce 3D correspondence between pixels across different drone viewpoints. Second, we incorporate a Stochastic Structural Similarity (S3IM) loss that leverages the local information from the ray-sampling process, moving beyond global image similarity measures. Third, we refine the network’s activation function to better preserve scene continuity, particularly for signed distance values. Comprehensive experiments on both public and custom UAV drone datasets demonstrate that our method reconstructs 3D meshes with significantly enhanced geometric detail while maintaining overall reconstruction quality and completeness.
Related Work
Neural Radiance Fields (NeRF) and View Synthesis. The seminal NeRF work parameterizes a continuous volumetric scene using an MLP, mapping 3D coordinates and viewing direction to volume density and view-dependent color. Through volumetric rendering, it achieves photorealistic novel view synthesis. However, the extracted geometry, often represented by density thresholds, tends to be noisy and imprecise. Subsequent works like Mip-NeRF and NeRF++ improved anti-aliasing and unbounded scene representation but did not fundamentally change the focus from density-based volume to explicit surface.
Neural Implicit Surface Reconstruction. A pivotal direction shifts from modeling density to modeling the underlying surface geometry. Occupancy networks represent space as a probability of occupancy. More directly, SDF-based methods model the signed distance from any 3D point to the nearest surface, where the zero-level set defines the surface. NeuS bridged volume rendering and SDF by devising a density transformation from SDF. VolSDF further refined this with a probabilistic framework. These methods enable high-quality surface extraction via Marching Cubes but are often slow to train and converge.
Acceleration and Regularization for SDFs. Recent works focus on efficiency and fidelity. Neuralangelo combined multi-resolution hash encoding—inspired by Instant-NGP—with SDF-based reconstruction and employed coarse-to-fine regularization using numerical second-order derivatives to recover fine details. BakedSDF introduced a progressive training scheme and a contraction of world space to handle unbounded scenes common in UAV drone captures. The Bakedangelo model integrates these advances: it uses the hash-accelerated architecture and contraction space from BakedSDF, and incorporates the curvature regularization from Neuralangelo, establishing a strong baseline for high-resolution mesh reconstruction from complex imagery.
Leveraging Multi-view and Local Information. Prior works have explored additional constraints. Geo-NeuS enforced multi-view geometry consistency by minimizing photometric differences between patches projected from estimated surfaces. For ray-based rendering, S3IM was introduced to compute Structural Similarity (SSIM) on randomly sampled image patches from rendering rays, providing a loss that respects the local nature of the sampling process. Our method is inspired by these ideas but integrates them into the robust Bakedangelo framework specifically for the challenges posed by UAV drone data.
Methodology
Our enhanced pipeline, depicted in the following overview, takes posed UAV drone images as input. The core of our approach modifies the Bakedangelo framework by integrating novel loss functions and refining the network architecture to better capture fine details and ensure geometric consistency.
Foundation: SDF-based Neural Implicit Scene Representation
Given a set of calibrated UAV drone images with known camera poses, we aim to learn a continuous implicit function representing the 3D scene. Unlike NeRF, which models volume density $\sigma$, we model a Signed Distance Function $f(\mathbf{x}) \in \mathbb{R}$, where $f(\mathbf{x})=0$ defines the surface, $f(\mathbf{x})>0$ denotes space outside the object, and $f(\mathbf{x})<0$ space inside it. The SDF is transformed into a density $\sigma$ for volumetric rendering via a logistic density distribution $\Psi_s$:
$$ \sigma(\mathbf{x}) = \Psi_s(f(\mathbf{x})) = \frac{ \exp(-s \cdot f(\mathbf{x})) }{ (1 + \exp(-s \cdot f(\mathbf{x})))^2 } \cdot s $$
where $s$ is a learnable parameter. A color network $c(\mathbf{x}, \mathbf{d})$ predicts the view-dependent color at point $\mathbf{x}$ in direction $\mathbf{d}$. A pixel color $\hat{C}(\mathbf{r})$ for a ray $\mathbf{r}(t)=\mathbf{o} + t\mathbf{d}$ is rendered by approximating the volume rendering integral:
$$ \hat{C}(\mathbf{r}) = \sum_{i=1}^N T_i \alpha_i c_i, \quad \text{with} \quad T_i = \exp\left( -\sum_{j=1}^{i-1} \sigma_j \delta_j \right), \quad \alpha_i = 1 - \exp(-\sigma_i \delta_i) $$
where $\delta_i$ is the distance between adjacent samples. Bakedangelo operates in a contracted space to handle the unbounded nature of outdoor UAV drone scenes and utilizes multi-resolution hash encoding for efficient and expressive feature representation of spatial coordinates.
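To make the transform and rendering sum above concrete, the following NumPy sketch implements the logistic SDF-to-density mapping and the discrete compositing for a single ray (function and variable names are our own, not Bakedangelo's):

```python
import numpy as np

def logistic_density(sdf, s):
    """Psi_s: map signed distance to volume density via the logistic density."""
    e = np.exp(-s * sdf)
    return s * e / (1.0 + e) ** 2

def render_ray(sdf_samples, colors, deltas, s=10.0):
    """Composite N samples along one ray: C = sum_i T_i * alpha_i * c_i."""
    sigma = logistic_density(sdf_samples, s)
    alpha = 1.0 - np.exp(-sigma * deltas)  # per-sample opacity
    # T_i = exp(-sum_{j<i} sigma_j * delta_j): transmittance before sample i
    trans = np.exp(-np.concatenate([[0.0], np.cumsum(sigma * deltas)[:-1]]))
    weights = trans * alpha
    return (weights[:, None] * colors).sum(axis=0), weights
```

A ray whose samples cross the zero level set concentrates its rendering weights near the surface, so the composited color approaches the surface color as the opacity saturates.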
Enhanced Loss Formulation
The training of the network is governed by a composite loss function. The baseline Bakedangelo loss $L_{base}$ includes the standard color reconstruction loss $L_{rgb}$, the Eikonal loss $L_{eikonal}$ to regularize the SDF gradient, an interlevel loss $L_{interlevel}$ for consistency between coarse and fine sampling, and a curvature loss $L_{curvature}$ to penalize high-frequency noise in the SDF.
$$ L_{base} = L_{rgb} + L_{interlevel} + \alpha L_{eikonal} + \beta L_{curvature} $$
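For intuition, the Eikonal term penalizes any deviation of the SDF gradient norm from one, which a valid distance field satisfies everywhere. A minimal sketch using central finite differences (the actual implementation uses automatic differentiation; `sphere_sdf` is an illustrative test field of our own):

```python
import numpy as np

def eikonal_loss(sdf_fn, points, eps=1e-4):
    """Mean (||grad f|| - 1)^2 over sample points, gradients by central differences."""
    grads = np.zeros_like(points)
    for k in range(points.shape[1]):
        offset = np.zeros(points.shape[1])
        offset[k] = eps
        grads[:, k] = (sdf_fn(points + offset) - sdf_fn(points - offset)) / (2 * eps)
    norms = np.linalg.norm(grads, axis=1)
    return float(np.mean((norms - 1.0) ** 2))

# A true SDF (distance to the unit sphere) incurs near-zero Eikonal loss:
sphere_sdf = lambda p: np.linalg.norm(p, axis=1) - 1.0
```

Scaling the field by two doubles the gradient norm, so the same points incur a loss of about one, illustrating why the term keeps the network close to a metric distance function.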
We enhance this formulation by integrating two key constraints tailored for UAV drone imagery.
Multi-view Geometric Consistency Loss ($L_{multi}$)
To enforce that a 3D point on the surface projects to photometrically consistent image regions across different UAV drone viewpoints, we adopt a patch-based Normalized Cross-Correlation (NCC) loss. For a rendered pixel $i$ in image $I_r$, we extract an $11 \times 11$ patch $q_i$ centered on it. This patch is projected into all source images $I_s$ using the current depth estimate and camera poses to obtain corresponding patches $q_{ij}$. The similarity between the rendered patch and a source patch is measured by NCC:
$$ \text{NCC}(I_r(q_i), I_s(q_{ij})) = \frac{\text{Cov}(I_r(q_i), I_s(q_{ij}))}{\sqrt{\text{Var}(I_r(q_i)) \text{Var}(I_s(q_{ij}))}} $$
To handle occlusions, we consider only the top-$K$ ($K=4$) highest NCC scores for each rendered patch. The multi-view consistency loss is then the average of $(1 - \text{NCC})$ over all sampled pixels and their top-$K$ source patches:
$$ L_{multi} = \frac{1}{4N} \sum_{i=1}^{N} \sum_{j=1}^{4} \left( 1 - \text{NCC}(I_r(q_i), I_s(q_{ij})) \right) $$
This loss encourages the network to develop a geometrically coherent surface that aligns with photometric evidence from multiple UAV drone perspectives.
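The patch scoring and top-$K$ selection can be sketched as follows; the reprojection of patches into source views (which depends on the current depth and camera poses) is omitted, and the interface is our own simplification:

```python
import numpy as np

def ncc(patch_a, patch_b, eps=1e-8):
    """Normalized cross-correlation between two gray-scale patches."""
    a = patch_a.ravel() - patch_a.mean()
    b = patch_b.ravel() - patch_b.mean()
    return float((a * b).sum() / (np.sqrt((a * a).sum() * (b * b).sum()) + eps))

def multiview_ncc_loss(ref_patches, src_patches, k=4):
    """Average (1 - NCC) over the top-k most consistent source views per patch.

    ref_patches: (N, 11, 11) rendered patches; src_patches: (N, V, 11, 11)
    patches reprojected from V source views (reprojection itself omitted here).
    """
    losses = []
    for ref, views in zip(ref_patches, src_patches):
        scores = sorted((ncc(ref, v) for v in views), reverse=True)[:k]  # occlusion filter
        losses.extend(1.0 - s for s in scores)
    return float(np.mean(losses))
```

Keeping only the best-scoring views discards source images in which the patch is occluded, so occluded views do not pull the surface toward inconsistent photometric evidence.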
Stochastic Structural Similarity Loss ($L_{S3IM}$)
Standard image-level losses like SSIM do not align with the local ray-sampling nature of NeRF-style training. The S3IM loss operates on randomly sampled patches from the batch of rays. For a training iteration, we randomly group $B$ rays and form corresponding rendered and ground truth image patches $P^{(m)}(\hat{C})$ and $P^{(m)}(C)$. We compute the SSIM on these patches (using a $4 \times 4$ kernel and stride 4). This process is repeated $M$ times per iteration, and the final loss is defined as:
$$ \text{S3IM} = \frac{1}{M} \sum_{m=1}^{M} \text{SSIM}(P^{(m)}(\hat{C}), P^{(m)}(C)) $$
$$ L_{S3IM} = 1 - \text{S3IM} $$
This loss provides a powerful, locally-aware image quality constraint that helps recover finer textural and structural details often present in UAV drone imagery of built environments.
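A minimal sketch of the idea for single-channel rays: randomly regroup the ray batch into pseudo-patches, score each with a blockwise SSIM ($4 \times 4$ kernel, stride 4), and average over $M$ repeats. Patch size, $M$, and the gray-scale simplification are our own choices for illustration:

```python
import numpy as np

def ssim_blocks(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Mean SSIM over non-overlapping 4x4 blocks (kernel 4, stride 4)."""
    h, w = x.shape[0] // 4 * 4, x.shape[1] // 4 * 4
    xb = x[:h, :w].reshape(h // 4, 4, w // 4, 4).transpose(0, 2, 1, 3).reshape(-1, 16)
    yb = y[:h, :w].reshape(h // 4, 4, w // 4, 4).transpose(0, 2, 1, 3).reshape(-1, 16)
    mx, my = xb.mean(1), yb.mean(1)
    vx, vy = xb.var(1), yb.var(1)
    cov = ((xb - mx[:, None]) * (yb - my[:, None])).mean(1)
    s = ((2 * mx * my + c1) * (2 * cov + c2)) / ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))
    return float(s.mean())

def s3im_loss(pred_rays, gt_rays, patch_hw=(64, 64), m=10, seed=0):
    """Stochastically regroup rays into M pseudo-patches and average the SSIM."""
    rng = np.random.default_rng(seed)
    h, w = patch_hw
    total = 0.0
    for _ in range(m):
        idx = rng.choice(len(pred_rays), size=h * w, replace=False)
        total += ssim_blocks(pred_rays[idx].reshape(h, w), gt_rays[idx].reshape(h, w))
    return 1.0 - total / m
```

Because the pseudo-patches are assembled from the same random ray indices in the rendered and ground-truth images, structural statistics remain comparable even though the pixels are not spatially contiguous.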
Activation Function Optimization
The original Bakedangelo model uses the ReLU activation function in its color network. While computationally efficient, ReLU yields zero gradients for negative inputs, which can cause “dying neurons”, hindering learning and reducing the continuity of the learned scene representation. The SDF values, which feed into parts of the color network, can be both positive and negative. We therefore replace all ReLU activations in the color network with the Softplus function, defined as:
$$ \text{Softplus}(x) = \ln(1 + e^{x}) $$
Softplus is a smooth, differentiable approximation of ReLU that provides non-zero gradients for negative inputs. This change promotes better gradient flow during backpropagation, leading to a more continuous and detailed scene representation without impacting training speed significantly.
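The contrast between the two activations can be seen in a few lines (a numerically stable form of Softplus is used, since a naive `log(1 + exp(x))` overflows for large inputs):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def softplus(x):
    # log(1 + e^x), written stably via logaddexp for large |x|
    return np.logaddexp(0.0, x)

def grad_softplus(x):
    # d/dx log(1 + e^x) = sigmoid(x): strictly positive for every input
    return 1.0 / (1.0 + np.exp(-x))
```

For a negative input such as $x = -5$, ReLU's gradient is exactly zero while `grad_softplus(-5.0)` remains positive, which is precisely the property that keeps negative SDF-derived features trainable.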
Total Loss and Mesh Extraction
The total loss for our enhanced model is a weighted sum of all components:
$$ L_{total} = L_{rgb} + L_{interlevel} + \alpha L_{eikonal} + \beta L_{curvature} + \gamma L_{multi} + \delta L_{S3IM} $$
where $\alpha, \beta, \gamma, \delta$ are hyperparameters balancing the different terms. After training convergence, the 3D mesh is extracted from the learned SDF field $f(\mathbf{x})$ by applying the Marching Cubes algorithm on its zero-level set within a defined bounding volume. The resulting mesh is then transformed back to the real-world coordinate system using the inverse of the initial pose normalization transformation.
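The surface lives at the zero crossings of $f$. Before running full Marching Cubes, the level set can be sanity-checked by locating sign changes of the SDF along a line of samples, a one-dimensional analogue of what Marching Cubes does per grid edge (a minimal NumPy sketch of our own, not the extraction code itself):

```python
import numpy as np

def zero_crossings_1d(ts, f_vals):
    """Linearly interpolate surface locations where the SDF changes sign."""
    s = np.sign(f_vals)
    idx = np.where(s[:-1] * s[1:] < 0)[0]        # sign change between i and i+1
    t0, t1 = ts[idx], ts[idx + 1]
    f0, f1 = f_vals[idx], f_vals[idx + 1]
    return t0 + (t1 - t0) * (-f0) / (f1 - f0)    # root of the linear segment
```

Marching Cubes performs this same linear interpolation on every edge of the $2048^3$ voxel grid and stitches the resulting vertices into triangles.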
Experiments and Evaluation
Datasets and Implementation Details
We evaluate our method on three UAV drone datasets to demonstrate its effectiveness across different scales and scene complexities.
| Dataset | Source | # Images | Image Resolution | Scene Description |
|---|---|---|---|---|
| Cuhk1 | U-Scene [26] | 119 | 2735 × 1819 | University campus (medium scale) |
| Cuhk4 | U-Scene [26] | 51 | 1368 × 912 | University campus (closer scale) |
| CSNL | Custom | 118 | 1000 × 1500 | University building & surroundings |
The Ground Truth (GT) for the public datasets is derived from registered 3D laser scans. For our custom CSNL dataset, GT is obtained by fusing UAV-borne and terrestrial LiDAR point clouds.
We compare against several baselines: 1) Metashape MVS: A leading commercial photogrammetry software. 2) DUSt3R: A recent end-to-end deep learning model for 3D reconstruction from images. 3) NeuS: A foundational SDF-based neural implicit model. 4) BakedSDF: A precursor to Bakedangelo. 5) Bakedangelo: The current state-of-the-art baseline our method builds upon.
All neural methods are trained for 700k iterations on a single NVIDIA A100 GPU. We use a ray batch size of 2048, 48 samples per ray, and a Marching Cubes resolution of $2048^3$ for mesh extraction. The loss hyperparameters are set empirically: $\alpha=0.1$, $\beta=0.01$, $\gamma=0.1$, $\delta=0.1$.
Evaluation Metric
Since our primary goal is accurate geometric reconstruction, we employ the Cloud-to-Mesh (C2M) distance as the main evaluation metric. For a GT point cloud $P$ with $N$ points, the C2M measures the average absolute distance from each GT point to the closest surface on the reconstructed mesh $M$:
$$ \text{C2M} = \frac{1}{N} \sum_{\mathbf{p} \in P} | \text{Dist}(\mathbf{p}, M) | $$
A lower C2M value indicates better geometric fidelity of the reconstructed mesh to the GT data.
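As a simplified illustration, the metric can be approximated by the distance from each GT point to the nearest mesh vertex; the true C2M uses point-to-triangle distance, and production tools (e.g., CloudCompare) should be used for actual evaluation:

```python
import numpy as np

def c2m_distance(gt_points, mesh_vertices):
    """Approximate Cloud-to-Mesh: mean distance from each GT point to the
    nearest mesh vertex (true C2M measures distance to the nearest triangle)."""
    # Pairwise distances (N_gt, N_vert); brute force is fine for small clouds,
    # use a KD-tree for survey-scale point clouds.
    d = np.linalg.norm(gt_points[:, None, :] - mesh_vertices[None, :, :], axis=2)
    return float(d.min(axis=1).mean())
```

The vertex-based approximation upper-bounds the true C2M, since the closest point on a triangle is never farther than its closest vertex.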
Results and Analysis
Qualitative Results
The novel view synthesis results, as shown in the comparative figure, reveal clear differences. The baseline NeuS and BakedSDF models produce blurry renderings, particularly in texture-less areas like building roofs. The standard Bakedangelo model yields sharp images overall but still falters on fine, repetitive textures (e.g., arched roof panels). Our enhanced method successfully renders these challenging details with higher fidelity.
The 3D mesh reconstructions provide the most compelling evidence of our method’s superiority. The MVS result is complete but overly smooth, lacking fine structures. DUSt3R produces a very coarse and incomplete reconstruction. NeuS and BakedSDF generate overly smoothed meshes missing most details. While Bakedangelo produces high-quality meshes, our method consistently recovers more geometric intricacies. As visible in the detailed insets, our reconstructions better capture elements such as:
- Thin architectural features (decorative roof grids, railings).
- Urban furniture (lampposts, street signs).
- Complex roof structures with weak texture.
These improvements are crucial for applications requiring high-fidelity models from UAV drone surveys.
Quantitative Comparison
The quantitative evaluation based on the C2M metric is summarized in the table below. Our method achieves the lowest C2M error on all three UAV drone datasets.
| Method | CSNL Dataset (m) | Cuhk1 Dataset (m) | Cuhk4 Dataset (m) |
|---|---|---|---|
| Metashape MVS | 0.911 | 0.609 | 0.562 |
| DUSt3R | 2.050 | 2.626 | 2.802 |
| NeuS | 0.945 | 1.932 | 0.943 |
| BakedSDF | 0.480 | 0.557 | 0.349 |
| Bakedangelo | 0.487 | 0.337 | 0.333 |
| Our Method | 0.465 | 0.335 | 0.289 |
This demonstrates that our enhancements not only add visual detail but also improve the overall geometric accuracy of the reconstructed surface relative to the LiDAR ground truth.
Ablation Study
We conduct an ablation study on the CSNL UAV drone dataset to validate the contribution of each proposed component. We train three variants: 1) Bakedangelo with only the Softplus modification (B-Softplus), 2) Bakedangelo with only the multi-view loss (B+Multi), and 3) Bakedangelo with only the S3IM loss (B+S3IM). Results are compared against the full model and the original Bakedangelo.
| Ablation Variant | C2M (m) | Training Time | Key Observation |
|---|---|---|---|
| Bakedangelo (Original) | 0.487 | ~28h 52m | Baseline performance. |
| B-Softplus | 0.491 | ~28h 52m | Minor quantitative change, improves continuity. |
| B+Multi (ReLU) | 0.472 | ~28h 11m | Improves detail but can increase holes in low-overlap areas. |
| B+S3IM (ReLU) | 0.483 | ~28h 58m | Moderate quantitative improvement, some detail lacking. |
| Our Full Method | 0.465 | ~28h 53m | Best overall accuracy and detail preservation. |
The study shows that each component contributes to the final result. The multi-view consistency loss is particularly effective at recovering geometric details, while S3IM provides robust local supervision. The Softplus activation aids in maintaining surface continuity. Their combination in the full model synergistically achieves the best balance, enhancing detail recovery while preserving the completeness of the reconstruction, all without incurring additional training time.
Discussion and Future Work
Our enhanced Bakedangelo method effectively addresses specific limitations in state-of-the-art neural implicit reconstruction when applied to UAV drone imagery. The integration of $L_{multi}$ directly tackles the geometric ambiguity that can arise from varying drone viewpoints, enforcing a stronger 3D consensus. The $L_{S3IM}$ loss successfully capitalizes on the intrinsic locality of the rendering process, providing a more appropriate supervisory signal for recovering high-frequency details common in urban and built environments captured by drones. The switch to Softplus, while simple, contributes to a smoother optimization landscape for the signed distance field.
A limitation of our current work is its focus on scene-scale reconstruction. While effective for building complexes and campus areas, scaling to city-scale or extremely large UAV drone surveys remains a challenge due to memory and computational constraints of dense hash grids. Furthermore, the quality of reconstruction is still dependent on accurate initial camera poses from SfM.
Future work will explore several directions. First, we aim to investigate adaptive or sparse hash encodings to manage memory consumption for large-scale UAV drone projects. Second, integrating learnable exposure modeling could improve handling of varying lighting conditions across a drone flight mission. Finally, exploring the synergy between our SDF-based method and explicit representations like 3D Gaussian Splatting (3DGS) is a promising avenue. Hybrid approaches, where a coarse SDF guides detailed Gaussian reconstruction or vice-versa, could leverage the strengths of both paradigms for even faster and more detailed UAV drone-based modeling.
Conclusion
In this paper, we presented an enhanced neural implicit surface reconstruction method tailored for generating high-precision 3D models from UAV drone imagery. Building upon the robust Bakedangelo framework, we integrated a multi-view geometric consistency loss and a stochastic structural similarity loss to better exploit the information available in multi-perspective drone captures and the ray-sampling pipeline. Additionally, we refined the network’s activation function to improve scene continuity. Extensive experiments on multiple UAV drone datasets demonstrate that our method consistently outperforms strong baselines, including the original Bakedangelo, in recovering fine geometric details such as architectural elements and urban furniture, while maintaining high overall geometric accuracy as measured by Cloud-to-Mesh distance. This work provides a significant step towards fully automated, detail-preserving 3D reconstruction from widely available UAV drone imagery, with broad applications in surveying, mapping, and digital twin creation.
