3D Reconstruction of Architectural Models from China UAV Drone Imagery Using Multi-View Sensor Fusion and Neural Implicit Representations

The rapid advancement of computer vision and graphics has fundamentally transformed 3D scene reconstruction, enabling the extraction of rich geometric information from simple imagery. This technology, crucial for applications ranging from autonomous systems to digital heritage preservation, traditionally relied on complex geometric pipelines or explicit deep learning models. However, these approaches often struggle with the intricate details, varying lighting, and structural complexity inherent in architectural environments. The emergence of Neural Radiance Fields (NeRF) marked a paradigm shift towards continuous, implicit scene representations, capable of synthesizing photorealistic novel views from sparse input images. While powerful, standard NeRF models face significant challenges when applied to large-scale, structured environments typical of architectural surveys conducted via China UAV drone platforms, including long training times, aliasing artifacts, and a lack of explicit geometric priors for man-made structures.

This work addresses these challenges by introducing a novel framework tailored for high-fidelity architectural reconstruction from China UAV drone imagery. Our core contribution is Structure-NeRF, a method that synergistically combines the strengths of multi-scale anti-aliasing from mip-NeRF and the efficient representation from instant neural graphics primitives (iNGP). We introduce enhanced sampling and pre-filtering strategies to accelerate convergence and improve rendering quality. Crucially, we integrate strong geometric constraints—enforcing normal consistency, planar surface priors, and vertical/horizontal alignment—directly into the optimization process. Furthermore, we leverage the Manhattan World Model (MWM) as a powerful prior to robustly extract and dimensionally define the primary structural components of buildings from the reconstructed implicit field. This integrated approach enables efficient, precise, and structurally coherent 3D model generation from aerial multi-view sensor data captured by China UAV drones.

The proliferation of China UAV drone technology has made aerial multi-view data acquisition standard practice for architectural inspection, surveying, and 3D modeling. Drones equipped with high-resolution sensors can efficiently capture comprehensive imagery of structures from diverse angles and altitudes, providing the dense, overlapping views necessary for high-quality reconstruction. This data paradigm shifts the challenge from acquisition to processing: developing algorithms that can efficiently and accurately convert hundreds of images into a usable, metrically sound, and detailed 3D model. Traditional Structure-from-Motion (SfM) and Multi-View Stereo (MVS) pipelines, while effective, often produce noisy, incomplete point clouds or meshes that require significant manual cleanup and lack inherent semantic or structural understanding. Neural implicit representations like NeRF offer a compelling alternative by modeling a scene as a continuous volumetric function, learned by a multilayer perceptron (MLP). The function $F_{\Theta}(\mathbf{x}, \mathbf{d})$ maps a 3D location $\mathbf{x}$ and viewing direction $\mathbf{d}$ to a volume density $\sigma$ and color $\mathbf{c}$:

$$ (\sigma, \mathbf{c}) = F_{\Theta}(\gamma(\mathbf{x}), \gamma(\mathbf{d})) $$

where $\gamma(\cdot)$ denotes a positional encoding function. Volume rendering is then used to synthesize a pixel’s color $C(\mathbf{r})$ for a ray $\mathbf{r}(t) = \mathbf{o} + t\mathbf{d}$:

$$ C(\mathbf{r}) = \int_{t_n}^{t_f} T(t) \sigma(\mathbf{r}(t)) \mathbf{c}(\mathbf{r}(t), \mathbf{d}) \, dt, \quad \text{with} \quad T(t) = \exp\left( -\int_{t_n}^{t} \sigma(\mathbf{r}(s)) \, ds \right) $$

In practice, this integral is approximated via stratified sampling. The model is trained by minimizing the photometric loss between rendered and ground truth pixel colors. While standard NeRF produces stunning results for object-centric scenes, its direct application to China UAV drone-captured architecture is hindered by several factors: (1) Scale and Aliasing: Aerial images cover vast areas with fine details; standard point sampling leads to blurring and jagged edges (aliasing) when rendering views at different resolutions. (2) Training Efficiency: The requirement for hundreds of thousands of iterative steps makes the process prohibitively slow for large scenes. (3) Geometric Ambiguity: The photometric loss alone can lead to unrealistic, “floaty” geometry or shapes that are visually plausible from training views but geometrically incorrect. (4) Structural Extraction: The output is a radiance field, not a CAD-ready model with clean planes, right angles, and measurable dimensions.
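To make the discrete approximation concrete, the following sketch composites a single ray from sampled densities and colors using the standard quadrature of the rendering integral. It is a minimal illustration in PyTorch; the function name `render_ray` and the tensor layout are our own assumptions rather than part of any specific implementation.

```python
import torch

def render_ray(densities, colors, t_vals):
    """Numerical quadrature of the volume rendering integral along one ray.

    densities: (N,) volume densities sigma at the stratified sample depths t_vals (N,).
    colors:    (N, 3) emitted RGB at the same samples.
    Returns the composited pixel color C(r).
    """
    deltas = t_vals[1:] - t_vals[:-1]
    deltas = torch.cat([deltas, torch.full_like(deltas[:1], 1e10)])  # last interval treated as open-ended
    alpha = 1.0 - torch.exp(-densities * deltas)                     # per-segment opacity
    trans = torch.cumprod(1.0 - alpha + 1e-10, dim=0)
    trans = torch.cat([torch.ones_like(trans[:1]), trans[:-1]])      # T_i: transmittance before sample i
    weights = trans * alpha
    return (weights[:, None] * colors).sum(dim=0)                    # composited RGB for the pixel
```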

Our proposed Structure-NeRF framework systematically tackles each of these limitations. The following table summarizes the core challenges and our corresponding technical solutions.

| Challenge in UAV-based Architectural NeRF | Structure-NeRF Solution | Key Benefit |
| --- | --- | --- |
| Aliasing and blur at different scales | Integrated multi-scale conical sampling and feature pre-filtering | Sharp textures and crisp edges across all viewing resolutions |
| Slow training convergence | Fusion of iNGP's multi-resolution hash encoding with mip-NeRF's framework | Dramatic reduction in training time (e.g., 50% faster than mip-NeRF) |
| Unrealistic, non-planar geometry | Integrated geometric constraints (normal, plane, vertical/horizontal) | Physically plausible, regularized surfaces aligned with architectural principles |
| Lack of an extractable structural model | Post-hoc analysis of the density field using the Manhattan World Model (MWM) | Direct extraction of dominant planes, edges, and measurable dimensions (height, width) |

Integrated Multi-Scale Sampling and Efficient Representation

The foundation of our method is a novel synthesis of mip-NeRF and iNGP. Mip-NeRF addresses aliasing by considering a pixel not as a single ray but as a conical frustum, sampling integrated features over a volume. Instead of a single 3D point $x$, it models a Gaussian distribution of points within that frustum. We extend this concept to work seamlessly with iNGP’s multi-resolution hash encoding, which provides near-instantaneous feature lookup and dramatically speeds up training.

For a pixel, we construct a set of samples representing the cross-section of its conical frustum. We employ a multi-sample pattern of six points arranged in a hexagon. The angular components $\theta_j$ for the six samples are:

$$ (\theta_1, \dots, \theta_6) = \left(0, \frac{2\pi}{3}, \frac{4\pi}{3}, \pi, \frac{5\pi}{3}, \frac{\pi}{3} \right) $$

The distance along the ray $t_j$ for each sample is calculated using a transformation that concentrates samples near the apex of the cone, where detail is highest:

$$ t_j = t_0 + \frac{t_{\delta} \left( t_1^2 + 2t_{\mu}^2 + 3\sqrt{\tfrac{3}{7}}\left(\tfrac{2j}{5} - 1\right) f_t \right)}{t_{\delta}^2 + 3t_{\mu}^2} $$

where

$$ t_{\mu} = \frac{t_0 + t_1}{2}, \quad t_{\delta} = \frac{t_1 - t_0}{2}, \quad f_t = \sqrt{(t_{\delta}^2 - t_{\mu}^2)^2 + 4 t_{\mu}^4} $$

The final 3D coordinates for each multi-sample $X_j$ are then derived and transformed into world space. Each $X_j$ is associated with an isotropic Gaussian with standard deviation $\sigma_j$, proportional to the frustum’s radius at that distance.
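As an illustration of the sampling geometry, the sketch below places six samples on the cross-section of a conical frustum at the angles listed above and assigns each an isotropic Gaussian standard deviation proportional to the local frustum radius. For readability, the depths are spread uniformly over $[t_0, t_1]$ rather than using the apex-concentrating transform above; the function name `conical_multisamples`, the `radius_per_t` parameter, and the proportionality constant for $\sigma_j$ are illustrative assumptions.

```python
import numpy as np

def conical_multisamples(origin, direction, radius_per_t, t0, t1):
    """Hexagonal multi-sample pattern inside one conical frustum.

    origin, direction: (3,) ray parameters; radius_per_t: frustum radius per unit depth.
    Returns the six world-space sample positions X_j and their Gaussian stds sigma_j.
    """
    thetas = np.array([0.0, 2*np.pi/3, 4*np.pi/3, np.pi, 5*np.pi/3, np.pi/3])
    t_j = np.linspace(t0, t1, 6)               # simplified depth placement (uniform, not apex-weighted)
    radii = radius_per_t * t_j                 # frustum radius at each sample depth

    # Orthonormal frame (u, v) spanning the plane perpendicular to the ray.
    d = direction / np.linalg.norm(direction)
    u = np.cross(d, np.array([0.0, 0.0, 1.0]))
    if np.linalg.norm(u) < 1e-6:               # ray nearly vertical: use a different reference axis
        u = np.cross(d, np.array([1.0, 0.0, 0.0]))
    u /= np.linalg.norm(u)
    v = np.cross(d, u)

    # Offset each sample from the ray axis by the local frustum radius at its angle.
    offsets = radii[:, None] * (np.cos(thetas)[:, None] * u + np.sin(thetas)[:, None] * v)
    X = origin + t_j[:, None] * d + offsets
    sigma = 0.5 * radii                        # isotropic Gaussian std, proportional to the frustum radius
    return X, sigma
```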

Instead of feeding single points into an MLP, we query the iNGP hash grid at each $X_j$. However, to prevent aliasing during this interpolation, we introduce a weight reduction mechanism. For each multi-sample $X_j$ and each grid level $l$ of resolution $n_l$, we compute a weighting factor $\omega_{j,l}$ based on the similarity between the sample’s Gaussian footprint and the grid cell size:

$$ \omega_{j,l} = \text{Erf}\left( \frac{1}{\sqrt{8}\,\sigma_j\, n_l} \right) $$

Here, $\text{Erf}(\cdot)$ is the error function. When $\sigma_j \gg 1/n_l$ (the sample footprint spans many cells of that grid level), $\omega_{j,l} \to 0$, suppressing fine-level features that would alias at that scale. When $\sigma_j \ll 1/n_l$, $\omega_{j,l} \to 1$ and the feature is fully incorporated. The interpolated feature $f_{j,l}$ from grid level $l$ at location $X_j$ is then attenuated:

$$ f_{j,l}^{\omega} = \omega_{j,l} \cdot f_{j,l} $$

The final feature for level $l$ is the mean of the weighted features across all $J$ multi-samples: $f_l = \frac{1}{J}\sum_{j=1}^{J} f_{j,l}^{\omega}$. These level features, along with the encoded weight factors themselves, are concatenated to form the final feature vector fed into a small MLP to predict density and color. This process inherently performs pre-filtering, ensuring the model learns a scale-aware representation ideal for the multi-resolution data captured by a China UAV drone flying at different altitudes.
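The sketch below illustrates this weighted lookup for one set of multi-samples. The `hash_grids[l](X)` call stands in for an iNGP-style interpolated feature query at level $l$ and is a hypothetical interface, as are the function name and the choice to append the mean weight per level as an extra feature.

```python
import torch

def prefiltered_features(hash_grids, X, sigma, resolutions):
    """Scale-aware feature lookup over a multi-resolution grid.

    hash_grids:  list of callables, hash_grids[l](X) -> (J, C) interpolated features (assumed interface).
    X:           (J, 3) multi-sample positions; sigma: (J,) Gaussian stds.
    resolutions: grid resolution n_l (cells per unit length) for each level l.
    """
    per_level = []
    for grid, n_l in zip(hash_grids, resolutions):
        f_jl = grid(X)                                         # (J, C) features at this level
        omega = torch.erf(1.0 / (8.0 ** 0.5 * sigma * n_l))    # (J,) down-weights levels finer than the footprint
        per_level.append((omega[:, None] * f_jl).mean(dim=0))  # average attenuated features over the J samples
        per_level.append(omega.mean().reshape(1))              # keep the weight itself as an extra feature
    return torch.cat(per_level)                                # concatenated input to the small MLP
```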

Enforcing Architectural Priors via Geometric Constraints

Photometric loss alone is insufficient for recovering accurate geometry, especially for man-made structures rich in planar surfaces and orthogonal edges. We augment the NeRF optimization with three complementary geometric losses, implemented via auxiliary prediction heads in our network; a compact code sketch of all three terms follows their descriptions below.

1. Normal Consistency Constraint: We encourage the predicted density field to have surface normals consistent with the observed photometric gradients. For a point on a surface, the normal $\mathbf{n}_{\text{pred}}$ can be derived as the gradient of the density field: $\mathbf{n}_{\text{pred}} = \nabla_{\mathbf{x}} \sigma / \|\nabla_{\mathbf{x}} \sigma\|$. We supervise this using normals estimated from multi-view image gradients via photometric stereo concepts. The loss minimizes the angular difference:

$$ \mathcal{L}_{\text{NC}} = \frac{1}{N} \sum_{i=1}^{N} \left( 1 - \frac{\mathbf{n}_{\text{pred}}^{(i)} \cdot \mathbf{n}_{\text{obs}}^{(i)}}{\|\mathbf{n}_{\text{pred}}^{(i)}\| \|\mathbf{n}_{\text{obs}}^{(i)}\|} \right) $$

2. Planar Fitting Constraint: We explicitly encourage the network to represent large, planar regions common in architecture. For a local patch of points $\mathcal{P}$, we add a loss that minimizes the point-to-plane distance for a fitted plane with parameters $(\mathbf{a}, d)$ where $\|\mathbf{a}\|=1$:

$$ \mathcal{L}_{\text{PF}} = \min_{\mathbf{a}, d} \sum_{\mathbf{x} \in \mathcal{P}} \left( \mathbf{a} \cdot \mathbf{x} - d \right)^2 $$
This is implemented by periodically sampling points in high-density regions and applying this regularization.

3. Vertical/Horizontal Constraint: To enforce the orthogonality of walls and the levelness of floors/roofs, we introduce a soft constraint on surface orientations. We define a set of dominant global directions $\mathcal{D} = \{\mathbf{v}_{\text{up}}, \mathbf{v}_{\text{north}}, \mathbf{v}_{\text{east}}\}$ estimated from the initial SfM point cloud or assumed. For a predicted normal $\mathbf{n}_{\text{pred}}$, we encourage it to align closely with one of these principal directions:

$$ \mathcal{L}_{\text{VH}} = \min_{\mathbf{d} \in \mathcal{D}} \left( 1 - | \mathbf{n}_{\text{pred}} \cdot \mathbf{d} | \right) $$
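Taken together, the three constraints can be implemented as small differentiable terms on sampled points and their predicted normals. The sketch below is a minimal PyTorch version under our own naming; `density_fn` stands in for the network's density head, `n_obs` for the image-derived normals, and `directions` for the stacked dominant directions, none of which are specified in code form in the text.

```python
import torch
import torch.nn.functional as F

def normal_consistency_loss(density_fn, x, n_obs):
    """L_NC: align density-gradient normals with normals observed from the images."""
    x = x.clone().requires_grad_(True)
    sigma = density_fn(x)                                         # (N,) predicted densities
    grad = torch.autograd.grad(sigma.sum(), x, create_graph=True)[0]
    n_pred = F.normalize(grad, dim=-1)                            # normal from the density field (sign as in the text)
    cos = (n_pred * F.normalize(n_obs, dim=-1)).sum(dim=-1)
    return (1.0 - cos).mean()

def planar_fitting_loss(points):
    """L_PF: squared point-to-plane residuals of a patch against its least-squares plane."""
    centroid = points.mean(dim=0)
    _, _, Vh = torch.linalg.svd(points - centroid, full_matrices=False)
    a = Vh[-1]                                                    # unit normal: smallest singular direction
    d = torch.dot(a, centroid)                                    # offset so that a . x = d on the plane
    return ((points @ a - d) ** 2).sum()

def vertical_horizontal_loss(n_pred, directions):
    """L_VH: pull each predicted normal toward the closest dominant direction."""
    n = F.normalize(n_pred, dim=-1)                               # (N, 3)
    align = torch.abs(n @ directions.T)                           # (N, K) |n . d| per direction
    return (1.0 - align.max(dim=-1).values).mean()
```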

The total loss for training Structure-NeRF is a weighted combination:

$$ \mathcal{L}_{\text{total}} = \mathcal{L}_{\text{rgb}} + \lambda_{\text{NC}} \mathcal{L}_{\text{NC}} + \lambda_{\text{PF}} \mathcal{L}_{\text{PF}} + \lambda_{\text{VH}} \mathcal{L}_{\text{VH}} $$

where $\mathcal{L}_{\text{rgb}}$ is the standard mean squared error between rendered and true pixel colors.
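A minimal combination of the terms is shown below; the $\lambda$ values are illustrative assumptions, as the text does not report the weights used.

```python
def total_loss(loss_rgb, loss_nc, loss_pf, loss_vh,
               lambda_nc=0.05, lambda_pf=0.01, lambda_vh=0.01):
    """Weighted sum of the photometric and geometric terms; the lambdas shown are assumed, not reported."""
    return loss_rgb + lambda_nc * loss_nc + lambda_pf * loss_pf + lambda_vh * loss_vh
```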

Structural Extraction via Manhattan World Model

Once the high-quality NeRF model is trained, we extract a clean, watertight, and dimensionally accurate polygonal model using the Manhattan World assumption. This assumption, which holds well for most urban architecture, states that surfaces are predominantly aligned with three orthogonal directions. Our extraction pipeline works on the dense point cloud obtained by querying the NeRF density field on a high-resolution 3D grid (e.g., via marching cubes for an iso-surface).
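One way to realize this querying step is sketched below using scikit-image's marching cubes; the function name, bounding box handling, grid resolution, and iso level are illustrative assumptions, and `density_fn` again stands in for a batched query of the trained density field.

```python
import numpy as np
from skimage import measure

def extract_surface_points(density_fn, bbox_min, bbox_max, resolution=128, iso=10.0):
    """Query the trained density field on a regular 3D grid and extract an iso-surface mesh."""
    bbox_min, bbox_max = np.asarray(bbox_min), np.asarray(bbox_max)
    axes = [np.linspace(bbox_min[i], bbox_max[i], resolution) for i in range(3)]
    grid = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1)          # (R, R, R, 3) query positions
    density = density_fn(grid.reshape(-1, 3)).reshape(resolution, resolution, resolution)
    verts, faces, normals, _ = measure.marching_cubes(density, level=iso)
    # Marching cubes returns vertices in grid index units; map them back to world coordinates.
    scale = (bbox_max - bbox_min) / (resolution - 1)
    return verts * scale + bbox_min, faces, normals
```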

The MWM analysis involves two main steps, both illustrated in a short code sketch after the list:

1. Point Classification & Clustering: Each point $\mathbf{p}_i$ in the extracted cloud is analyzed. We compute its distance to a set of candidate planes aligned with the dominant directions. The distance function for a point $(x, y, z)$ to a plane set is:

$$ D(x,y,z) = \sum_{k=1}^{K} \min(d_k(x,y,z), \tau_k) $$

where $d_k$ is the distance to the $k$-th plane hypothesis and $\tau_k$ is a threshold. Points are classified as belonging to a wall, floor, roof, edge, or corner based on proximity to these planar regions. This filters out noise and non-structural details.

2. Volumetric Reconstruction & Dimensioning: Classified points are clustered, and large planar clusters are identified as major structural components (e.g., the south facade, the flat roof). These clusters are used to fit infinite planes, which are then intersected to form a closed polyhedral volume representing the building’s main massing. Crucially, this representation allows for direct measurement. For example, the building height $H$ is calculated as the vertical distance between the ground plane and the roof plane, scaled by the scene’s metric scale factor $S$ derived from the China UAV drone’s camera parameters or ground control points:

$$ H = S \cdot | z_{\text{roof}} - z_{\text{ground}} | $$

Similarly, width and length are computed from the intersections of vertical wall planes. This process converts the implicit neural field into an explicit, measurable CAD-like model suitable for architectural applications.
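The sketch below illustrates both steps on the extracted points: truncated distances to the Manhattan-aligned plane hypotheses for classification, and simple dimensioning from the fitted planes. The array shapes, function names, and parameters are our own assumptions.

```python
import numpy as np

def classify_points(points, plane_normals, plane_offsets, taus):
    """Truncated distance score D(x, y, z) and nearest-plane label for each extracted point.

    points: (N, 3); plane_normals: (K, 3) unit normals a_k; plane_offsets: (K,) offsets d_k;
    taus: (K,) truncation thresholds tau_k.
    """
    dists = np.abs(points @ plane_normals.T - plane_offsets)   # (N, K) unsigned point-to-plane distances
    D = np.minimum(dists, taus).sum(axis=1)                    # the truncated sum from the text
    nearest = dists.argmin(axis=1)                             # which plane hypothesis each point supports
    return D, nearest

def building_height(z_ground, z_roof, metric_scale):
    """H = S * |z_roof - z_ground| from the fitted ground and roof planes."""
    return metric_scale * abs(z_roof - z_ground)

def facade_width(offset_a, offset_b, metric_scale):
    """Separation of two parallel vertical wall planes sharing a unit normal (difference of offsets)."""
    return metric_scale * abs(offset_a - offset_b)
```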

Experimental Validation and Comparative Analysis

We validate our method using a dataset of a complex library building, comprising 98 high-resolution images captured by a commercial China UAV drone (DJI Mavic 3 Pro with a 20MP 4/3″ CMOS sensor) in a systematic oblique photography pattern. Camera poses were estimated using COLMAP. All experiments were conducted on a system with an NVIDIA RTX 3090 GPU.

The qualitative results demonstrate Structure-NeRF’s superiority. The model generates sharp, detailed renderings from novel viewpoints (front, side, and rear), accurately capturing window mullions, roof structures, and texture details without the blurring or artifacts present in baseline renders. The application of the MWM successfully extracts a clean, watertight polygonal model, isolating the primary building mass from surrounding clutter.

Quantitative comparisons against two leading NeRF variants—mip-NeRF and iNGP—were conducted on the same dataset. The metrics clearly show the advantages of our integrated approach. The following table summarizes the key performance indicators:

| Metric | Structure-NeRF (Ours) | mip-NeRF | iNGP |
| --- | --- | --- | --- |
| Peak Signal-to-Noise Ratio (PSNR) [dB, higher is better] | 36.23 | 32.45 | 28.65 |
| Structural Similarity Index (SSIM) [higher is better] | 0.93 | 0.82 | 0.62 |
| Training Time [hours, lower is better] | 2.26 | 4.55 | 0.84 |
| Perceptual Distance (LPIPS) [lower is better] | 0.26 | 0.08 | 0.03 |

The analysis shows that Structure-NeRF achieves the highest rendering fidelity, with a PSNR of 36.23 dB, an improvement of 3.78 dB over mip-NeRF and 7.58 dB over iNGP. In terms of structural similarity (SSIM), our model scores 0.93, outperforming mip-NeRF by 13.41% and iNGP by 50.00%. While the pure iNGP model trains fastest (0.84 hours), its output quality is markedly lower. Our model strikes an effective balance, reducing mip-NeRF's training time by approximately 50% while delivering superior visual and geometric quality. Note that LPIPS is a perceptual distance in which lower values indicate closer agreement with the reference images, so the baselines score lower on this metric. The geometric constraints are instrumental in the PSNR and SSIM gains, ensuring surfaces are planar and regularized, which leads to more coherent novel views and a more accurate underlying density field for the subsequent MWM extraction.

The effectiveness of our geometric pipeline is further evidenced when examining challenging areas like clusters of buildings with intricate facades. Where mip-NeRF output shows noise and blur, and iNGP results exhibit fractured geometry and poor texture, Structure-NeRF consistently produces clean, sharp, and structurally plausible reconstructions. This robustness is essential for real-world applications where China UAV drone surveys may encounter varying lighting, occlusions, and complex urban layouts.

Discussion and Implications

The successful development and validation of Structure-NeRF have significant implications for the field of architectural surveying and 3D modeling using China UAV drone technology. First, it demonstrates that neural implicit representations can be effectively tailored for large-scale, structured environments by incorporating domain-specific knowledge. The integration of multi-scale anti-aliasing, efficient encoding, and geometric priors transforms NeRF from a purely novel-view synthesis tool into a powerful 3D reconstruction engine.

Second, the coupling with the Manhattan World Model provides a practical bridge between the implicit neural representation and the explicit, dimensioned models required by architects, engineers, and conservationists. This end-to-end pipeline—from China UAV drone imagery to a measurable 3D model—streamlines workflows that traditionally involved multiple software packages and manual intervention.

There are, however, limitations and avenues for future work. The Manhattan World assumption, while widely applicable, may break down for highly irregular or organic architectural styles. Future versions could incorporate more general piecewise-planar or curved surface priors. The training, though faster than mip-NeRF, still requires several hours on a high-end GPU; further research into distillation or faster hash-based representations could push this towards real-time applications. Additionally, the current method treats the scene statically; extending it to handle dynamic elements (e.g., moving vehicles, vegetation) in a China UAV drone flyover video would be a valuable direction.

Conclusion

This work presents Structure-NeRF, a comprehensive framework for high-precision 3D architectural reconstruction from multi-view imagery captured by China UAV drones. By innovatively fusing the multi-scale, anti-aliasing capabilities of mip-NeRF with the computational efficiency of iNGP’s hash encoding, and critically augmenting the optimization with architectural geometric constraints (normal consistency, planar fitting, and vertical/horizontal alignment), we achieve a significant leap in both training efficiency and output quality. The subsequent application of the Manhattan World Model allows for the automatic extraction of the building’s primary structural components as a clean, polyhedral model with directly measurable dimensions.

Our experimental results on complex building datasets confirm that Structure-NeRF outperforms state-of-the-art NeRF variants in key metrics like PSNR and SSIM while substantially reducing training time. The method produces not only photorealistic renderings but also a geometrically accurate and regularized 3D representation that is directly usable for professional architectural, engineering, and cultural heritage documentation purposes. This advancement underscores the potent synergy between advanced neural rendering techniques and domain-specific prior knowledge, paving the way for more automated, reliable, and detailed digital modeling of the built environment from ubiquitous China UAV drone sensor data.
