Semantic-Guided UAV Navigation for Zero-Shot Target Search via Multimodal Reasoning
This work introduces a navigation framework for unmanned aerial vehicles (UAVs) that targets zero-shot object search in completely unknown three-dimensional environments. Its core idea is to fuse multimodal large model (MLM) reasoning with autonomous exploration planning, so that a UAV can go beyond purely geometric coverage and use high-level semantic cues to find targets such as ‘a yellow bicycle helmet’ or ‘an umbrella’ without task-specific training. The framework addresses three limitations of existing methods: multimodal models cannot directly process spatial data, generalization across scenes is poor, and the simulation-to-reality gap is large. It combines a ‘space-vision inverse mapping’ technique, a prompting strategy built on a ‘recognition-evaluation-transfer’ logic, and an adaptive ‘geometric-semantic asynchronous gain fusion’ mechanism to balance exhaustive spatial exploration against efficient, semantically driven search.

Problem Formulation and System Architecture
Target search in an unknown environment is formalized as the problem of determining the three-dimensional coordinates of a target point within an unbounded space Ω ⊂ ℝ³. The goal for the UAV is to minimize the search time T and path length L while maximizing the discovery success rate. The system architecture integrates two core modules:
- Target Clue Reasoning & Localization Module: Responsible for processing sensor data and querying a Vision-Language Model (VLM) and a Large Language Model (LLM) to estimate the target’s spatial probability distribution.
- Navigation & Planning Module: Handles autonomous exploration, trajectory generation, and low-level flight control, and integrates the target cues into its decision-making process.
To facilitate this, the search space is partitioned into a finite set of Explorable Regions of Interest (EROIs):
$$\mathcal{E} = \{ \mathbf{e}_i \mid \mathbf{e}_i \subset \Omega, \; i = 1, \dots, N \} $$
Each EROI, eᵢ, is assigned two fundamental attributes: a geometric gain derived from exploration algorithms, and a semantic value estimated by the MLM, representing the probability that the target resides within that region.
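As a rough illustration of the two attributes attached to each EROI, the following minimal sketch stores them in a small container. The class name `EROI`, its field names, and the helper `make_erois` are hypothetical, and initializing every semantic value to the uniform prior 1/N is an assumption based on the baseline probability used later in the text.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class EROI:
    """One explorable region of interest (hypothetical container)."""
    center: Tuple[float, float, float]  # region center in the world frame, metres
    geometric_gain: float               # supplied by the exploration algorithm
    semantic_value: float               # estimated probability the target is here


def make_erois(centers, gains):
    """Build the EROI set with a uniform semantic prior of 1/N (an assumption)."""
    n = len(centers)
    return [EROI(tuple(c), g, 1.0 / n) for c, g in zip(centers, gains)]


regions = make_erois(
    [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (5.0, 5.0, 0.0), (0.0, 5.0, 0.0)],
    [1.0, 2.0, 0.5, 1.5],
)
```

Keeping the two attributes side by side makes it straightforward for a planner to read both the geometric gain and the semantic value of a region in one place.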
Core Methodology: Space-Vision Inverse Mapping
The primary bottleneck in applying VLMs to UAV navigation is their difficulty in processing 3D spatial coordinates. To overcome this, a ‘space-vision inverse mapping’ method embeds explicit 3D coordinate anchors directly onto the 2D image, transforming a raw camera observation into an ‘image-location joint representation’ Iₜ.
The process involves four key steps:
1. Coordinate System Transformation: Points in the world frame, p_w ∈ P_E, are transformed into the camera’s optical frame as p_c. The rotation from the world frame (w) to the optical frame (c) is calculated from the camera pose, given by the position t_w and the orientation quaternion q.
$$\mathbf{R}_w^c = \mathbf{R}_u^c (\mathbf{R}_w^u)^\top $$
$$\mathbf{p}_c = \mathbf{R}_w^c (\mathbf{p}_w - \mathbf{t}_w) = \begin{bmatrix} X \\ Y \\ Z \end{bmatrix}$$
2. Pinhole Projection: The 3D points are projected onto the 2D image plane using the camera intrinsic matrix K.
$$\mathbf{K} = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}$$
$$u = \frac{f_x X}{Z} + c_x, \quad v = \frac{f_y Y}{Z} + c_y$$
3. Character Encoding: A unique, human- and machine-readable code ‘H(p)’ is generated for each visible EROI center point. The code uses an ‘Alphabet-Number-Alphabet’ format, for instance converting a coordinate into ‘G5A’. The pseudocode is:
Algorithm 1: Spatial Coordinate Character Encoding
Input: EROI center point p = (x, y, z), grid resolution Δ, grid origin (x_min, y_min, z_min)
- Compute indices: i_x = ⌊(x - x_min)/Δ⌋; i_y = ⌊(y - y_min)/Δ⌋ + 1; i_z = ⌊(z - z_min)/Δ⌋
- Map to characters: c_x = chr(‘A’ + i_x); c_z = chr(‘A’ + i_z)
- Output: H(p) = ⟨c_x, i_y, c_z⟩
4. Occlusion Culling: To prevent visual clutter, only non-overlapping character codes are rendered. Codes with bounding boxes Bᵢ are added to an accepted set A only if they do not overlap with any previously accepted code: Bᵢ ∩ Bⱼ = ∅, ∀H(pⱼ) ∈ A.
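The transformation, projection, and culling steps above can be sketched in a few lines of plain Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, the world-to-camera rotation is passed in directly as a 3×3 matrix rather than derived from the quaternion, every label is given the same fixed-size box, and the check that a pixel lies inside the image bounds is omitted.

```python
def world_to_camera(p_w, R_wc, t_w):
    """Compute p_c = R_wc (p_w - t_w); R_wc is a list of three rows."""
    d = [p_w[i] - t_w[i] for i in range(3)]
    return [sum(R_wc[r][i] * d[i] for i in range(3)) for r in range(3)]


def project(p_c, fx, fy, cx, cy):
    """Pinhole projection; returns None for points at or behind the camera."""
    X, Y, Z = p_c
    if Z <= 0.0:
        return None
    return (fx * X / Z + cx, fy * Y / Z + cy)


def boxes_overlap(a, b):
    """Axis-aligned boxes given as (u_min, v_min, u_max, v_max)."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def cull_labels(pixels, half_w=20.0, half_h=8.0):
    """Greedily keep labels whose boxes overlap no previously accepted box."""
    accepted = []  # list of (index, box)
    for i, (u, v) in enumerate(pixels):
        box = (u - half_w, v - half_h, u + half_w, v + half_h)
        if all(not boxes_overlap(box, b) for _, b in accepted):
            accepted.append((i, box))
    return [i for i, _ in accepted]


# Hypothetical example values: identity rotation, camera at the world origin.
R_identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
p_c = world_to_camera([1.0, 0.5, 4.0], R_identity, [0.0, 0.0, 0.0])
uv = project(p_c, fx=400.0, fy=400.0, cx=320.0, cy=240.0)
kept = cull_labels([(100.0, 100.0), (110.0, 102.0), (300.0, 100.0)])
```

In this example the second label is dropped because its box overlaps the first, while the third is far enough away to be kept. The greedy order means earlier labels take priority, which matches the "previously accepted" wording of the culling rule.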
Rendering these codes onto the image lets the VLM ‘see’ the 3D structure of the scene directly. For instance, in a real-world outdoor experiment, the model parsed the encoded image and output: “Code: G5A, Description: yellow shared bicycle ground area”, which corresponds to the world coordinate point pₜ = [13.0 m, -8.0 m, 0.0 m]ᵀ. The code is converted back to this coordinate by the inverse mapping function H⁻¹.
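The encoding H and its inverse H⁻¹ can be illustrated with a short sketch. The grid origin `P_MIN` and resolution `DELTA` below are hypothetical values chosen only so that the example reproduces the ‘G5A’ code and the coordinate quoted above; the source does not state the actual grid parameters. The sketch also assumes that H⁻¹ returns the lower corner of the grid cell and that at most 26 cells exist along the x and z axes, since each is mapped to a single letter.

```python
import math

# Hypothetical grid parameters (not given in the source).
P_MIN = (1.0, -16.0, 0.0)
DELTA = 2.0


def encode(p, p_min=P_MIN, delta=DELTA):
    """H(p): letter for x, 1-based number for y, letter for z."""
    ix = math.floor((p[0] - p_min[0]) / delta)
    iy = math.floor((p[1] - p_min[1]) / delta) + 1
    iz = math.floor((p[2] - p_min[2]) / delta)
    if not (0 <= ix < 26 and 0 <= iz < 26 and iy >= 1):
        raise ValueError("point lies outside the encodable grid")
    return f"{chr(ord('A') + ix)}{iy}{chr(ord('A') + iz)}"


def decode(code, p_min=P_MIN, delta=DELTA):
    """H^-1(code): lower corner of the grid cell the code refers to."""
    ix = ord(code[0]) - ord("A")
    iy = int(code[1:-1]) - 1
    iz = ord(code[-1]) - ord("A")
    return (
        p_min[0] + ix * delta,
        p_min[1] + iy * delta,
        p_min[2] + iz * delta,
    )


code = encode((13.0, -8.0, 0.0))
point = decode(code)
```

Because the code is quantized to the grid, decoding recovers the original point exactly only when that point lies on a grid node; otherwise it returns the corner of the enclosing cell.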
Semantic Value Assessment and Navigation Decision
Prompt Engineering for Zero-Shot Generalization
A structured prompt strategy guides the VLM’s reasoning process. The prompt embeds a ‘Recognition-Evaluation-Transfer’ logic that lets the UAV imitate human-like search behavior. First, the model ‘recognizes’ the current scene type and its relevance to the target. Then, it ‘evaluates’ the scene for target-related clues (e.g., ‘a yellow bicycle’) or potential transfer points (e.g., ‘doorways’). Finally, it decides to ‘transfer’ to a new scene if the current one is deemed irrelevant. The target can be changed zero-shot by having an LLM replace the target object in the prompt template, e.g., from ‘helmet’ to ‘umbrella’, while the reasoning logic stays the same:
$$\text{P}^*_{\text{new}} = \text{LLM}(\text{P}^*_{\text{old}}, \text{T}_{\text{new}})$$
$$\text{Logic}(\text{P}^*_{\text{new}}) = \text{Logic}(\text{P}^*_{\text{old}})$$
$$\text{Target}(\text{P}^*_{\text{new}}) = \text{T}_{\text{new}}$$
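The two invariants above (the logic is preserved, only the target changes) can be illustrated with a deliberately simpler stand-in: plain string substitution in place of the LLM rewrite described in the text. The prompt wording below is entirely hypothetical and is not the prompt used in the paper.

```python
from string import Template

# Hypothetical prompt skeleton; only the $target slot changes between tasks.
PROMPT = Template(
    "You are guiding a UAV that is searching for: $target.\n"
    "1. Recognition: name the scene type and rate its relevance to $target.\n"
    "2. Evaluation: list coordinate codes that show clues for $target "
    "or possible transfer points such as doorways.\n"
    "3. Transfer: if the scene is irrelevant, pick a code that leads elsewhere.\n"
)


def retarget(template, new_target):
    """Return a prompt with the same logic and a new target description."""
    return template.substitute(target=new_target)


old_prompt = retarget(PROMPT, "a yellow bicycle helmet")
new_prompt = retarget(PROMPT, "an umbrella")
```

An LLM-based rewrite can additionally adapt target-specific hints, such as which related objects count as clues, which a fixed template cannot do.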
Hierarchical Semantic Value Update via Subjective Bayesian Probability
The system maintains a constantly updated probability of target existence for each EROI, ŝₜ(eᵢ). This value fuses short-term VLM assessments with long-term insights from an LLM reasoning over an open-vocabulary semantic map. The update follows a subjective Bayesian framework:
1. Online VLM Evaluation: When the VLM outputs a strong positive clue for a point pₜ, the probability of each of the m EROIs in its neighborhood Nᵣ(pₜ) is updated. The update rule simplifies to a weighted average towards 1:
$$s(e_j) \leftarrow (1 - \frac{1}{m})s(e_j) + \frac{1}{m}, \quad \forall e_j \in N_r(p_t)$$
2. Map-based LLM Evaluation: When an aggregated region rₙ receives a negative score from the LLM, the semantic value is attenuated toward the baseline p̃:
$$s(e_j) \leftarrow \tilde{p} + [s(e_j) - \tilde{p}] \times [\frac{1}{5}s(r_n) + 1] \times \frac{1}{k}, \quad \text{for } s(r_n) \in [-5, 0]$$
For a positive LLM score, the update is more aggressive:
$$s(e_j) \leftarrow s(e_j) + [1 - s(e_j)] \times \frac{s(r_n)}{5k}, \quad \text{for } s(r_n) \in (0, 5]$$
Because negative evidence pulls a region’s value back to the baseline probability p̃ = 1/N rather than to zero, the UAV never completely abandons its autonomous exploration capability.
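The three update rules can be written out directly. The sketch below is a literal transcription of the formulas under stated assumptions: the function names are hypothetical, and since the text does not define the coefficient k, it is passed in as a plain parameter and set to 1 in the example.

```python
def vlm_positive_update(s, m):
    """Weighted average towards 1 for each of the m neighbouring EROIs."""
    return (1.0 - 1.0 / m) * s + 1.0 / m


def llm_update(s, score, k, p_base):
    """Map-based update for an LLM region score in [-5, 5]."""
    if not -5.0 <= score <= 5.0:
        raise ValueError("score must lie in [-5, 5]")
    if score <= 0.0:
        # Negative evidence: shrink the distance to the baseline p_base.
        return p_base + (s - p_base) * (score / 5.0 + 1.0) / k
    # Positive evidence: close part of the gap to 1.
    return s + (1.0 - s) * score / (5.0 * k)


# Worked example with N = 10 regions, so the baseline is 1/N = 0.1.
p_base = 1.0 / 10
s1 = vlm_positive_update(p_base, m=4)                 # 0.75 * 0.1 + 0.25
s2 = llm_update(s1, score=-5.0, k=1.0, p_base=p_base)  # strongest negative score
s3 = llm_update(0.5, score=5.0, k=1.0, p_base=p_base)  # strongest positive score
```

With these example values, a strong VLM clue lifts the value from 0.1 to 0.325, the strongest negative LLM score returns it to the baseline 0.1 rather than to zero, and the strongest positive score with k = 1 drives a value of 0.5 to 1.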
Dynamic Path Cost Evaluation for Decision Making
The core decision-making unit selects the best navigation target by evaluating a utility function U(γ) for a potential path γ from the current location xᵣ to a candidate node xᵢ. The key innovation is a dynamic threshold τ that determines whether the system should prioritize semantic clues or path length efficiency.
$$U(\gamma) = \begin{cases} I(\gamma) s(x_i), & s(x_i) > \tau \\ I(\gamma) s(x_i) e^{-\lambda L(\gamma)}, & s(x_i) \leq \tau \end{cases}$$
Here, I(γ) is the geometric gain, s(xᵢ) is the semantic value of the target node, L(γ) is the path length cost, λ is a cost coefficient, and τ = k · maxⱼ s(xⱼ). When a node’s semantic value is significantly high, the system ignores path length and ‘greedily’ moves toward it, effectively exploiting the learned semantic clue. Otherwise, it considers both the exploration benefit and the cost to reach it, ensuring efficient coverage of the unknown space. The graph search is implemented using Dijkstra’s algorithm on the roadmap graph R = (V, E).
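A minimal sketch of this decision rule follows, combining Dijkstra’s algorithm for path lengths with the piecewise utility. The function names, the adjacency-dict roadmap format, the per-candidate treatment of the geometric gain, and the values k = 0.8 and λ = 0.1 are all assumptions made for illustration.

```python
import heapq
import math


def dijkstra(graph, source):
    """Shortest path lengths from source on a weighted adjacency dict."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale queue entry
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, math.inf):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


def utility(gain, s, length, tau, lam):
    """Piecewise utility: ignore path length when s exceeds the threshold."""
    if s > tau:
        return gain * s
    return gain * s * math.exp(-lam * length)


def select_goal(graph, source, candidates, k=0.8, lam=0.1):
    """candidates maps node -> (geometric gain, semantic value)."""
    dist = dijkstra(graph, source)
    tau = k * max(s for _, s in candidates.values())
    reachable = [n for n in candidates if n in dist]
    return max(reachable, key=lambda n: utility(*candidates[n], dist[n], tau, lam))


# Hypothetical roadmap: "a" is close with a weak clue, "b" is far with a strong one.
graph = {"r": [("a", 2.0), ("m", 10.0)], "m": [("b", 20.0)], "a": [], "b": []}
candidates = {"a": (1.0, 0.2), "b": (1.0, 0.9)}
goal = select_goal(graph, "r", candidates)
```

Here τ = 0.8 × 0.9 = 0.72, so node "b" exceeds the threshold and is scored without a length penalty, while node "a" is discounted by its path length. The far but semantically strong node is therefore selected.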
Experimental Validation: Simulation and Real-World Results
Simulation Setup and Comparative Analysis
Three distinct Gazebo simulation environments were designed to test the framework: a ‘Road-Vehicle’ scenario, a ‘Building-Vehicle’ scenario, and a ‘River-Bridge’ scenario. The proposed method was compared against two baselines: a standard autonomous exploration method (FSMP) and a target-specific search method (Star-Searcher). The following table summarizes the quantitative results. In all three scenarios, the proposed method achieves the shortest path, the shortest search time, and the highest SPL.
| Scenario | Method | Path Length (m) | Search Time (s) | Success (%) | SPL (%) |
|---|---|---|---|---|---|
| Road-Vehicle (45 × 48 × 2 m³) | Proposed | 96.86±13.46 | 113.85±11.12 | 100 | 65.4 |
| | FSMP | 183.95±71.54 | 202.48±73.51 | 100 | 37.3 |
| | Star-Searcher | 283.04±71.83 | 450.58±171.30 | 60 | 15.9 |
| Building-Vehicle (40 × 40 × 2 m³) | Proposed | 38.93±3.92 | 49.27±3.72 | 100 | 77.4 |
| | FSMP | 156.43±84.61 | 173.98±88.15 | 100 | 29.8 |
| | Star-Searcher | 207.76±181.66 | 238.99±261.66 | 80 | 32.4 |
| River-Bridge (65 × 47 × 2 m³) | Proposed | 158.14±26.99 | 183.50±30.56 | 100 | 38.4 |
| | FSMP | 208.95±79.86 | 233.50±83.99 | 100 | 31.7 |
| | Star-Searcher | 212.41±56.63 | 467.97±173.03 | 60 | 20.0 |
The advantage is largest in the ‘Building-Vehicle’ scenario, where the proposed method’s path length (38.93 m) is roughly one quarter of FSMP’s (156.43 m), showing how effectively the ‘Recognition-Evaluation-Transfer’ logic identifies a high-value search area.
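The SPL column is not defined in this section. Assuming it follows the commonly used ‘success weighted by path length’ metric for navigation benchmarks, each successful trial contributes the ratio of the shortest possible path to the path actually flown, and failed trials contribute zero. The sketch below shows that computation; the function name and the trial values are hypothetical and are not taken from the experiments.

```python
def spl(trials):
    """trials: iterable of (success, shortest_length, actual_length)."""
    trials = list(trials)
    total = 0.0
    for success, shortest, actual in trials:
        if success:
            # The ratio is capped at 1 when the actual path equals the shortest.
            total += shortest / max(actual, shortest)
    return total / len(trials)


# Hypothetical trials: one detour, one optimal run, one failure.
example = [(True, 50.0, 100.0), (True, 50.0, 50.0), (False, 50.0, 80.0)]
score = spl(example)
```

For these three trials the contributions are 0.5, 1.0, and 0, giving an SPL of 0.5. Under this definition a method can reach 100% success and still score a low SPL if its paths are long, which is consistent with how FSMP appears in the table.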
Ablation Study and Inference Latency
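An ablation study quantified the contribution of each module. The ‘weighted decay scoring baseline’, which lacks explicit 3D coordinate encoding, produced longer paths in all three scenarios, with the largest drop in River-Bridge (265.70 m versus 158.14 m). The ‘structured prompt simplification’ lengthened paths in the Road-Vehicle and Building-Vehicle scenarios (117.28 m and 56.73 m versus 96.86 m and 38.93 m), which supports the value of the structured logic there; in River-Bridge, the simplified prompt yielded a shorter path (142.70 m versus 158.14 m).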
| Scenario | Method | Path Length (m) | Search Time (s) | SPL (%) |
|---|---|---|---|---|
| Road-Vehicle | (i) Full System | 96.86 | 113.85 | 65.4 |
| | (ii) Weighted Decay | 115.29 | 132.93 | 54.7 |
| | (iii) No Dynamic Cost | 106.29 | 121.96 | 61.0 |
| | (iv) Simplified Prompt | 117.28 | 137.85 | 56.0 |
| | (v) No Map Evaluation | 139.47 | 161.19 | 50.4 |
| Building-Vehicle | (i) Full System | 38.93 | 49.27 | 77.4 |
| | (ii) Weighted Decay | 60.03 | 73.19 | 51.9 |
| | (iii) No Dynamic Cost | 37.82 | 50.91 | 83.1 |
| | (iv) Simplified Prompt | 56.73 | 71.42 | 66.6 |
| | (v) No Map Evaluation | 37.70 | 53.15 | 79.6 |
| River-Bridge | (i) Full System | 158.14 | 183.50 | 38.4 |
| | (ii) Weighted Decay | 265.70 | 295.79 | 31.5 |
| | (iii) No Dynamic Cost | 120.49 | 137.31 | 49.9 |
| | (iv) Simplified Prompt | 142.70 | 164.63 | 43.0 |
| | (v) No Map Evaluation | 174.74 | 200.69 | 36.3 |
The inference latency analysis showed that the VLM (qwen3-vl-plus) and LLM (qwen-plus) modules had average delays of 1.51 s and 0.35 s, respectively. Because planning and MLM evaluation run asynchronously, this latency neither blocks the UAV’s core flight control nor forces it to stop. A late response only delays an update to the target probability map, so the system remains robust under degraded network conditions.
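The asynchronous design can be illustrated with a small `asyncio` sketch in which a slow model query and a fixed-rate planner loop share one semantic map. Everything here is a stand-in chosen for illustration: the function names are hypothetical, the model call is simulated with a sleep, and the delays are scaled down from the reported latencies so the example runs quickly.

```python
import asyncio


async def fake_vlm_query(delay):
    """Stand-in for a remote VLM call; the sleep models network latency."""
    await asyncio.sleep(delay)
    return {"G5A": 0.9}


async def semantic_worker(sem_map, delay):
    """Update the shared map whenever a (slow) model response arrives."""
    result = await fake_vlm_query(delay)
    sem_map.update(result)


async def planner_loop(sem_map, ticks, period, log):
    """Fixed-rate planning loop that only reads the latest map values."""
    for t in range(ticks):
        log.append((t, dict(sem_map)))  # plan with whatever is known now
        await asyncio.sleep(period)


async def main():
    sem_map = {"G5A": 0.1}
    log = []
    await asyncio.gather(
        semantic_worker(sem_map, delay=0.05),
        planner_loop(sem_map, ticks=20, period=0.01, log=log),
    )
    return log


log = asyncio.run(main())
```

The planner completes every tick regardless of how long the model takes. Early ticks plan with the old value, and later ticks pick up the new value once the response has arrived.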
Conclusion and Future Work
This paper has presented a framework for zero-shot target search by a UAV that bridges high-level semantic reasoning from large models and low-level autonomous exploration. The proposed ‘space-vision inverse mapping’ gives a vision-language model 3D spatial awareness, while the dynamic path cost mechanism adaptively balances exploring the unknown against exploiting known semantic clues. Simulation and real-world flight experiments in unstructured outdoor environments, including successful searches for a ‘helmet’ and an ‘umbrella’, validate the method’s effectiveness and generalization capability.
Future research will focus on three areas. First, the method’s current limitations with highly dynamic targets will be addressed by introducing a forgetting factor for historical semantic evidence. Second, the framework will be extended to multi-target search and cooperative multi-UAV scenarios. Third, the MLMs will be compressed and optimized for direct deployment on the UAV’s edge computing hardware, reducing the dependency on a steady internet connection and making the system more resilient and practical for real-world use.
