Unlocking Zero-Shot Target Search: A Multimodal Large Model Framework for China UAV Drones

Navigating unknown environments to locate specific targets is a fundamental challenge for autonomous systems. For China UAV drones operating in complex, unstructured outdoor settings, the task is particularly demanding: there is no prior map, semantic understanding must happen in real time, and exploration must be efficient. Traditional methods often rely on exhaustive geometric coverage, which is time-consuming and ineffective when targets are distributed according to semantic patterns. To address this, we propose a navigation framework that integrates the reasoning capabilities of multimodal large models with autonomous exploration planning. The approach is designed to bridge the gap between high-level semantic reasoning and low-level spatial control for China UAV drones, enabling zero-shot generalization to previously unseen scenarios.

Methodology Framework

Our core insight is that the search for a target in an unknown environment is essentially a dynamic balancing act between “spatial autonomous exploration” and “semantic regularity exploitation.” The framework is structured into two primary modules: a reasoning module for target cue extraction and a planning module for motion decision-making. The entire process is orchestrated to allow a China UAV drone to infer where a target might be based on semantic clues from a vision-language model and then prioritize those areas during autonomous exploration.

Spatio-Visual Inverse Mapping

A primary limitation of vision-language models (VLMs) is their inability to process 3D spatial data directly. To equip our China UAV drone with spatial awareness, we designed a Spatio-Visual Inverse Mapping technique. This method projects the 3D coordinates of explorable regions onto the UAV’s 2D camera image and draws them as text overlays. The space Ω ⊂ R³ is uniformly partitioned into N Explorable Regions of Interest (EROIs):

$$ E = \{e_i | e_i \subset \Omega, i = 1, \dots, N\} $$
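
One way to picture this partition is a regular grid over an axis-aligned box. The sketch below is a minimal illustration under that assumption; the function name `eroi_centers`, the box corners, and the cubic cell size are hypothetical and do not come from the original method.

```python
from itertools import product

def eroi_centers(lo, hi, cell):
    """Return the centers of a uniform grid of cubic cells covering [lo, hi].

    lo, hi: (x, y, z) corners of the search volume (assumed axis-aligned).
    cell:   edge length of each region (hypothetical parameter); the box
            extent is assumed to be a multiple of it along every axis.
    """
    counts = [max(1, round((h - l) / cell)) for l, h in zip(lo, hi)]
    axes = [
        [l + (i + 0.5) * cell for i in range(n)]
        for l, n in zip(lo, counts)
    ]
    return [tuple(p) for p in product(*axes)]

# A 4 x 4 x 2 m box split into 2 m cells gives N = 4 regions.
centers = eroi_centers(lo=(0.0, 0.0, 0.0), hi=(4.0, 4.0, 2.0), cell=2.0)
```

Each returned tuple plays the role of one center p^w in the transform that follows.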

The coordinates of each EROI center p^w ∈ P_E are transformed from the world frame to the camera optical frame using the UAV’s position t^w and the world-to-camera rotation matrix R^c_w:

$$ p^{c} = R^{c}_{w}(p^{w} - t^{w}) = \begin{bmatrix} X \\ Y \\ Z \end{bmatrix} $$

These are then projected onto the image plane using the pinhole camera model with intrinsic matrix K:

$$ u = \frac{f_x X}{Z} + c_x, \quad v = \frac{f_y Y}{Z} + c_y $$
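
The two equations above can be sketched in a few lines of plain Python. This is an illustration, not the authors' implementation: the function names, the row-major 3x3 list used for R^c_w, and the visibility test (positive depth and inside the image bounds) are assumptions made for the example.

```python
def world_to_camera(p_w, t_w, R_cw):
    """Compute p_c = R_cw (p_w - t_w) with R_cw as a 3x3 row-major list."""
    d = [p_w[i] - t_w[i] for i in range(3)]
    return tuple(sum(R_cw[r][c] * d[c] for c in range(3)) for r in range(3))

def project(p_c, fx, fy, cx, cy, width, height):
    """Pinhole projection; returns (u, v), or None if the point is not visible."""
    X, Y, Z = p_c
    if Z <= 0:  # point lies behind the camera
        return None
    u = fx * X / Z + cx
    v = fy * Y / Z + cy
    if 0 <= u < width and 0 <= v < height:
        return (u, v)
    return None  # projects outside the image

# Camera at the world origin with identity rotation (illustrative values).
R_identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
p_c = world_to_camera((1.0, 0.5, 4.0), (0.0, 0.0, 0.0), R_identity)
uv = project(p_c, fx=400.0, fy=400.0, cx=320.0, cy=240.0, width=640, height=480)
```

With these values the point lands at pixel (420, 290), which is where its text label would be drawn.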

To create a compact and unambiguous visual representation, each visible EROI coordinate is encoded into a short character string (“Letter-Number-Letter”) via Algorithm 1. This process ensures that the VLM receives a single image I_t that contains both the visual scene and explicit, localized 3D coordinate anchors, enabling it to “see” and “locate” simultaneously.
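
Algorithm 1 is not reproduced here, so the following is only a toy encoding that produces strings of the same “Letter-Number-Letter” shape. It assumes each EROI is addressed by integer grid indices (ix, iy, iz); the mapping of ix and iz to letters and iy to a number is a hypothetical choice, not the published scheme.

```python
import string

LETTERS = string.ascii_uppercase

def encode(ix, iy, iz):
    """Toy encoding: letter for ix, number for iy, letter for iz."""
    if not (0 <= ix < 26 and 0 <= iz < 26 and iy >= 0):
        raise ValueError("index out of range for this toy encoding")
    return f"{LETTERS[ix]}{iy}{LETTERS[iz]}"

def decode(code):
    """Invert encode(): recover the grid indices from a label."""
    ix = LETTERS.index(code[0])
    iz = LETTERS.index(code[-1])
    iy = int(code[1:-1])
    return ix, iy, iz

code = encode(6, 5, 0)  # yields a label of the form used in the text
```

The property that matters is that the label is short enough to overlay on the image and can be decoded back to a unique region.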

Prompt Engineering for Zero-Shot Inference

To guide the VLM’s reasoning towards actionable cues for our China UAV drone, we designed a structured prompt based on a “Recognize-Evaluate-Transfer” logic. The prompt instructs the model to: first, identify the type of scene; second, evaluate its relevance to the target; and third, decide whether to move towards a detected cue or transfer to a new scene. The VLM processes the encoded image I_t and the prompt P* to output a structured language description:

$$ l_t \leftarrow f_{\theta}(I_t, P^*) = \langle H(p_k),\ d(p_k) \rangle $$

Here H(p_k) is the coordinate encoding (e.g., “G5A”) and d(p_k) is the open-vocabulary description (e.g., “yellow shared bicycle parking area”). Because each encoding decodes back to an EROI center, every description the VLM produces for a 2D image location is tied directly to a 3D world coordinate. Furthermore, the target object is a zero-shot parameter: a Large Language Model (LLM) can replace the target in the prompt template without any retraining, allowing our China UAV drone to search for novel objects.
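
To make the prompt-and-parse step concrete, here is a minimal sketch. The template wording, the `CODE: description` reply format, and the helper names `build_prompt` and `parse_cues` are all assumptions for illustration; the original prompt P* and the exact output syntax are not given in the text.

```python
import re

# Hypothetical template following the Recognize-Evaluate-Transfer order.
PROMPT_TEMPLATE = (
    "You are guiding a UAV that is searching for: {target}.\n"
    "1. Recognize: name the type of scene in the image.\n"
    "2. Evaluate: judge how relevant the scene is to the target.\n"
    "3. Transfer: either pick a labeled region to move toward, "
    "or say the UAV should move to a new scene.\n"
    "Answer with lines of the form CODE: description."
)

# Matches a Letter-Number-Letter code followed by a free-text description.
CUE_LINE = re.compile(r"^\s*([A-Z]\d+[A-Z])\s*:\s*(.+?)\s*$")

def build_prompt(target):
    """Swap the target into the template; no retraining is involved."""
    return PROMPT_TEMPLATE.format(target=target)

def parse_cues(reply):
    """Extract (code, description) pairs from the model's text reply."""
    cues = []
    for line in reply.splitlines():
        match = CUE_LINE.match(line)
        if match:
            cues.append((match.group(1), match.group(2)))
    return cues

prompt = build_prompt("bicycle helmet")
cues = parse_cues("Scene: campus road\nG5A: yellow shared bicycle parking area")
```

Changing the search target only changes the string passed to `build_prompt`, which is the sense in which the target is a zero-shot parameter.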

Open-Vocabulary Semantic Map and Subjective Bayesian Update

We construct an open-vocabulary semantic map M = {(p_k, d(p_k))} in real time. Each observation from the VLM is treated as strong evidence for a target at a specific location. To manage uncertainty over multiple observations and spatial scales, we employ a hierarchical value update using Subjective Bayesian probability. For an EROI e_j located within a neighborhood N_r(p_t) of a VLM detection, the semantic value s(e_j) is updated towards 1:

$$ s(e_j) \leftarrow \left(1 - \frac{1}{m}\right) s(e_j) + \frac{1}{m}, \quad \forall e_j \in N_r(p_t) $$

At a higher level, when aggregated open-vocabulary descriptions in a region r_n exceed a threshold, the LLM generates a regional score, which is used to update the probabilities of all EROIs in that region. This two-tiered, probabilistic update ensures that our China UAV drone’s belief in a target’s location is robustly refined over time and space.
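
The local update rule can be sketched as follows. The text does not define m or the shape of the neighborhood, so this example treats m as a plain smoothing parameter and N_r(p_t) as a Euclidean ball of radius r around the detection; both are assumptions, and the region labels are illustrative.

```python
import math

def update_semantic_values(values, centers, detection, radius, m):
    """Apply s <- (1 - 1/m) s + 1/m to every EROI within radius of detection.

    values:  dict mapping EROI label to its current semantic value in [0, 1]
    centers: dict mapping EROI label to its 3D center
    m:       smoothing parameter (assumed >= 1); larger m means smaller steps
    """
    if m < 1:
        raise ValueError("m must be at least 1")
    updated = dict(values)
    for key, center in centers.items():
        if math.dist(center, detection) <= radius:
            updated[key] = (1 - 1 / m) * values[key] + 1 / m
    return updated

centers = {"G5A": (0.0, 0.0, 0.0), "H5A": (10.0, 0.0, 0.0)}
values = {"G5A": 0.2, "H5A": 0.2}
values = update_semantic_values(
    values, centers, detection=(1.0, 0.0, 0.0), radius=2.0, m=4
)
```

Here the nearby region moves from 0.2 to 0.4 while the distant one is unchanged; repeated detections keep pulling the value toward 1 without ever exceeding it.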

Dynamic Decision-Making and Path Planning

The final search decision is made by a utility function that balances geometric exploration, semantic exploitation, and path cost. The core innovation is a dynamic path cost modulation mechanism based on the semantic value s(x_i) of a candidate node. The utility function U(γ) for a path γ from current position x_r to node x_i is defined as:

$$
U(\gamma) =
\begin{cases}
I(\gamma) \cdot s(x_i), & \text{if } s(x_i) > \tau \\
I(\gamma) \cdot s(x_i) \cdot e^{-\lambda L(\gamma)}, & \text{if } s(x_i) \leq \tau
\end{cases}
$$

In this function, I(γ) is the geometric information gain, L(γ) is the path length, and λ controls how strongly long paths are penalized. The dynamic threshold τ = k · max_{x_j ∈ V} s(x_j) is a fraction k of the largest semantic value among the candidate nodes V. When a node’s semantic value exceeds τ, the function drops the path cost penalty, allowing a “greedy” approach to quickly investigate the promising lead. This adaptive mechanism is the key to enabling a China UAV drone to efficiently switch between covering unknown space and exploiting valuable semantic cues.
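
A small sketch of this piecewise utility shows how the threshold changes the ranking of candidates. The candidate names, the numeric values, and the choices k = 0.8 and λ = 0.1 are illustrative assumptions, not settings reported in the experiments.

```python
import math

def utility(info_gain, s_value, path_length, tau, lam):
    """Piecewise utility: no path penalty when s_value exceeds tau."""
    if s_value > tau:
        return info_gain * s_value
    return info_gain * s_value * math.exp(-lam * path_length)

def best_node(candidates, k, lam):
    """Pick the candidate with the highest utility.

    candidates: list of dicts with keys 'name', 'gain', 's', 'length'.
    """
    tau = k * max(c["s"] for c in candidates)  # dynamic threshold
    return max(
        candidates,
        key=lambda c: utility(c["gain"], c["s"], c["length"], tau, lam),
    )

candidates = [
    {"name": "near_frontier", "gain": 10.0, "s": 0.2, "length": 5.0},
    {"name": "far_parking_area", "gain": 4.0, "s": 0.9, "length": 40.0},
]
choice = best_node(candidates, k=0.8, lam=0.1)
```

With these numbers the distant, semantically strong node wins because its path cost is ignored; if the exponential penalty were applied to it as well, the nearby frontier would be chosen instead.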

Experimental Results

We extensively validated our framework in three distinct Gazebo environments and real-world outdoor scenarios, comparing it against baseline autonomous exploration (FSMP) and target search (Star-Searcher) methods. The experiments were designed to test our method’s ability to guide a China UAV drone to find both object-centric and scene-level targets.

Simulation Performance Comparison

In all three simulated environments, our method achieved the shortest path length, the lowest search time, and the highest Success weighted by Path Length (SPL), with a 100% success rate that only FSMP matched.

Scenario	Method	Path Length (m)	Time (s)	Success (%)	SPL (%)
Road Vehicle (45x48x2 m³)	Ours	96.86 ± 13.46	113.85 ± 11.12	100	65.4
	FSMP	183.95 ± 71.54	202.48 ± 73.51	100	37.3
	Star-Searcher	283.04 ± 71.83	450.58 ± 171.30	60	15.9
Building Vehicle (40x40x2 m³)	Ours	38.93 ± 3.92	49.27 ± 3.72	100	77.4
	FSMP	156.43 ± 84.61	173.98 ± 88.15	100	29.8
	Star-Searcher	207.76 ± 181.66	238.99 ± 261.66	80	32.4
River Bridge (65x47x2 m³)	Ours	158.14 ± 26.99	183.50 ± 30.56	100	38.4
	FSMP	208.95 ± 79.86	233.50 ± 83.99	100	31.7
	Star-Searcher	212.41 ± 56.63	467.97 ± 173.03	60	20.0
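
For readers unfamiliar with SPL, the sketch below uses the definition common in the embodied-navigation literature: the mean over trials of success times the ratio of shortest path length to the larger of actual and shortest path length. The text does not state which variant was used, so this is an assumption, and the trial values are made up for illustration.

```python
def spl(trials):
    """Success weighted by Path Length over a list of trials.

    trials: list of (success, shortest_length, actual_length) tuples.
    """
    if not trials:
        raise ValueError("at least one trial is required")
    total = 0.0
    for success, shortest, actual in trials:
        if success:
            # A failed trial contributes zero regardless of path length.
            total += shortest / max(actual, shortest)
    return total / len(trials)

# Two successes (one twice as long as optimal, one optimal) and one failure.
score = spl([(True, 50.0, 100.0), (True, 50.0, 50.0), (False, 50.0, 80.0)])
```

Under this definition a method is penalized both for failing and for succeeding along an unnecessarily long route.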

The results show that our approach substantially reduces the path length and time required for a China UAV drone to find the target. For instance, in the “Road Vehicle” environment, our method achieves a roughly 47% reduction in path length (96.86 m versus 183.95 m) and a 44% reduction in time (113.85 s versus 202.48 s) compared to FSMP, the best baseline. This efficiency is attributed to our method’s ability to reason about semantic context, such as prioritizing parking areas for a vehicle, rather than performing an exhaustive geometric sweep.
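
The relative reductions can be recomputed from the mean values in the Road Vehicle rows of the table, taking FSMP as the best baseline. The helper name `relative_reduction` is introduced here only for the example.

```python
def relative_reduction(baseline, ours):
    """Fraction by which `ours` improves on `baseline` (lower is better)."""
    return (baseline - ours) / baseline

# Mean values from the Road Vehicle rows: FSMP versus Ours.
path_cut = relative_reduction(183.95, 96.86)
time_cut = relative_reduction(202.48, 113.85)
print(f"path: {path_cut:.1%}, time: {time_cut:.1%}")
```

The same helper applies to the other scenarios by swapping in the corresponding table entries.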

Ablation Studies

We conducted a thorough ablation study to quantify the contribution of each module. Each component was removed or simplified while keeping the rest of the system constant.

Scenario	Ablation Setting	Path (m)	Time (s)	SPL (%)
Road Vehicle	(i) Full System	96.86	113.85	65.4
	(ii) Weight-Decay Mapping	115.29	132.93	54.7
	(iii) No Dynamic Cost	106.29	121.96	61.0
	(iv) Simplified Prompt	117.28	137.85	56.0
	(v) No Semantic Map LLM	139.47	161.19	50.4
River Bridge	(i) Full System	158.14	183.50	38.4
	(ii) Weight-Decay Mapping	265.70	295.79	31.5
	(iii) No Dynamic Cost	120.49	137.31	49.9
	(iv) Simplified Prompt	142.70	164.63	43.0
	(v) No Semantic Map LLM	174.74	200.69	36.3

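The ablation results highlight the importance of the Spatio-Visual Inverse Mapping (replacing it with the 2D weight-decay method lengthened the path in both scenarios) and of the Open-Vocabulary Map-based reasoning (removing the semantic-map LLM also degraded every metric in both scenarios). The dynamic path cost modulation had a context-dependent effect. In the Road Vehicle scenario it helped: removing it increased the path from 96.86 m to 106.29 m. In the River Bridge scenario it hurt: the variant without it was shorter (120.49 m versus 158.14 m) and scored a higher SPL (49.9% versus 38.4%), and the simplified prompt also outperformed the full system there. This highlights the need for careful tuning of the threshold parameter k for a practical China UAV drone application.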

Real-World Demonstration

To validate the practicality and robustness of our algorithm, we deployed it on a custom-built China UAV drone in a real outdoor environment. The target objects were a “bicycle helmet” (an object) and an “umbrella” (a scene-defined object). The experiments were designed to test the zero-shot generalization ability of the system.

For the “bicycle helmet” search, the VLM, guided by our prompt, recognized cues like “yellow shared bicycle” and “bicycle parking area.” By prioritizing these semantically rich regions, the China UAV drone navigated around obstacles and located the helmet placed on a bicycle rack. For the “umbrella” search, the system inferred that an umbrella was likely near “building entrances” or “temporary shade structures,” and found it in a shadowed corner. These experiments indicate that our framework transfers from simulation to a real platform and enables a China UAV drone to perform zero-shot semantic search.

Conclusion

We have presented a framework that integrates multimodal large model reasoning with autonomous exploration to address zero-shot target search for China UAV drones. By introducing a Spatio-Visual Inverse Mapping technique, a “Recognize-Evaluate-Transfer” prompting logic, and an adaptive decision-making mechanism, we have enabled a UAV to balance spatial exploration and semantic exploitation. Our experiments in both simulation and the real world demonstrate improvements in search efficiency over the baselines, along with zero-shot generalization to new targets. The ability of a China UAV drone to understand its environment at a higher semantic level and act upon that understanding marks a step forward in autonomous navigation. Our current work focuses on static single-target search; future work will extend the framework to multi-target coordination, dynamic targets, and lightweight model deployment, further strengthening the capabilities of China UAV drones in critical applications.