Unmanned Aerial Vehicle Cruise Risk Identification Technology Based on Multi-Source Data and Large Models

In recent years, the concept of urban air mobility centered on low-altitude development has gained traction, with the low-altitude economy poised to become a new engine for economic growth. However, ensuring the safe operation of aerial vehicles remains a critical challenge. Incidents such as the collision of a fixed-wing test Unmanned Aerial Vehicle with a sports center in Jingzhou, China, in December 2024, resulting in severe injuries, highlight the urgency of addressing risks during cruise phases. These risks are characterized by their unpredictability, complexity, and potential for severe consequences, especially as operations expand from open fields to urban, mountainous, and aquatic environments. This paper explores a novel approach to risk identification for Unmanned Aerial Vehicles, particularly micro, light, and small drones operating in Class W airspace, by leveraging multi-source data and large models to enhance safety and efficiency.

The cruise phase of an Unmanned Aerial Vehicle, defined as the interval from ascent completion to descent initiation, involves navigating dynamic and often cluttered airspace. Risks during this phase include collisions with static and dynamic obstacles, environmental hazards such as extreme weather, technical failures, human errors, communication disruptions, regulatory non-compliance, and malicious attacks. For instance, analysis of 190 drone safety incidents from the Aviation Safety Network reveals that external factors in Class W airspace predominantly lead to collisions with objects such as buildings, trees, poles, wires, birds, and other aircraft. This study therefore focuses on collision risks, decomposing them into key elements: the relative spatial information between the Unmanned Aerial Vehicle and obstacles, including position, distance, size, and velocity. A prompt template is established to guide large models in risk assessment, incorporating features such as obstacle category, location, distance, dimensions, and motion data for static, dynamic, and sudden-risk obstacles, as sketched below. This structured approach enables comprehensive risk identification by highlighting the critical parameters for analysis.
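
A minimal sketch of such a template, in Python, might look as follows; the field names (`scene_context`, `dynamic_context`, `pointcloud_context`, `build_prompt`) and the exact wording are illustrative assumptions rather than the system's actual template:

```python
# Hypothetical prompt template for UAV cruise risk assessment.
# The three context fields mirror the modules described later in the text;
# the wording and field names are illustrative only.
RISK_PROMPT_TEMPLATE = """You are a UAV cruise-risk analyst.
UAV state: position {uav_position}, speed {uav_speed} m/s, heading {uav_heading} deg.

Static obstacles (category, distance m, position, confidence):
{scene_context}

Dynamic obstacles (category, distance m, relative speed m/s, direction):
{dynamic_context}

Sudden obstacles from LiDAR (distance m, size m, volume m^3, deformation rate):
{pointcloud_context}

For each obstacle, output: risk content, risk level (low/medium/high),
and a recommended avoidance action."""

def build_prompt(uav_state: dict, scene: str, dynamic: str, pointcloud: str) -> str:
    """Fill the template with the latest sensor-derived context strings."""
    return RISK_PROMPT_TEMPLATE.format(
        uav_position=uav_state["position"],
        uav_speed=uav_state["speed"],
        uav_heading=uav_state["heading"],
        scene_context=scene,
        dynamic_context=dynamic,
        pointcloud_context=pointcloud,
    )
```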

To integrate multi-source data effectively, we analyze multimodal large models, which typically consist of modality encoders, input mappers, a language model backbone, output mappers, and modality generators. Training involves pre-training with cross-modal datasets (e.g., image-text pairs) and fine-tuning with instruction-based data to enhance generalization. Three fusion schemes are considered for the prompt generation model: Scheme 1 processes modalities independently and fuses features via voting or concatenation; Scheme 2 maps encoded features into a unified language space for cross-attention learning; and Scheme 3 employs a multimodal Transformer for end-to-end cross-modal interaction. Given the computational efficiency and real-time requirements of Unmanned Aerial Vehicle applications, Scheme 1 is adopted, as it minimizes training data and computational demands while supporting incremental updates. This scheme aligns with the need for rapid deployment in dynamic, resource-constrained environments. The comparison of these schemes is summarized in Table 1, which highlights their training data requirements, computational load, and generalization capabilities; a minimal sketch of the adopted late-fusion arrangement follows the table.

Table 1: Comparison of Data Fusion Schemes for Prompt Generation

| Scheme | Training Data Requirement | Computational Load (TFLOPS) | Generalization Ability | Key Advantages |
|---|---|---|---|---|
| Scheme 1 | Low (uses pre-trained single-modal models) | 0.8 | Moderate (depends on module design) | Computationally efficient, easy deployment, supports incremental updates |
| Scheme 2 | High (requires cross-modal alignment data) | 2.5 | Strong (enables inter-modal learning) | Deep cross-modal understanding, supports complex semantic reasoning |
| Scheme 3 | Very high (massive multimodal data) | 5.0+ | Very strong (end-to-end adaptation) | Full-modal joint optimization, excellent few-shot generalization |
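
As a rough illustration of the adopted Scheme 1, the sketch below runs each modality through its own pre-trained module and fuses the per-modality outputs by simple concatenation; the `ModalityOutput` structure and function names are hypothetical, not the paper's implementation:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class ModalityOutput:
    """Result produced by one independent, pre-trained single-modal module."""
    name: str            # e.g. "scene", "dynamic", "pointcloud"
    text_fragment: str   # serialized context produced by that module

def late_fusion(outputs: List[ModalityOutput]) -> str:
    """Scheme 1: fuse independently produced modality outputs by concatenation.
    No cross-modal training is required; each module can be updated on its own."""
    return "\n".join(f"[{o.name}] {o.text_fragment}" for o in outputs)

def run_pipeline(modules: List[Callable[[], ModalityOutput]]) -> str:
    """Run each single-modal module, then concatenate their outputs into one prompt body."""
    return late_fusion([m() for m in modules])
```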

The prompt generation model is constructed from three integrated modules: macroscopic scene description, dynamic scene supplementation, and sudden risk detection. This modular design allows efficient processing of data from monocular cameras, depth cameras, and LiDAR sensors, addressing the limitations of single sensors such as occlusion, lighting dependence, and sparse data. For macroscopic scene description, the Owl-ViT model is employed for open-vocabulary object detection, leveraging its Vision Transformer architecture and contrastive learning to identify static obstacles such as buildings and trees without extensive retraining. The model processes RGB images and outputs bounding boxes with categories, positions, distances, and confidence scores. The distance $d$ to an obstacle is calculated from depth maps, with the error margin derived from sensor accuracy. For each detected object, the confidence score $c$ and spatial coordinates $(x, y, z)$ are integrated into the prompt. The scene context can be represented as:

$$ \text{scene\_context} = \sum_{i=1}^{N} \left[ \text{category}_i, d_i, c_i, (x_i, y_i, z_i) \right] $$

where $N$ is the number of top-detected obstacles (e.g., up to 50). This module achieves an accuracy of approximately 75% with an average latency of 60-80 ms on an RTX 4090 GPU, making it suitable for real-time applications in diverse environments such as suburban, urban, and night scenes.
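
A minimal sketch of the macroscopic scene description step, assuming the Hugging Face `transformers` implementation of Owl-ViT and an aligned depth map for distance lookup (the checkpoint name, query list, and threshold are illustrative choices):

```python
import numpy as np
import torch
from PIL import Image
from transformers import OwlViTProcessor, OwlViTForObjectDetection

# Illustrative checkpoint and open-vocabulary queries for static obstacles.
processor = OwlViTProcessor.from_pretrained("google/owlvit-large-patch14")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-large-patch14")
queries = [["building", "tree", "pole", "wire", "bird", "aircraft"]]

def describe_scene(image: Image.Image, depth_map: np.ndarray, top_k: int = 50) -> list:
    """Detect static obstacles and attach depth-derived distances (scene_context)."""
    inputs = processor(text=queries, images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
    results = processor.post_process_object_detection(
        outputs=outputs, threshold=0.2, target_sizes=target_sizes)[0]

    scene_context = []
    for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
        x0, y0, x1, y1 = box.tolist()
        cx, cy = int((x0 + x1) / 2), int((y0 + y1) / 2)
        # Simplification: distance d read from the aligned depth map at the box centre.
        d = float(depth_map[min(cy, depth_map.shape[0] - 1),
                            min(cx, depth_map.shape[1] - 1)])
        scene_context.append({
            "category": queries[0][int(label)],
            "confidence": round(float(score), 2),
            "distance_m": round(d, 2),
            "bbox": [round(v, 1) for v in (x0, y0, x1, y1)],
        })
    # Keep the top-N detections by confidence, as in the text (N up to 50).
    scene_context.sort(key=lambda o: o["confidence"], reverse=True)
    return scene_context[:top_k]
```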

Dynamic scene supplementation uses the ByteTrack algorithm to track moving objects such as birds or other drones. ByteTrack associates detection boxes across frames using IoU matching and Kalman filtering, incorporating both high- and low-score detections to reduce ID switches and missed tracks. The relative velocity $v$ of a dynamic obstacle is computed from position changes over consecutive frames, with the direction vector $\vec{v}$ determined from coordinate transformations. The dynamic context is formulated as:

$$ \text{dynamic\_context} = \sum_{j=1}^{M} \left[ \text{category}_j, d_j(t), v_j(t), \vec{v}_j(t) \right] $$

where $M$ is the number of tracked objects and $t$ denotes the timestamp. This module adds only 5-10 ms of latency on top of detection, providing stable tracking with accuracy that depends on the underlying Owl-ViT detector. Combining semantic information from Owl-ViT with motion data from ByteTrack yields a comprehensive view of dynamic risks without requiring an additional ReID model.
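
Once ByteTrack supplies stable track IDs, the velocity estimate itself reduces to finite differences over the tracked 3D positions; a minimal sketch with a hypothetical track-history structure (not ByteTrack's own API) is:

```python
import numpy as np
from collections import defaultdict, deque

# Hypothetical per-track history of 3D positions (from depth / coordinate transform),
# keyed by the ID that ByteTrack assigns to each tracked obstacle.
track_history = defaultdict(lambda: deque(maxlen=10))  # track_id -> deque of (t, xyz)

def update_track(track_id: int, t: float, xyz) -> None:
    """Record the latest timestamped position for a tracked obstacle."""
    track_history[track_id].append((t, np.asarray(xyz, dtype=float)))

def relative_velocity(track_id: int):
    """Estimate speed v and unit direction vector from consecutive frame positions."""
    hist = track_history[track_id]
    if len(hist) < 2:
        return None
    (t0, p0), (t1, p1) = hist[-2], hist[-1]
    dt = t1 - t0
    if dt <= 0:
        return None
    vel = (p1 - p0) / dt            # relative velocity vector (m/s)
    speed = float(np.linalg.norm(vel))
    direction = vel / speed if speed > 1e-6 else np.zeros(3)
    return {"v_mps": round(speed, 2), "direction": direction.round(3).tolist()}
```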

Sudden risk detection addresses obstacles that enter the safety zone unexpectedly, using LiDAR point cloud data. The processing pipeline includes region constraint, noise removal via statistical filtering, and downsampling. A conical primary region of ±60° around the flight direction is defined, with voxel downsampling applied adaptively based on distance. DBSCAN clustering groups the point cloud into obstacles, with parameters such as the neighborhood radius $r$ and minimum point count $\text{minPts}$ adjusted dynamically based on altitude and speed. For example, $r$ may range from 0.1 m indoors to 1 m outdoors, and $\text{minPts}$ from 5 for sparse clouds to 50 for dense ones. Multi-frame association keeps cluster IDs stable by matching clusters on distance thresholds, motion consistency, and shape similarity. The deformation rate $\delta$ of an obstacle, indicating its rigidity, is computed over 5 frames as:

$$ \delta = 0.6 \cdot \delta_{\text{volume}} + 0.4 \cdot \delta_{\text{aspect ratio}} $$

where $\delta_{\text{volume}}$ and $\delta_{\text{aspect ratio}}$ are the changes in volume and aspect ratio, respectively. If $\delta > 0.4$, the object is classified as soft and treated as a lower threat. The point cloud context outputs attributes such as distance, size, volume, and aspect ratio for each cluster, with an accuracy of 85% and a latency of 25-35 ms. This module captures unforeseen obstacles, complementing the vision-based modules.
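
A minimal sketch of the clustering and deformation-rate check, using scikit-learn's DBSCAN and axis-aligned bounding boxes as simplifications; how the per-frame volume and aspect-ratio changes are normalized here is an assumption, since only the 0.6/0.4 weighting and 5-frame window are specified above:

```python
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_obstacles(points: np.ndarray, r: float = 0.5, min_pts: int = 10) -> list:
    """Group a filtered, downsampled point cloud (N x 3) into obstacle clusters."""
    labels = DBSCAN(eps=r, min_samples=min_pts).fit_predict(points)
    clusters = []
    for k in set(labels) - {-1}:          # label -1 marks noise points
        pts = points[labels == k]
        extent = pts.max(axis=0) - pts.min(axis=0)   # axis-aligned bounding-box size
        volume = float(np.prod(np.maximum(extent, 1e-3)))
        aspect = float(extent.max() / max(extent.min(), 1e-3))
        clusters.append({"size_m": extent.round(2).tolist(),
                         "volume_m3": round(volume, 3),
                         "aspect_ratio": round(aspect, 2)})
    return clusters

def deformation_rate(volumes: list, aspects: list) -> float:
    """delta = 0.6 * volume change + 0.4 * aspect-ratio change over the last 5 frames
    of one ID-matched cluster; normalization by the mean is an illustrative choice."""
    v, a = np.asarray(volumes[-5:]), np.asarray(aspects[-5:])
    d_vol = float((v.max() - v.min()) / max(v.mean(), 1e-6))
    d_asp = float((a.max() - a.min()) / max(a.mean(), 1e-6))
    return 0.6 * d_vol + 0.4 * d_asp

def is_soft(volumes: list, aspects: list, threshold: float = 0.4) -> bool:
    """Objects whose shape changes noticeably across frames are treated as soft."""
    return deformation_rate(volumes, aspects) > threshold
```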

Table 2: Performance Metrics of Detection and Tracking Modules

| Module | Implementation Model | Average Latency (ms) | Recognition Accuracy (%) | Key Outputs |
|---|---|---|---|---|
| Macroscopic Scene Description | Owl-ViT-Large | 60-80 | 75 | Category, distance, position, confidence |
| Dynamic Scene Supplementation | Owl-ViT + ByteTrack | +5-10 (on top of detection) | 75 (depends on detector) | Velocity, direction, tracked ID |
| Sudden Risk Detection | Point cloud processing | 25-35 | 85 | Size, volume, deformation rate |
| Risk Judgment | Deepseek | 40-50 | 80 | Risk level, content, safety advice |

The outputs from these modules are concatenated into a unified prompt, which is fed to the Deepseek large language model for comprehensive risk analysis. Deepseek processes the prompt to generate detailed risk assessments, including risk content, severity levels, and safety recommendations. For example, it might output: “Risk identified: Dynamic obstacle (bird) approaching at high speed. Risk level: High. Recommendation: Evade horizontally or ascend.” This integration leverages Deepseek’s reasoning capabilities and knowledge base, achieving 80% accuracy in risk judgment with a latency of 40-50 ms. The total latency for risk identification, combining parallel execution of the dynamic and sudden risk modules (70-85 ms) with Deepseek inference (40-50 ms), ranges from 110 to 135 ms. Through model quantization and edge deployment, this can be reduced to 50-80 ms, meeting the latency tolerance of low-speed Unmanned Aerial Vehicle cruise. The end-to-end delay breakdown is provided in Table 3, which illustrates how sensor data acquisition overlaps with risk identification to minimize total latency; a minimal sketch of the prompt-to-judgment step follows the table.

Table 3: End-to-End Latency Breakdown for Obstacle Avoidance

| Stage | Subtask | Latency Range (ms) | Influencing Factors and Optimization |
|---|---|---|---|
| Sensor Data Acquisition | Camera image transmission, LiDAR point cloud generation | 5-20 | Sensor hardware performance |
| Risk Identification | Macroscopic scene detection, sudden obstacle detection | 50-80 | Model quantization, hardware acceleration (e.g., TensorRT) |
| Decision Planning | Path replanning, dynamic obstacle prediction | 10-30 | Algorithm complexity |
| Control Execution | Motor response, attitude adjustment | 5-15 | Flight controller performance |
| Total Latency | | 70-145 | |
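
As an illustration of the prompt-to-judgment step described above, a minimal sketch that assembles the unified prompt and queries Deepseek through its OpenAI-compatible chat endpoint (the helper names are illustrative; key handling and temperature are assumptions):

```python
from openai import OpenAI

# Deepseek exposes an OpenAI-compatible endpoint; key handling here is illustrative.
client = OpenAI(api_key="YOUR_DEEPSEEK_API_KEY", base_url="https://api.deepseek.com")

def judge_risk(prompt: str) -> str:
    """Send the concatenated scene/dynamic/point-cloud prompt for risk judgment."""
    response = client.chat.completions.create(
        model="deepseek-chat",
        messages=[
            {"role": "system",
             "content": "You assess UAV cruise collision risks and reply with "
                        "risk content, risk level, and a safety recommendation."},
            {"role": "user", "content": prompt},
        ],
        temperature=0.2,   # keep judgments stable and repeatable
    )
    return response.choices[0].message.content

# Example (using build_prompt from the earlier sketch):
# advice = judge_risk(build_prompt(uav_state, scene_str, dynamic_str, cloud_str))
```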

To operationalize this technology, an Unmanned Aerial Vehicle cruise risk identification system was developed, featuring four interfaces: real-time monitoring, device management, task execution, and user configuration. The real-time monitoring interface visualizes detection and tracking results, displaying scene_context, dynamic_context, and pointcloud_context, and integrates device information, task details, weather data, and system status to enrich the prompt for Deepseek. The device management interface handles parameters such as Unmanned Aerial Vehicle type, dimensions, weight, endurance, minimum turn radius, maximum altitude, and maximum speed. The task execution interface manages active missions, including task type, alert distance, brake distance, and vertical/horizontal obstacle avoidance thresholds. These interfaces ensure that Deepseek has the contextual knowledge needed for accurate risk assessment. The user configuration module allows mode selection, API setup for Deepseek, and parameter tuning based on experimental or scenario-specific needs. This system not only enables real-time risk analysis but also supports device and task management, enhancing the practicality of the JUYE UAV platform in various operational contexts.
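
One possible way to hold the device and task parameters listed above is a pair of small configuration records; the field names and units below are hypothetical, not the system's actual schema:

```python
from dataclasses import dataclass

@dataclass
class DeviceConfig:
    """Device-management parameters described above (assumed units: m, kg, min, m/s)."""
    uav_type: str
    dimensions_m: tuple        # (length, width, height)
    weight_kg: float
    endurance_min: float
    min_turn_radius_m: float
    max_altitude_m: float
    max_speed_mps: float

@dataclass
class TaskConfig:
    """Task-execution parameters used to contextualize Deepseek's risk judgment."""
    task_type: str
    alert_distance_m: float
    brake_distance_m: float
    vertical_avoid_threshold_m: float
    horizontal_avoid_threshold_m: float
```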

In conclusion, this study deconstructs Unmanned Aerial Vehicle cruise risks in Class W airspace, emphasizing collision-related elements arising from static, dynamic, and sudden obstacles. By adopting a multi-module approach to multi-source data fusion, we leverage Owl-ViT for static scene description, ByteTrack for dynamic object tracking, and point cloud processing for sudden risk detection, all guided by a prompt template that enables Deepseek’s risk judgment. The results demonstrate effective risk identification across multiple scenarios, with latency optimized for real-time operation. The developed system provides a scalable platform for risk analysis and early warning, supporting the safe integration of Unmanned Aerial Vehicles into complex environments. Future work will focus on further model optimization to reduce latency and enhance adaptability for high-speed operations, ensuring that the JUYE UAV platform and similar systems can thrive in the evolving low-altitude economy.
