Real-time Anti-UAV Early Warning System Based on Intelligent Visual Detection and Semantic Analysis

The proliferation of Unmanned Aerial Vehicles (UAVs) has introduced significant security vulnerabilities in low-altitude airspace. Incidents involving unauthorized flights (“black flights”) and illegal reconnaissance near sensitive sites pose escalating threats to national security and public safety. Effective countermeasures require systems capable of real-time, accurate detection and robust warning communication. Traditional optical detection methods, while cost-effective, often struggle in communication-constrained environments (e.g., remote borders) due to the high bandwidth required for transmitting raw video or images for real-time warning. To address this critical gap in anti-UAV technology, this paper proposes an intelligent, multimodal early warning system. The system integrates the high-speed YOLOv10 object detection framework with the semantic comprehension capabilities of the GLM-4V multimodal large language model. This fusion creates a technological pipeline of “visual detection-semantic analysis-dynamic transmission,” enabling reliable real-time warning even under bandwidth-limited conditions. The core innovation lies in dynamically adapting the warning output: delivering structured graphical data under normal bandwidth and switching to lightweight, semantically rich text descriptions when network conditions are poor, thus ensuring continuous anti-UAV surveillance capability.

The overall architecture of the proposed real-time anti-UAV early warning system is designed for robustness and adaptability. It consists of a hardware suite for data acquisition and a software module for intelligent processing. The hardware layer employs a pan-tilt unit (PTU) equipped with dual-spectrum cameras—a visible-light camera for daytime high-resolution imaging and an infrared thermal camera for low-visibility or nighttime operations. This sensor data is streamed to a central computing unit which hosts the core algorithms and client software. The software layer is responsible for the complete processing pipeline: capturing video streams, executing real-time drone detection and tracking, generating multimodal semantic descriptions, and providing an intuitive user interface for control and display. The system’s operational workflow is cyclical: optical detection captures scene data; algorithms detect, locate, and track UAVs; semantic models generate contextual descriptions; and the PTU can be guided to follow the target, ensuring persistent monitoring. This closed-loop design is fundamental to creating an effective and autonomous anti-UAV sentinel.

Algorithmic Core: Detection, Tracking, and Semantic Transformation

At the heart of the anti-UAV warning system is a streamlined algorithmic stack designed for speed and accuracy. For the primary task of locating UAVs in video frames, we employ the YOLOv10 (You Only Look Once, version 10) model. Chosen for its balance between precision and computational efficiency, YOLOv10 performs single-shot detection, providing bounding box coordinates and confidence scores for UAVs in real time. To maintain target identity and track movement across frames, a critical requirement for assessing threat trajectory, the detections from YOLOv10 are fed into the SORT (Simple Online and Realtime Tracking) algorithm. SORT is a pragmatic, high-speed tracking framework that associates detections between frames using a combination of the Kalman filter and the Hungarian algorithm.
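The association step can be illustrated with a minimal sketch. It assumes intersection-over-union (IoU) between predicted track boxes and new detections as the matching score, as in the original SORT formulation; the paper does not state the cost used in this system. For brevity the sketch finds the best assignment by brute force over permutations, which yields the same optimum as the Hungarian algorithm on tiny inputs but does not scale. The function names and the `iou_min` threshold are hypothetical.

```python
from itertools import permutations


def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def associate(tracks, detections, iou_min=0.3):
    """Return sorted (track_index, detection_index) pairs with maximal total IoU.

    Brute-force stand-in for the Hungarian algorithm (assumption: inputs are
    small). Pairs whose IoU falls below iou_min are treated as unmatched.
    """
    n_t, n_d = len(tracks), len(detections)
    if n_t == 0 or n_d == 0:
        return []
    # Enumerate one-to-one assignments of the smaller set into the larger one.
    if n_t <= n_d:
        candidates = (list(zip(range(n_t), p))
                      for p in permutations(range(n_d), n_t))
    else:
        candidates = (list(zip(p, range(n_d)))
                      for p in permutations(range(n_t), n_d))

    def total_iou(pairs):
        return sum(iou(tracks[t], detections[d]) for t, d in pairs)

    best = max(candidates, key=total_iou)
    return sorted((t, d) for t, d in best
                  if iou(tracks[t], detections[d]) >= iou_min)


# Two predicted track boxes and three detections (the last one is a new object).
tracks = [(0, 0, 10, 10), (50, 50, 60, 60)]
detections = [(51, 51, 61, 61), (1, 1, 11, 11), (200, 200, 210, 210)]
matches = associate(tracks, detections)
```

In a production tracker the brute-force search would be replaced by a proper Hungarian solver; unmatched detections would start new tracks and unmatched tracks would eventually be dropped.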

The Kalman filter predicts the next state of each tracked target from its previous state estimate and a motion model. Its recursive process involves two main steps: prediction and update. The prediction step estimates the target’s next state and the uncertainty of that estimate.

The state prediction is given by:
$$\hat{x}_t^{-} = F\hat{x}_{t-1} + Bu_{t-1}$$
where $\hat{x}_t^{-}$ is the prior state estimate (e.g., predicted position and velocity), $F$ is the state transition matrix, $\hat{x}_{t-1}$ is the previous optimal estimate, $B$ is the control input matrix, and $u_{t-1}$ is the control vector.

The prior estimate covariance is:
$$P_t^{-} = FP_{t-1}F^T + Q$$
where $P_t^{-}$ is the prior error covariance and $Q$ is the process noise covariance.

The update step corrects the prediction using the new measurement $Z_t$. First, the Kalman Gain $K_t$ is calculated, which determines the weight given to the measurement versus the prediction:
$$K_t = P_t^{-}H^T\left(HP_t^{-}H^T + R\right)^{-1}$$
Here, $H$ is the measurement matrix that maps the state to the measurement space, and $R$ is the measurement noise covariance.

This gain is then used to fuse the prediction with the measurement to produce the optimal posterior state estimate:
$$\hat{x}_t = \hat{x}_t^{-} + K_t(Z_t - H\hat{x}_t^{-})$$
Finally, the posterior error covariance is updated:
$$P_t = (I - K_tH)P_t^{-}$$
This efficient filtering allows SORT to maintain smooth and consistent tracks for fast-moving UAVs, forming a reliable foundation for the anti-UAV system’s situational awareness.
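The prediction and update equations can be traced in a short sketch. It is not the system's implementation: it assumes a one-dimensional constant-velocity model (state = position and velocity), a scalar position measurement, no control input (so the $Bu_{t-1}$ term is dropped), and illustrative values for $Q$, $R$, and the initial covariance. Because the measurement is scalar, the matrix inverse in the gain reduces to a division.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(col) for col in zip(*a)]


def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def sub(a, b):
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


# Assumed model: constant velocity, one frame per time step.
DT = 1.0
F = [[1.0, DT], [0.0, 1.0]]          # state transition matrix
H = [[1.0, 0.0]]                      # only position is measured
Q = [[0.01, 0.0], [0.0, 0.01]]        # process noise covariance (illustrative)
R = 1.0                               # measurement noise variance (illustrative)


def predict(x, P):
    """Prior state and covariance: x^- = F x, P^- = F P F^T + Q."""
    x_prior = matmul(F, x)            # B u term omitted: no control input assumed
    P_prior = add(matmul(matmul(F, P), transpose(F)), Q)
    return x_prior, P_prior


def update(x_prior, P_prior, z):
    """Posterior state and covariance given the scalar measurement z."""
    PHt = matmul(P_prior, transpose(H))            # 2x1
    S = matmul(H, PHt)[0][0] + R                   # scalar H P^- H^T + R
    K = [[row[0] / S] for row in PHt]              # Kalman gain, 2x1
    residual = z - matmul(H, x_prior)[0][0]        # Z_t - H x^-
    x_post = [[xp[0] + k[0] * residual] for xp, k in zip(x_prior, K)]
    identity = [[1.0, 0.0], [0.0, 1.0]]
    P_post = matmul(sub(identity, matmul(K, H)), P_prior)
    return x_post, P_post, K


def track(measurements):
    """Run predict/update over a sequence of position measurements."""
    x = [[0.0], [0.0]]                             # initial position, velocity
    P = [[1.0, 0.0], [0.0, 1.0]]                   # initial covariance (illustrative)
    for z in measurements:
        x, P = predict(x, P)
        x, P, _ = update(x, P, z)
    return x, P


# A target moving one pixel per frame: the velocity estimate should approach 1.
x_final, P_final = track([float(t) for t in range(1, 21)])
```

SORT itself tracks a bounding-box state (center, scale, aspect ratio, and their velocities) rather than a single coordinate, but the recursion has the same form.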

Multimodal Semantic Analysis for Adaptive Warning

The defining innovation of our anti-UAV system is its ability to transform visual data into actionable, bandwidth-efficient intelligence. Relying solely on transmitting bounding box coordinates offers limited context. Conversely, sending full images or video is often impractical. Our solution leverages a multimodal large language model (MLLM), specifically GLM-4V, to generate rich textual descriptions of the scene. This serves a dual purpose: it provides human-interpretable context (e.g., “a dark UAV with four rotors hovering above a wooded area near a communication tower”), and more importantly, it creates a highly compressible data stream for transmission in low-bandwidth scenarios, ensuring the anti-UAV warning reaches the command center.
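The switching behavior can be sketched as follows. This is only an illustration under stated assumptions: the paper does not specify how bandwidth is measured or where the switching threshold lies, so `bandwidth_kbps`, the 256 kbps threshold, the function name, and the payload layout are all hypothetical.

```python
import json

# Hypothetical switching point; the paper does not report a threshold.
TEXT_MODE_THRESHOLD_KBPS = 256.0


def encode_warning(bandwidth_kbps, detection, description, frame_jpeg,
                   threshold_kbps=TEXT_MODE_THRESHOLD_KBPS):
    """Return (mode, payload_bytes) for the current link quality.

    With enough bandwidth, send the box, confidence, and the encoded frame;
    otherwise send only the UTF-8 text description.
    """
    if bandwidth_kbps >= threshold_kbps:
        header = json.dumps({"box": detection["box"],
                             "confidence": detection["confidence"]})
        return "graphical", header.encode("utf-8") + b"\n" + frame_jpeg
    return "text", description.encode("utf-8")


detection = {"box": [100, 80, 140, 110], "confidence": 0.93}
description = ("A dark UAV with four rotors hovering above a wooded area "
               "near a communication tower.")
frame_jpeg = bytes(50_000)  # placeholder standing in for an encoded frame

mode_good, payload_good = encode_warning(1000.0, detection, description, frame_jpeg)
mode_poor, payload_poor = encode_warning(32.0, detection, description, frame_jpeg)
```

With these placeholder inputs the text payload is under a hundred bytes while the graphical payload is dominated by the frame, which is the size gap the adaptive strategy relies on.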

To ensure the generated text is structured, relevant, and reliable, we employ a carefully engineered Prompt Template. Without guidance, MLLMs can produce verbose or irrelevant descriptions. Our template structures the input to the model, instructing it to describe specific aspects in a logical order. For scene context, a template might be: “This image shows a scene with {sky description}. In the foreground, there is {foreground description}. In the background, {background description} is visible.” The detected UAV’s cropped image is analyzed with another prompt: “Describe this object in detail, focusing on its color, shape, number of rotors, and any distinctive markings.” Finally, the UAV’s positional data from the tracker is formatted textually: “A UAV is detected in the upper-left quadrant of the field of view, with pixel coordinates (x, y) relative to the image center.” The synthesis of these three text blocks—scene context, UAV description, and UAV location—forms a comprehensive, lightweight semantic warning report. This adaptive output strategy is key to the system’s versatility, allowing it to function as a high-fidelity anti-UAV sensor or a low-bandwidth semantic sentinel as required.
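The assembly of the three text blocks might look like the sketch below. The template wording is taken from the text above; the placeholder names, helper functions, and the pixel-axis convention (x to the right, y downward, so negative offsets mean left and up) are assumptions. The call to GLM-4V is deliberately left out, so the scene and UAV descriptions appear here as plain strings that the model would normally return.

```python
SCENE_TEMPLATE = ("This image shows a scene with {sky_description}. "
                  "In the foreground, there is {foreground_description}. "
                  "In the background, {background_description} is visible.")

UAV_PROMPT = ("Describe this object in detail, focusing on its color, shape, "
              "number of rotors, and any distinctive markings.")


def location_sentence(cx, cy, width, height):
    """Describe a box center (cx, cy) relative to the image center.

    Assumes pixel coordinates with the origin at the top-left corner.
    """
    dx, dy = cx - width / 2, cy - height / 2
    vertical = "upper" if dy < 0 else "lower"
    horizontal = "left" if dx < 0 else "right"
    return (f"A UAV is detected in the {vertical}-{horizontal} quadrant of the "
            f"field of view, with pixel coordinates ({dx:.0f}, {dy:.0f}) "
            f"relative to the image center.")


def build_report(scene_text, uav_text, location_text):
    """Join scene context, UAV description, and location into one warning."""
    return (f"Scene Context: {scene_text} "
            f"UAV Description: {uav_text} "
            f"Location: {location_text}")


scene_text = SCENE_TEMPLATE.format(
    sky_description="a clear blue sky",
    foreground_description="an urban park",
    background_description="a row of high-rise buildings",
)
uav_text = "The detected object is a quadcopter UAV with a white body and black rotors."
report = build_report(scene_text, uav_text,
                      location_sentence(400, 200, 1920, 1080))
```

Keeping the location sentence outside the language model means the coordinates come straight from the tracker and cannot be altered by the generated text.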

Experimental Validation and System Performance

The proposed system was rigorously evaluated for both detection/tracking performance and the effectiveness of its semantic warning generation. Tests were conducted using a custom-built UAV dataset and real-world field deployments. To benchmark the core detector, YOLOv10 was compared against several predecessors, with results summarized in the table below. The metrics of interest are mean Average Precision (mAP, measuring accuracy) and Frames Per Second (FPS, measuring speed).

Algorithm	mAP (%)	FPS
YOLOv10 (Ours)	97.6	68.96
YOLOv8	88.3	59.5
YOLOv7	82.1	50.8
YOLOv4	83.9	28.0

The results show that YOLOv10 achieves both the highest accuracy (97.6% mAP) and the fastest processing speed (68.96 FPS) among the compared detectors. This performance is crucial for the real-time demands of an anti-UAV system, enabling it to reliably detect small, fast-moving targets. When integrated with the SORT tracker, the system maintained stable tracks on UAVs flying at speeds up to 23 m/s at distances up to 1 km while delivering smooth video output at 30 FPS, demonstrating robust tracking capability for dynamic anti-UAV scenarios.
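As a rough sanity check on these figures (an illustrative calculation from the reported numbers, not an additional experimental result), the per-frame detection time and the distance a 23 m/s target covers between frames are:

$$t_{\text{frame}} = \frac{1}{68.96\ \text{s}^{-1}} \approx 14.5\ \text{ms}, \qquad d_{\text{detect}} = 23\ \text{m/s} \times 14.5\ \text{ms} \approx 0.33\ \text{m}, \qquad d_{\text{output}} = \frac{23\ \text{m/s}}{30\ \text{s}^{-1}} \approx 0.77\ \text{m}$$

Even at the 30 FPS output rate, the fastest tested target therefore moves less than one meter between consecutive frames.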

The semantic generation module was tested for the relevance and accuracy of its textual output. The importance of the Prompt Template was verified through ablation studies. Without a template, descriptions were often generic or misaligned with the operational need. With the structured template, outputs became consistent and focused on security-relevant details. For example, a generated warning might read: “Scene Context: The view shows a clear blue sky above an urban park with several high-rise buildings in the distance. UAV Description: The detected object is a quadcopter UAV with a white body and black rotors. Location: The UAV is positioned in the central-right area of the screen, coordinates (120, -45) pixels.” This three-part description gives a commander immediate, bandwidth-light situational awareness that is far more informative than a simple alarm or data packet.

Conclusion

This paper presented a novel, intelligent real-time early warning system designed to address the dual challenges of reliable UAV detection and effective warning communication in diverse operational environments. By synergistically integrating the YOLOv10 object detector for high-speed visual perception, the SORT algorithm for robust motion tracking, and the GLM-4V multimodal model for semantic understanding, the system establishes a complete “detection-analysis-communication” pipeline for anti-UAV operations. Its key innovation is the dynamic adaptation of warning outputs based on network conditions, seamlessly switching between graphical data and compressed semantic text. Extensive experimental results confirm the system’s high accuracy (97.6% mAP), real-time performance, and effective tracking range. The ability to generate concise, context-rich textual warnings ensures operational continuity even in severely bandwidth-constrained scenarios, a common limitation in remote or tactical anti-UAV deployments. This work provides a practical and scalable framework for enhancing low-altitude security, offering a significant step towards more intelligent and resilient anti-UAV defense infrastructures.