The proliferation of uncrewed aerial vehicles (UAVs) has introduced significant security challenges, particularly from unauthorized or “rogue” flights in sensitive low-altitude airspace. While computer vision-based detection offers a cost-effective solution, its reliance on high-bandwidth video transmission becomes a critical bottleneck in the remote or communication-constrained environments typical of border security and critical infrastructure protection. To address this, we present an intelligent real-time anti-drone early-warning system that integrates the YOLOv10 object detection framework with the GLM-4V multimodal large language model. The system establishes a closed technological loop of “visual detection, semantic analysis, dynamic transmission,” enabling robust drone intrusion warning under varying network conditions.

Our hardware platform comprises high-sensitivity electro-optical detection equipment, including visible-light and thermal cameras, coordinated with a central computing unit. The software architecture integrates a client-side interactive platform built on PyQt5 and a core algorithmic module for multimodal data processing. At its heart, the YOLOv10 network performs rapid and precise drone localization. Its predictions are fed into a SORT (Simple Online and Realtime Tracking) module for robust multi-target tracking across video frames. The key innovation lies in our semantic analysis layer. We designed a structured Prompt template for the GLM-4V model, enabling it to convert visual detection results—including target bounding boxes, classified drone attributes, and contextual background information—into concise, human-readable textual descriptions in real-time.
This semantic transformation drastically reduces the data payload required for warning transmission. Under optimal network conditions, the system can transmit structured detection data (coordinates, confidence scores). In bandwidth-limited scenarios, it seamlessly switches to transmitting lightweight semantic text (e.g., “A black quadcopter with red warning lights is detected in the upper-left quadrant of the field of view, against a backdrop of clear sky and dense forestry”). Experimental validation under challenging conditions—target drones at distances up to 1 km and speeds reaching 23 m/s—demonstrates a mean Average Precision (mAP) of 97.6% for detection. The system effectively provides a high-robustness solution for anti-drone surveillance, balancing detection accuracy, real-time performance, and communication adaptability.
1. Introduction and System Overview
The strategic value of UAVs in sectors like aerial photography, surveying, and logistics is indisputable. However, this utility is paralleled by growing security threats from their malicious or negligent use near restricted airspace, military installations, and public venues. Traditional counter-Unmanned Aerial Systems (C-UAS) or anti-drone technologies encompass radar, radio frequency (RF) sensing, acoustic arrays, and electro-optical/infrared (EO/IR) systems. Among these, vision-based methods offer unique advantages: they are passive, provide rich contextual information, are relatively low-cost, and yield visually interpretable results, making them highly suitable for urban and perimeter security.
Recent advances in deep learning, particularly convolutional neural networks (CNNs), have dramatically improved the capability of visual anti-drone systems. The YOLO (You Only Look Once) family of algorithms, known for their speed and accuracy, is frequently employed. For instance, lightweight modifications to YOLOv5 or YOLOv7 have been explored to enhance performance on embedded platforms. However, a persistent challenge for purely visual systems is their operational dependency on continuous, high-bandwidth video feeds to a command center. In scenarios with poor or intermittent connectivity—such as remote borders, mountainous regions, or during electronic warfare—this dependency becomes a critical failure point, hindering real-time situational awareness and response.
Multimodal Large Language Models (MLLMs) like GLM-4V present a transformative opportunity. These models can comprehend and generate content across different modalities (text, image, audio). By leveraging an MLLM, a visual anti-drone system can go beyond simple bounding box output. It can generate semantically rich, textual descriptions of the scene, encapsulating not just the drone’s presence and location, but also its appearance, behavior, and environmental context. This textual information requires orders of magnitude less bandwidth to transmit than video streams or even compressed image frames.
Our proposed system synthesizes these technological strands. It utilizes a high-performance visual detection and tracking engine (YOLOv10+SORT) for reliable drone identification and kinematic estimation. Crucially, it incorporates a GLM-4V-based semantic analysis module that acts as an “intelligent compressor,” translating visual data into actionable textual alerts. The system’s operational workflow is depicted in the following high-level diagram and described thereafter.
The core process is as follows: The pan-tilt-zoom (PTZ) unit with dual-spectrum cameras continuously monitors the designated airspace. Captured video frames are streamed to the central processing computer. The YOLOv10 detector analyzes each frame to locate and classify drone targets. The SORT tracker associates detections across frames to maintain unique IDs and estimate trajectories. Simultaneously, relevant visual data (the full frame for context and a cropped image of the detected drone) is processed by the GLM-4V model, guided by structured prompts, to generate three textual elements: a description of the background environment, a description of the drone’s visual characteristics, and a textual report of its positional coordinates. All information—visual annotations and textual alerts—is displayed on the client interface. Based on the drone’s tracked pixel coordinates, the system can also generate control commands to slew the PTZ unit for automatic, persistent tracking of the intruder.
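The last step of this workflow, turning a tracked pixel position into a PTZ slew command, can be sketched as follows. This is a minimal illustration, not the system's actual controller: the function names, the field-of-view values, and the deadband are assumptions introduced here, and the linear pixel-to-angle mapping is only a small-angle approximation. The frame size and the tilt limits follow the hardware table.

```python
def slew_command(cx, cy, width=1920, height=1080,
                 hfov_deg=60.0, vfov_deg=34.0, deadband_px=20):
    """Return (pan_delta_deg, tilt_delta_deg) that would re-center the target.

    cx, cy: target center in pixels, origin at the top-left corner (assumed).
    hfov_deg, vfov_deg: hypothetical camera field of view, not from the source.
    """
    dx = cx - width / 2.0          # positive: target is right of center
    dy = height / 2.0 - cy         # positive: target is above center
    # Ignore small offsets so the PTZ unit does not jitter around the target.
    pan = 0.0 if abs(dx) <= deadband_px else dx / width * hfov_deg
    tilt = 0.0 if abs(dy) <= deadband_px else dy / height * vfov_deg
    return pan, tilt


def apply_tilt_limit(current_tilt_deg, delta_deg, lo=-60.0, hi=60.0):
    """Clamp the new tilt angle to the mechanical range of the PTZ unit."""
    return max(lo, min(hi, current_tilt_deg + delta_deg))


print(slew_command(1440, 540))
print(apply_tilt_limit(55.0, 10.0))
```

A proportional mapping like this is the simplest choice; a fielded controller would likely add rate limits and compensate for the target's velocity estimate from the tracker.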
2. Algorithmic Core: Detection, Tracking, and Semantic Analysis
2.1 Enhanced Detection and Tracking with YOLOv10 and SORT
To meet the stringent real-time demands of an anti-drone early-warning system, we selected a tracking-by-detection paradigm. The SORT algorithm provides a lightweight yet effective framework for multi-object tracking. We strengthened its upstream detection component by replacing the two-stage detector used in the original SORT work (Faster R-CNN) with the one-stage YOLOv10 network, which improves both speed and accuracy.
YOLOv10 as the Detector: YOLOv10 introduces architectural refinements over its predecessors, including efficiency-oriented model design choices and an NMS-free training strategy (consistent dual label assignments) that removes non-maximum suppression (NMS) from inference and thereby reduces latency. For our anti-drone task, it offers a favorable trade-off, achieving high precision on small, distant drones while maintaining the frame rates necessary for real-time response.
SORT for Tracking: The SORT algorithm operates in a simple yet effective cycle of detection, prediction, and association. Its workflow is:
1. Detection: For each frame at time \(t\), the YOLOv10 network generates a set of detections \(D_t\), each defined by a bounding box \( (x, y, w, h) \).
2. Prediction: A Kalman Filter predicts the new location of all existing tracks \(T_{t-1}\) from the previous frame. It models each target’s motion state (position, velocity) and propagates that state forward under a constant-velocity model.
3. Data Association: The Hungarian algorithm is used to associate the predicted bounding boxes from the Kalman Filter with the newly detected bounding boxes from YOLOv10, minimizing the overall IoU (Intersection over Union) distance or similar cost metric.
4. Update: Associated detections are used to update the corresponding track’s state in the Kalman Filter. Unassociated detections initiate new tracks, while unassociated tracks are tentatively kept for a short period before being removed.
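The data association step can be illustrated with a small, dependency-free sketch. It builds a cost matrix of 1 − IoU between predicted track boxes and new detections and then finds the minimum-cost one-to-one assignment. For brevity the assignment is found by exhaustive search over permutations, which gives the same optimum as the Hungarian algorithm on small inputs but does not scale; a production tracker would use a real Hungarian solver. The box convention (top-left corner plus width and height), the function names, and the `iou_min` gate of 0.3 are assumptions made for this example.

```python
from itertools import permutations


def iou(a, b):
    """IoU of two boxes given as (x, y, w, h), with (x, y) the top-left corner."""
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    iw = max(0.0, min(ax2, bx2) - max(a[0], b[0]))
    ih = max(0.0, min(ay2, by2) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def associate(tracks, detections, iou_min=0.3):
    """Return (matches, unmatched_track_ids, unmatched_detection_ids)."""
    cost = [[1.0 - iou(t, d) for d in detections] for t in tracks]
    n_t, n_d = len(tracks), len(detections)
    swap = n_t > n_d  # assign the smaller set into the larger one
    small, large = (n_d, n_t) if swap else (n_t, n_d)
    best, best_cost = (), float("inf")
    for perm in permutations(range(large), small):
        total = sum(cost[k][r] if swap else cost[r][k]
                    for r, k in enumerate(perm))
        if total < best_cost:
            best, best_cost = perm, total
    pairs = [(k, r) if swap else (r, k) for r, k in enumerate(best)]
    # Reject assigned pairs whose overlap is too small to be the same target.
    matches = sorted((t, d) for t, d in pairs if 1.0 - cost[t][d] >= iou_min)
    matched_t = {t for t, _ in matches}
    matched_d = {d for _, d in matches}
    unmatched_t = [t for t in range(n_t) if t not in matched_t]
    unmatched_d = [d for d in range(n_d) if d not in matched_d]
    return matches, unmatched_t, unmatched_d


predicted = [(0, 0, 10, 10), (100, 100, 10, 10)]
detected = [(101, 101, 10, 10), (1, 1, 10, 10), (500, 500, 10, 10)]
print(associate(predicted, detected))
```

In this example both existing tracks are matched to their nearby detections, and the third detection is left unmatched, so it would start a new track in step 4.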
The Kalman Filter is central to SORT’s predictive capability. Its recursive process involves two main steps: Prediction and Update.
Prediction Step: This step forecasts the current state based on the previous state.
$$ \hat{\mathbf{x}}_{t|t-1} = \mathbf{F}_t \hat{\mathbf{x}}_{t-1|t-1} + \mathbf{B}_t \mathbf{u}_t $$
$$ \mathbf{P}_{t|t-1} = \mathbf{F}_t \mathbf{P}_{t-1|t-1} \mathbf{F}_t^T + \mathbf{Q}_t $$
Here, \(\hat{\mathbf{x}}_{t|t-1}\) is the prior state estimate (predicted position and velocity), \(\mathbf{F}_t\) is the state transition matrix, \(\hat{\mathbf{x}}_{t-1|t-1}\) is the posterior estimate from the last step, \(\mathbf{B}_t\) is the control-input model, and \(\mathbf{u}_t\) is the control vector. \(\mathbf{P}_{t|t-1}\) is the prior estimate covariance, and \(\mathbf{Q}_t\) is the process noise covariance. Because the tracker has no knowledge of the drone’s control inputs, the \(\mathbf{B}_t \mathbf{u}_t\) term is taken as zero under the constant-velocity model.
Update Step: This step refines the prediction using the new measurement (detection).
$$ \mathbf{y}_t = \mathbf{z}_t - \mathbf{H}_t \hat{\mathbf{x}}_{t|t-1} $$
$$ \mathbf{S}_t = \mathbf{H}_t \mathbf{P}_{t|t-1} \mathbf{H}_t^T + \mathbf{R}_t $$
$$ \mathbf{K}_t = \mathbf{P}_{t|t-1} \mathbf{H}_t^T \mathbf{S}_t^{-1} $$
$$ \hat{\mathbf{x}}_{t|t} = \hat{\mathbf{x}}_{t|t-1} + \mathbf{K}_t \mathbf{y}_t $$
$$ \mathbf{P}_{t|t} = (\mathbf{I} - \mathbf{K}_t \mathbf{H}_t) \mathbf{P}_{t|t-1} $$
Here, \(\mathbf{z}_t\) is the measurement vector (detected bounding box), \(\mathbf{H}_t\) is the observation matrix, \(\mathbf{R}_t\) is the measurement noise covariance, \(\mathbf{y}_t\) is the innovation, \(\mathbf{S}_t\) is its covariance, and \(\mathbf{K}_t\) is the optimal Kalman gain. The final outputs \(\hat{\mathbf{x}}_{t|t}\) and \(\mathbf{P}_{t|t}\) are the updated state estimate and its covariance.
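To make the two steps concrete, the sketch below applies the prediction and update equations to a deliberately reduced case: a single image coordinate with state (position, velocity), \(\mathbf{F} = \begin{pmatrix} 1 & \Delta t \\ 0 & 1 \end{pmatrix}\), \(\mathbf{H} = (1 \;\; 0)\), and no control input. SORT itself tracks a larger bounding-box state, so this is an illustration of the recursion only; the noise values `q` and `r` and the initial covariance are assumptions chosen for the example. The 2×2 matrix products are written out by hand to keep the block free of dependencies.

```python
def predict(x, P, dt=1.0, q=1e-2):
    """Prior estimate: x = F x, P = F P F^T + Q, with Q = q * I (assumed)."""
    p, v = x
    (a, b), (c, d) = P
    x_prior = (p + dt * v, v)
    P_prior = (
        (a + dt * (b + c) + dt * dt * d + q, b + dt * d),
        (c + dt * d, d + q),
    )
    return x_prior, P_prior


def update(x_prior, P_prior, z, r=1.0):
    """Posterior estimate from a position measurement z with noise variance r."""
    p, v = x_prior
    (a, b), (c, d) = P_prior
    y = z - p                    # innovation: y = z - H x
    s = a + r                    # innovation covariance: S = H P H^T + R
    k0, k1 = a / s, c / s        # Kalman gain: K = P H^T S^-1
    x_post = (p + k0 * y, v + k1 * y)
    # P = (I - K H) P
    P_post = (((1 - k0) * a, (1 - k0) * b), (c - k1 * a, d - k1 * b))
    return x_post, P_post


# Target moving at a constant 5 pixels per frame, observed without noise.
x, P = (0.0, 0.0), ((10.0, 0.0), (0.0, 10.0))
for t in range(1, 31):
    x, P = predict(x, P)
    x, P = update(x, P, z=5.0 * t)
print(x)
```

Although the filter starts with zero velocity, the velocity component converges toward 5 pixels per frame within a few updates, which is what lets the tracker predict where a box should appear in the next frame.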
This combination allows our anti-drone system to not only detect drones in a single frame but also maintain their identity and smooth their trajectory across the video sequence, which is vital for assessing threat level and guiding countermeasures.
2.2 Multimodal Semantic Analysis for Alert Generation
The second pillar of our intelligent anti-drone system is the semantic interpretation of the visual scene using the GLM-4V multimodal model. This module addresses the bandwidth challenge by converting visual data into compact, informative text.
Structured Prompting for Context Description: To generate accurate and consistent descriptions of the background environment, we employ engineered Prompt templates. Directly querying an MLLM with an image can yield generic descriptions. By using a structured template, we guide the model to focus on specific, relevant aspects of the scene crucial for situational assessment in an anti-drone context.
Example Template: *“Describe the background scene in this image. The current view shows {…}. The sky appears {…}. In the distance, one can see {…}. In the foreground, there is {…}.”*
Feeding an image along with this template to the GLM-4V API yields a structured description like: *“The current view shows a dense forest with lush tree canopies. The sky is a clear blue without clouds. In the distance, one can see the top of a structure, likely a tower or lookout post. In the foreground, the outlines of trees are clearly visible against the sky.”* This provides immediate context about the drone’s operating environment (e.g., urban, rural, over water).
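One way to keep such a template consistent across requests is to assemble it from a list of slots and to check that a reply follows the expected order. The sketch below does only that string handling; it makes no call to GLM-4V, and the names `SLOTS`, `build_background_prompt`, and `follows_template` are hypothetical. The check matches only the opening words of each slot because, as the example reply shows, the model may rephrase the rest of the lead-in.

```python
# Each slot is (anchor, rest): the anchor is the part a reply is expected to
# reproduce, the rest completes the lead-in used in the prompt.
SLOTS = [
    ("The current view shows", ""),
    ("The sky", " appears"),
    ("In the distance", ", one can see"),
    ("In the foreground", ", there is"),
]


def build_background_prompt(slots=SLOTS):
    """Assemble the background-description prompt from the slot list."""
    parts = ["Describe the background scene in this image."]
    parts += [f"{anchor}{rest} {{...}}." for anchor, rest in slots]
    return " ".join(parts)


def follows_template(reply, slots=SLOTS):
    """Return True if every slot anchor appears in the reply, in order."""
    pos = 0
    for anchor, _ in slots:
        i = reply.find(anchor, pos)
        if i < 0:
            return False
        pos = i + len(anchor)
    return True


print(build_background_prompt())
```

A reply that fails the check could be retried or flagged before it is sent as an alert.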
Drone-Centric Description Generation: To describe the intruder itself, a two-step process is used:
1. The bounding box coordinates from YOLOv10 are used to crop the drone from the original image. This close-up image focuses the MLLM on the target.
2. This cropped image is fed to GLM-4V, often with a simpler prompt like “Describe this unmanned aerial vehicle in detail.” The model then generates a description of the drone’s physical attributes, for example: *“The UAV has four rotors, a black body, and features red indicator lights.”* This aids in drone type recognition and forensic logging.
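The cropping in step 1 is simple, but it needs to clamp the box to the frame so that detections near the image border do not produce invalid slices. A minimal sketch, assuming the frame is a row-major list of pixel rows and the box is (x, y, w, h) with a top-left origin; the `margin` parameter, which keeps a little context around a small target, is an addition for illustration and is not described in the source.

```python
def crop_box(frame, box, margin=0):
    """Crop a detection from a frame, clamped to the frame boundaries.

    frame: list of rows, each row a list of pixels.
    box: (x, y, w, h) in pixels, (x, y) being the top-left corner (assumed).
    margin: extra pixels kept on every side (hypothetical parameter).
    """
    height = len(frame)
    width = len(frame[0]) if height else 0
    x, y, w, h = box
    x0 = max(0, int(x) - margin)
    y0 = max(0, int(y) - margin)
    x1 = min(width, int(x + w) + margin)
    y1 = min(height, int(y + h) + margin)
    return [row[x0:x1] for row in frame[y0:y1]]


# Tiny synthetic frame whose pixels record their own (row, column) position.
frame = [[(r, c) for c in range(8)] for r in range(6)]
patch = crop_box(frame, (2, 1, 3, 2))
print(len(patch), len(patch[0]), patch[0][0])
```

The same slicing logic carries over directly to array-based image types.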
Positional Alert Formulation: The system automatically generates a precise textual location alert. The image coordinate system is defined with the origin (0,0) at the frame’s center. For a detection at pixel coordinates (x, y) in this system, the alert names the quadrant containing the target and reports the offsets, for example: *“Drone detected in the upper-left quadrant. Relative to image center, coordinates are (x, y) pixels.”* This provides actionable location data without transmitting the image.
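A short sketch of this formatting step follows. It assumes the detector reports the box center in the usual top-left-origin pixel coordinates and that the center-origin system has x increasing to the right and y increasing upward; the source does not state the axis directions, so both are assumptions, as is the function name.

```python
def position_alert(cx, cy, width, height):
    """Format the location alert for a box center given in top-left-origin pixels."""
    x = round(cx - width / 2.0)     # offset to the right of the image center
    y = round(height / 2.0 - cy)    # offset above the image center
    vertical = "upper" if y >= 0 else "lower"
    horizontal = "right" if x >= 0 else "left"
    return (f"Drone detected in the {vertical}-{horizontal} quadrant. "
            f"Relative to image center, coordinates are ({x}, {y}) pixels.")


print(position_alert(480, 270, 1920, 1080))
```

Points lying exactly on an axis are assigned to the upper or right side here; any fixed convention works as long as the receiving end knows it.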
The synthesis of these three textual components—background context, drone appearance, and precise location—forms a comprehensive semantic alert that can be transmitted as a small data packet, enabling effective anti-drone warning even over low-bandwidth tactical networks.
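To give a sense of scale, the sketch below packs the three text components into a compact JSON message and compares its size with one uncompressed frame from the visible camera. The field names and the JSON encoding are assumptions for illustration; the source does not specify the packet format.

```python
import json


def build_alert(track_id, background, drone, position):
    """Serialize one semantic alert as compact UTF-8 JSON (hypothetical format)."""
    alert = {
        "id": track_id,
        "background": background,
        "drone": drone,
        "position": position,
    }
    return json.dumps(alert, ensure_ascii=False,
                      separators=(",", ":")).encode("utf-8")


packet = build_alert(
    7,
    "Dense forest under a clear blue sky; a tower is visible in the distance.",
    "The UAV has four rotors, a black body, and features red indicator lights.",
    "Drone detected in the upper-left quadrant. "
    "Relative to image center, coordinates are (-480, 270) pixels.",
)
# One uncompressed 8-bit RGB frame at 1920x1080, for comparison only.
raw_frame_bytes = 1920 * 1080 * 3
print(len(packet), raw_frame_bytes)
```

Real video links carry compressed streams, so the practical gap is smaller than this raw comparison suggests, but an alert of a few hundred bytes per event remains far below the cost of any continuous video feed.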
3. System Implementation and Experimental Validation
3.1 Hardware and Software Configuration
The practical deployment of our anti-drone system relies on a robust hardware-software stack. The key components are summarized below.
| Component | Model/Specification | Key Parameters |
|---|---|---|
| Thermal Camera | XCore FTII Series | Resolution: 1280×1024 pixels |
| Visible Camera | UV-ZN4237 | Resolution: 1920×1080 pixels |
| Processing Unit | Desktop Computer | CPU: Intel i7-12700; GPU: NVIDIA GTX Series |
| PTZ Mechanism | Custom Assembly | Range: Pan 360°, Tilt -60° to +60° |
The client software is developed using PyQt5, providing an intuitive graphical user interface (GUI). The main interface is divided into functional panels: a control panel for source selection (image/video/camera), a display panel showing original and processed video with tracking annotations, a dedicated panel for zoomed drone view, and a text panel presenting the generated semantic alerts (background, drone info, location).
3.2 Performance Evaluation of Detection and Tracking
We evaluated our detection algorithm on a custom-built dataset containing various drone models (e.g., DJI Mavic Air 2, Phantom 4 Pro) in diverse environments. We compared YOLOv10 against several prominent predecessors. The performance metrics are mean Average Precision (mAP@0.5) and frames per second (FPS), measured on a test bed with an NVIDIA GTX 1080 Ti GPU.
| Algorithm | mAP@0.5 (%) | FPS | Notes |
|---|---|---|---|
| YOLOv10 (Ours) | 97.6 | 68.96 | Best balance of accuracy and speed |
| YOLOv8 | 88.3 | 59.50 | 9.3 points lower mAP than v10 (v10 is ~10.5% higher in relative terms) |
| YOLOv7 | 82.1 | 50.80 | 15.5 points lower mAP than v10 (v10 is ~18.9% higher in relative terms) |
| YOLOv4 | 83.9 | 28.00 | Significantly slower inference |
The results clearly demonstrate YOLOv10’s superiority for this anti-drone application, offering a 9.3 percentage point improvement in mAP over YOLOv8 and a 35.7% increase in FPS over YOLOv7. This ensures both high detection reliability and the real-time throughput needed for tracking. The integrated YOLOv10+SORT pipeline successfully maintained track on drones flying at speeds up to 23 m/s at ranges approaching 1 km, with a stable output of 30 FPS, confirming its operational readiness for field deployment.
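For reference, the comparative figures follow directly from the table values. The absolute mAP gaps are differences in percentage points, while the relative gains are taken with respect to the older model:

$$ 97.6 - 88.3 = 9.3, \qquad \frac{97.6 - 88.3}{88.3} \approx 0.105 $$
$$ 97.6 - 82.1 = 15.5, \qquad \frac{97.6 - 82.1}{82.1} \approx 0.189 $$
$$ \frac{68.96 - 50.80}{50.80} \approx 0.357 $$

That is, YOLOv10 is about 10.5% and 18.9% higher in relative mAP than YOLOv8 and YOLOv7, respectively, and about 35.7% faster than YOLOv7 in FPS.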
3.3 Effectiveness of Semantic Alert Generation
The utility of the Prompt template was validated through comparative tests. Without a template, GLM-4V’s descriptions could be vague or omit critical contextual details. With our engineered template, the output became structured, comprehensive, and focused on elements relevant for security assessment (sky conditions, terrain, man-made structures). The drone-centric descriptions from cropped images were consistently accurate in identifying basic features like rotor count and color. The positional text, derived directly from the detection coordinates, provided unambiguous location data. In field tests conducted in a campus environment, the system successfully detected, tracked, and generated accurate semantic alerts for drones during both day and night operations (using the thermal camera at night).
4. Discussion and Conclusion
The proposed system represents a significant step forward in adaptive anti-drone technology. By fusing state-of-the-art visual perception with advanced multimodal semantic understanding, it overcomes a fundamental limitation of traditional vision-based systems: their vulnerability to degraded communication links. The system intelligently adapts its output modality based on network availability. In high-bandwidth scenarios, it provides rich visual feedback. When bandwidth is constrained, it switches to transmitting lightweight semantic alerts that encapsulate the essential information: “what is where, and what does it look like in what context.”
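The switching behavior described above can be sketched as a two-state selector. Everything specific in this example is an assumption: the source does not say how bandwidth is measured or which thresholds are used, and the hysteresis band (separate thresholds for switching down and back up) is a design choice added here to avoid rapid flapping on an unstable link.

```python
class ModeSelector:
    """Choose between visual output and semantic text from measured bandwidth."""

    def __init__(self, low_kbps=256.0, high_kbps=1024.0):
        # Hypothetical thresholds: drop to text below low_kbps,
        # return to visual output only above high_kbps.
        self.low_kbps = low_kbps
        self.high_kbps = high_kbps
        self.mode = "visual"

    def update(self, measured_kbps):
        """Update the mode from the latest bandwidth estimate and return it."""
        if self.mode == "visual" and measured_kbps < self.low_kbps:
            self.mode = "semantic_text"
        elif self.mode == "semantic_text" and measured_kbps > self.high_kbps:
            self.mode = "visual"
        return self.mode


selector = ModeSelector()
modes = [selector.update(kbps) for kbps in (2000, 200, 500, 1500)]
print(modes)
```

In the sample sequence the link recovers to 500 kbps after the drop, but the selector stays in text mode until the estimate clears the upper threshold.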
The core innovation is the structured use of an MLLM as an intelligent information compressor and interpreter within an anti-drone pipeline. This approach offers several key advantages:
1. Bandwidth Resilience: Enables continuous operation and alerting in low-bandwidth or intermittent connectivity environments, which are common in many critical security scenarios.
2. Enhanced Situational Awareness: Textual descriptions provide intuitive summaries that can be quickly parsed by human operators and easily logged or shared via text-based communication channels (e.g., tactical radios, satellite SMS).
3. Actionable Intelligence: The combination of precise coordinates, drone description, and environmental context supports better decision-making for threat assessment and response coordination.
Future work will focus on several enhancements. Firstly, optimizing the GLM-4V inference for lower latency on edge computing devices is crucial for fully decentralized operations. Secondly, expanding the semantic analysis to include intent recognition (e.g., “drone is loitering,” “drone is approaching a sensitive perimeter at high speed”) would add a predictive layer to the warning system. Finally, integrating this visual-semantic subsystem with other sensing modalities like RF detection in a sensor fusion framework would create a more comprehensive and robust anti-drone defense network.
In conclusion, this intelligent real-time anti-drone early-warning system successfully bridges the gap between high-performance visual detection and the practical demands of field deployment in communication-constrained environments. Its hybrid architecture, leveraging YOLOv10 for speed and accuracy and GLM-4V for semantic compression, provides a scalable, robust, and practical solution for safeguarding low-altitude airspace against unauthorized UAV intrusions.
