Real-time Anti-Drone Early Warning System Based on Intelligent Visual Detection and Semantic Analysis

In recent years, the rapid advancement of drone technology has revolutionized various sectors, including surveillance, agriculture, and logistics, by providing cost-effective and flexible solutions. Unmanned Aerial Vehicles (UAVs) have become integral to modern applications due to their ability to access hard-to-reach areas and perform tasks with high precision. However, the proliferation of drone technology has also introduced significant security risks, such as unauthorized flights over sensitive areas, illegal surveillance, and potential threats to public safety. These incidents, often referred to as “black flights” or “erratic flights,” highlight the urgent need for robust counter-drone systems capable of real-time detection and warning. Traditional methods for drone detection, including acoustic, radio frequency, and radar-based systems, face limitations in complex environments, such as high false alarm rates, susceptibility to interference, and high costs. Optical detection techniques, leveraging computer vision, offer a promising alternative due to their low cost, flexibility, and intuitive results. However, in communication-constrained scenarios like border surveillance or military operations, transmitting high-bandwidth video data for real-time analysis becomes challenging. To address this, we propose an innovative anti-drone early warning system that integrates the YOLOv10 object detection algorithm with the GLM-4V multimodal large model. This system not only achieves high accuracy in detecting and tracking Unmanned Aerial Vehicles but also generates semantic descriptions of detected targets, reducing data transmission loads in low-bandwidth conditions while maintaining real-time performance.

The core of our system lies in its hardware and software architecture, designed to handle the demands of real-time drone detection and semantic analysis. The hardware comprises high-sensitivity electro-optical detection equipment, including infrared and visible-light cameras, paired with a central computing unit. The infrared camera, such as the XCore FTII series, excels in low-visibility conditions, while the visible-light camera captures high-resolution images during daytime operations. The main computer, equipped with a powerful processor, runs the client software that integrates the YOLOv10 algorithm for target detection, the SORT algorithm for trajectory tracking, and the GLM-4V model for multimodal data processing. This setup ensures that the system can process video streams efficiently, even under extreme conditions, such as detecting drones at distances up to 1 km and speeds of 23 m/s. The software, built on the PyQt5 framework, provides an intuitive user interface for real-time visualization, control, and text-based warnings. By combining these elements, our system forms a closed-loop process of “visual detection-semantic analysis-dynamic transmission,” enabling adaptive responses to varying network conditions. In bandwidth-rich environments, structured data like bounding boxes and confidence scores are transmitted, whereas in low-bandwidth scenarios, semantic text descriptions generated by GLM-4V are used, significantly reducing communication overhead without compromising alert accuracy.
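The bandwidth-adaptive transmission step described above can be sketched as a simple payload selector. This is a minimal illustration, not the system's actual protocol: the 256 kbps threshold and the JSON field names are assumptions introduced here.

```python
import json

# Assumed cutoff below which the system falls back to text-only alerts.
LOW_BANDWIDTH_KBPS = 256

def build_payload(detection, description, bandwidth_kbps):
    """Return the bytes to transmit for one detection event."""
    if bandwidth_kbps >= LOW_BANDWIDTH_KBPS:
        # Bandwidth-rich: send structured data (bounding box, confidence).
        return json.dumps({"bbox": detection["bbox"],
                           "conf": detection["conf"],
                           "class": detection["class"]}).encode()
    # Low bandwidth: send only the GLM-4V semantic text description.
    return description.encode()

detection = {"bbox": [412, 96, 455, 130], "conf": 0.93, "class": "drone"}
description = "Small black quadrotor in upper-left quadrant, moving east."

rich = build_payload(detection, description, bandwidth_kbps=2000)
lean = build_payload(detection, description, bandwidth_kbps=64)
```

In a deployment the threshold would be driven by a live link-quality estimate rather than a constant.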

To achieve real-time detection and tracking of Unmanned Aerial Vehicles, we employ the YOLOv10 algorithm as the primary object detection network, known for its balance of speed and accuracy. YOLOv10 builds upon previous versions by incorporating architectural improvements that reduce computational complexity while maintaining high detection performance. For instance, it utilizes a lightweight backbone and enhanced feature fusion mechanisms, allowing it to process images at high frame rates. The detection process involves dividing the input image into a grid and predicting bounding boxes, class probabilities, and objectness scores for each grid cell. The output includes the coordinates of detected drones, which are then fed into the SORT algorithm for tracking. SORT uses a Kalman filter to model the motion of targets and predict their future positions, combined with the Hungarian algorithm for data association between frames. This ensures consistent tracking of multiple drones across video sequences, even in dynamic environments. The Kalman filter equations are central to this process, as shown below:

$$ \hat{x}_t^- = F \hat{x}_{t-1} + B u_{t-1} $$

$$ P_t^- = F P_{t-1} F^T + Q $$

$$ Z_t = H x_t + e_t $$

$$ K_t = P_t^- H^T (H P_t^- H^T + R)^{-1} $$

$$ \hat{x}_t = \hat{x}_t^- + K_t (Z_t - H \hat{x}_t^-) $$

$$ P_t = (I - K_t H) P_t^- $$

Here, $\hat{x}_t^-$ represents the prior state estimate, $F$ is the state transition matrix, $B$ is the control matrix, $u_{t-1}$ is the control input, $P_t^-$ is the prior error covariance, $Q$ is the process noise covariance, $Z_t$ is the measurement vector, $H$ is the observation matrix, $e_t$ is the measurement noise, $P_t$ is the posterior error covariance, $K_t$ is the Kalman gain, and $R$ is the measurement noise covariance. These equations enable the system to predict and update the position of drones accurately, facilitating robust tracking in real-time. The integration of YOLOv10 and SORT ensures that our system can handle high-speed Unmanned Aerial Vehicles with minimal latency, making it suitable for critical applications like border security and infrastructure protection.
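The two SORT building blocks discussed above can be sketched directly from these equations: a Kalman filter in predict/update form, and Hungarian data association on an IoU cost matrix. This is an illustrative simplification, assuming a 4-D constant-velocity state `[x, y, vx, vy]` rather than SORT's usual bounding-box state, and illustrative noise covariances.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

dt = 1.0
F = np.array([[1, 0, dt, 0],   # state transition: position += velocity * dt
              [0, 1, 0, dt],
              [0, 0, 1, 0],
              [0, 0, 0, 1]], dtype=float)
H = np.array([[1, 0, 0, 0],    # we observe position only
              [0, 1, 0, 0]], dtype=float)
Q = np.eye(4) * 1e-2           # process noise covariance (assumed)
R = np.eye(2) * 1.0            # measurement noise covariance (assumed)

def predict(x, P):
    x = F @ x                  # prior state estimate (no control input)
    P = F @ P @ F.T + Q        # prior error covariance
    return x, P

def update(x, P, z):
    K = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)  # Kalman gain
    x = x + K @ (z - H @ x)                       # posterior state estimate
    P = (np.eye(4) - K @ H) @ P                   # posterior error covariance
    return x, P

def iou(a, b):
    """Intersection over union of two boxes in [x1, y1, x2, y2] form."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def associate(track_boxes, det_boxes):
    """Hungarian assignment maximizing total IoU between tracks and detections."""
    cost = np.array([[1.0 - iou(t, d) for d in det_boxes] for t in track_boxes])
    rows, cols = linear_sum_assignment(cost)
    return list(zip(rows, cols))

# Track one target moving at constant velocity (1 px/frame in x and y).
x, P = np.array([0.0, 0.0, 1.0, 1.0]), np.eye(4)
for t in range(1, 11):
    x, P = predict(x, P)
    x, P = update(x, P, np.array([float(t), float(t)]))

matches = associate([[0, 0, 10, 10], [20, 20, 30, 30]],
                    [[21, 21, 31, 31], [1, 1, 11, 11]])
```

The association step correctly re-pairs each track with the detection that overlaps it, which is what keeps drone identities stable across frames.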

In addition to visual detection, our system incorporates the GLM-4V multimodal large model to generate semantic descriptions of detected drones and their surroundings. This addresses the challenge of transmitting large amounts of visual data in bandwidth-limited scenarios. GLM-4V is capable of processing multiple data types, including images and text, allowing it to generate concise and informative descriptions based on visual inputs. We design structured prompt templates to guide the model in producing accurate and relevant text outputs. For example, a prompt template for background description might be: “The current view shows {sky description}, with {elements in the sky}. In the distance, {distant objects} are visible, while in the foreground, {nearby objects} can be seen.” This structured approach ensures that the generated descriptions are consistent and focused on key elements, such as the drone’s location, appearance, and environmental context. Similarly, for drone-specific information, we crop the detected drone regions from the image and input them into GLM-4V with prompts like “Describe the drone’s features, including its color, propellers, and any markings.” The output text, which includes coordinates and semantic details, is then used for warnings, reducing the data load compared to transmitting full images or videos. This multimodal processing not only enhances the system’s adaptability but also provides human-readable insights that aid in decision-making, such as assessing the intent of a drone’s flight path.
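The prompt-construction and cropping steps above can be sketched as follows. Only the templating and region-cropping logic is shown; the GLM-4V call itself is omitted, and the template field names are assumptions based on the example prompt.

```python
import numpy as np

BACKGROUND_TEMPLATE = (
    "The current view shows {sky_description}, with {sky_elements}. "
    "In the distance, {distant_objects} are visible, while in the "
    "foreground, {nearby_objects} can be seen."
)
DRONE_PROMPT = ("Describe the drone's features, including its color, "
                "propellers, and any markings.")

def build_background_prompt(scene):
    """Fill the structured background template from scene attributes."""
    return BACKGROUND_TEMPLATE.format(**scene)

def crop_detection(frame, bbox):
    """Crop a detected region; bbox is (x1, y1, x2, y2) in pixels."""
    x1, y1, x2, y2 = bbox
    return frame[y1:y2, x1:x2]

scene = {"sky_description": "a clear blue sky",
         "sky_elements": "scattered clouds",
         "distant_objects": "rooftops",
         "nearby_objects": "trees"}
prompt = build_background_prompt(scene)

frame = np.zeros((720, 1280, 3), dtype=np.uint8)    # stand-in video frame
patch = crop_detection(frame, (412, 96, 455, 130))  # region fed to GLM-4V
```

Structuring the prompt this way is what keeps the generated descriptions consistent across frames: the model fills fixed slots instead of free-associating about the scene.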

To evaluate the performance of our system, we conducted extensive experiments using a custom dataset of drone images and videos. The dataset included various scenarios, such as different lighting conditions, backgrounds, and drone models, to test the robustness of the detection and tracking algorithms. We compared YOLOv10 with other state-of-the-art detectors, including YOLOv8, YOLOv7, and YOLOv4, using metrics like mean Average Precision (mAP) and Frames Per Second (FPS). The results, summarized in the table below, demonstrate the superiority of YOLOv10 in terms of both accuracy and speed.

| Algorithm | mAP (%) | FPS (frames/s) |
|-----------|---------|----------------|
| YOLOv10   | 97.6    | 68.96          |
| YOLOv8    | 88.3    | 59.5           |
| YOLOv7    | 82.1    | 50.8           |
| YOLOv4    | 83.9    | 28.0           |

As shown, YOLOv10 achieved an mAP of 97.6%, outperforming YOLOv8 by 9.3 percentage points and YOLOv7 by 15.5 percentage points. In terms of speed, YOLOv10 reached 68.96 FPS, which is 15.9% faster than YOLOv8 and 35.7% faster than YOLOv7. This high performance ensures that the system can detect and track drones in real-time, even under challenging conditions. For tracking, the SORT algorithm maintained stable trajectories with green bounding boxes visualized in the output, as illustrated in the experimental results. In tests involving drones flying at speeds up to 23 m/s and distances of 1 km, the system consistently achieved a frame rate of 30 FPS, confirming its capability for real-world deployment. These tests demonstrate the system's effectiveness in dynamic scenarios, which is crucial for applications requiring immediate response to potential threats.

Furthermore, we assessed the semantic description generation using GLM-4V with and without prompt templates. The introduction of structured prompts significantly improved the accuracy and relevance of the generated text. For instance, without prompts, descriptions might be vague or miss critical details, but with prompts, the model produced coherent outputs like “The current view features a clear blue sky with a small black drone equipped with four propellers, located in the upper-left quadrant at coordinates (-547.0, 131.5) pixels.” This demonstrates how prompt engineering enhances the multimodal model’s utility in practical settings. The table below summarizes the impact of prompt templates on description quality, based on qualitative evaluations.

| Scenario | Without Prompt | With Prompt |
|----------|----------------|-------------|
| Background description | Vague or irrelevant details | Structured and context-aware |
| Drone information | Incomplete features | Detailed attributes (e.g., color, propellers) |
| Position accuracy | Approximate locations | Precise coordinates and orientation |

These findings underscore the importance of multimodal integration in reducing bandwidth requirements while maintaining informative outputs. In low-bandwidth environments, transmitting such text descriptions instead of raw images can save up to 90% of data volume, based on typical compression ratios, making the system highly efficient for remote or resource-constrained settings and well suited to diverse operational contexts.
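A back-of-the-envelope calculation illustrates the scale of the saving. The 150 kB figure for a compressed 1080p JPEG frame is an illustrative assumption, not a measurement from the experiments.

```python
# Compare a UTF-8 text alert against a typical compressed frame size.
alert = ("Detected a drone in the upper-left area with coordinates "
         "(-500, 120); the drone has a black body and red lights, "
         "flying over a wooded background.")
text_bytes = len(alert.encode("utf-8"))
jpeg_bytes = 150 * 1024          # assumed compressed 1080p JPEG frame

# Fraction of data volume saved by sending text instead of the frame.
saving = 1.0 - text_bytes / jpeg_bytes
```

Even against a single aggressively compressed frame, the text alert is orders of magnitude smaller, which is why the saving comfortably exceeds the 90% figure quoted above.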

Finally, we integrated the entire system into a client software platform for real-world testing. The software, developed with PyQt5, includes modules for image input (from cameras, videos, or pictures), detection visualization, and text display. Users can interact with the interface to monitor detected drones, view their trajectories, and read semantic warnings. During field tests in campus environments, the system successfully detected and tracked Unmanned Aerial Vehicles like the DJI Phantom 4 Pro and Mavic Air 2, both during daytime and nighttime operations. The hardware, with its pan-tilt capabilities, allowed the cameras to follow drones smoothly, while the software provided real-time updates on drone positions and descriptions. For example, in one test, the system generated a warning text: “Detected a drone in the upper-left area with coordinates (-500, 120); the drone has a black body and red lights, flying over a wooded background.” This level of detail enables operators to make informed decisions quickly, whether for interception or further monitoring. The system’s performance in these scenarios confirms its robustness and readiness for deployment in critical areas, such as airports or government facilities, where drone technology poses ongoing security challenges.
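The warning text shown in the field test can be assembled from the tracker output and the GLM-4V description with a simple formatter; the function name and field layout here are illustrative assumptions, not the client software's actual API.

```python
def format_warning(region, coords, description):
    """Compose the text alert shown in the client interface."""
    return (f"Detected a drone in the {region} area with coordinates "
            f"({coords[0]}, {coords[1]}); {description}")

msg = format_warning(
    region="upper-left",
    coords=(-500, 120),
    description=("the drone has a black body and red lights, "
                 "flying over a wooded background."),
)
```

Keeping the alert format fixed like this makes the text both human-readable and easy to parse downstream, e.g. for logging or automated escalation rules.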

In conclusion, our anti-drone early warning system represents a significant advancement in countering the threats posed by Unmanned Aerial Vehicles through the fusion of cutting-edge computer vision and multimodal AI. By leveraging YOLOv10 for high-speed detection, SORT for reliable tracking, and GLM-4V for semantic analysis, we have created a solution that excels in accuracy, speed, and adaptability. The system’s ability to switch between data transmission modes based on network conditions ensures its applicability in both high-bandwidth and communication-limited environments. Experimental results validate its effectiveness, with a 97.6% mAP and real-time performance under extreme conditions. As drone technology continues to evolve, our system provides a scalable and robust framework for enhancing low-altitude security, offering a proactive approach to mitigating risks associated with unauthorized UAV activities. Future work could focus on integrating additional sensors or expanding the multimodal capabilities to handle more complex scenarios, further solidifying the role of AI in safeguarding airspace.
