1. Introduction
The proliferation of Unmanned Aerial Vehicles (UAVs) has revolutionized sectors such as aerial photography and geographic surveying. However, unauthorized UAV operations (“black flights”) pose severe threats to national security and public safety. Traditional detection methods (acoustic, radio frequency, radar, and optical) face limitations in complex environments. Optical computer vision techniques offer cost-effective, flexible solutions, but their high bandwidth demands become a bottleneck in communication-constrained scenarios (e.g., border surveillance). This work addresses these gaps by integrating the YOLOv10 object detector and the GLM-4V multimodal large language model (LLM) into a real-time anti-UAV early warning system. Our system dynamically switches between structured data and semantic text transmission, optimizing performance across varying network conditions.

2. System Architecture
2.1 Hardware Configuration
The hardware comprises electro-optical sensors, a central computing unit, and a client interface (Table 1). Dual-spectrum cameras capture visible and infrared data, enabling 24/7 UAV monitoring.
Table 1: Core Hardware Specifications
| Component | Model | Key Parameters |
|---|---|---|
| Infrared Camera | IRay XCore FTII | Resolution: 1280×1024 |
| Visible-light Camera | Huanyu WeiShi UV-ZN4237 | Resolution: 1920×1080 |
| Central Computer | Custom-built | CPU: Intel i7-12700 |
2.2 Software Framework
The PyQt5-based client integrates:
- Input Modules: Real-time video/image/camera feeds.
- Processing Core: YOLOv10 detection + SORT tracking + GLM-4V semantic analysis (a per-frame sketch follows this list).
- Output Interface: Visual UAV annotations, textual warnings (position, attributes, environment).
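A minimal sketch of how the processing core might chain these modules per frame is shown below; the ultralytics-style YOLOv10 call, the SORT tracker object, and glm4v_describe() are illustrative assumptions rather than the authors' exact interfaces.

```python
# Sketch of the processing core: detect -> track -> describe.
# The YOLO interface, tracker.update() contract, and glm4v_describe()
# are assumed stand-ins, not the authors' exact implementation.
import numpy as np
from ultralytics import YOLO

model = YOLO("yolov10n.pt")  # assumed YOLOv10 weights file

def process_frame(frame, tracker, glm4v_describe):
    """One iteration of the detect/track/describe loop for a video frame."""
    result = model(frame, verbose=False)[0]
    # One row per detection: [x1, y1, x2, y2, confidence]
    dets = np.array([box.xyxy[0].tolist() + [float(box.conf)]
                     for box in result.boxes])
    tracks = tracker.update(dets)  # SORT-style rows: [x1, y1, x2, y2, id]
    warnings = []
    for x1, y1, x2, y2, track_id in tracks:
        crop = frame[int(y1):int(y2), int(x1):int(x2)]
        text = glm4v_describe(crop)  # semantic description for low-BW mode
        warnings.append((int(track_id), (x1, y1, x2, y2), text))
    return warnings
```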
3. Algorithmic Design
3.1 UAV Detection & Tracking
YOLOv10 achieves real-time UAV localization, replacing slower two-stage detectors (e.g., Faster R-CNN); its NMS-free architecture minimizes post-processing latency while preserving accuracy. Detections feed into the SORT algorithm for robust trajectory prediction:
- Kalman Filter Prediction (a runnable sketch of this cycle follows the list):
  $$\hat{x}_k^- = F\hat{x}_{k-1} + Bu_{k-1}, \qquad P_k^- = FP_{k-1}F^{T} + Q$$
  where $\hat{x}_k^-$ is the predicted state, $F$ the state transition matrix, $B$ the control matrix, and $P_k^-$ the predicted covariance.
- Measurement Update:
  $$K_k = P_k^- H^{T}\left(HP_k^- H^{T} + R\right)^{-1}, \qquad \hat{x}_k = \hat{x}_k^- + K_k\left(z_k - H\hat{x}_k^-\right), \qquad P_k = (I - K_k H)P_k^-$$
  Here, $K_k$ is the Kalman gain, $z_k$ the measurement, and $H$ the observation matrix.
- Hungarian Algorithm: Associates detections with tracked UAVs using IoU metrics.
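These steps translate directly into code. Below is a minimal sketch using numpy and scipy (assumed dependencies); the matrices $F$, $H$, $Q$, $R$ and the IoU matrix are taken as given, matching the equations above.

```python
# Minimal SORT-style predict/update/associate cycle.
import numpy as np
from scipy.optimize import linear_sum_assignment

def kf_predict(x, P, F, Q, B=None, u=None):
    """Prediction: x_k^- = F x_{k-1} + B u_{k-1};  P_k^- = F P_{k-1} F^T + Q."""
    x = F @ x + (B @ u if B is not None else 0)
    P = F @ P @ F.T + Q
    return x, P

def kf_update(x, P, z, H, R):
    """Measurement update with Kalman gain K_k."""
    S = H @ P @ H.T + R                    # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)         # K_k = P^- H^T (H P^- H^T + R)^{-1}
    x = x + K @ (z - H @ x)                # x_k = x^- + K_k (z_k - H x^-)
    P = (np.eye(P.shape[0]) - K @ H) @ P   # P_k = (I - K_k H) P^-
    return x, P

def associate(iou, iou_min=0.3):
    """Hungarian assignment of detections to tracks on an IoU matrix."""
    rows, cols = linear_sum_assignment(-iou)  # negate: maximize total IoU
    return [(r, c) for r, c in zip(rows, cols) if iou[r, c] >= iou_min]
```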
3.2 Multimodal Semantic Analysis
To reduce bandwidth, GLM-4V converts visual data into compressed text descriptions:
- Structured Prompt Template (assembly sketched after this list): “Detected Unmanned Aerial Vehicle at coordinates (x, y) relative to image center. UAV Description: [GLM-4V output from cropped UAV image]. Scene Context: [GLM-4V output using environmental template].”
- Dynamic Transmission:
- High bandwidth: Transmit bounding boxes + confidence scores.
- Low bandwidth: Transmit semantic text only.
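The sketch below illustrates how the structured warning text might be assembled; glm4v() is a placeholder for the GLM-4V API call, the UAV-description prompt is an assumed example, and the scene prompt uses the first template from Table 2.

```python
# Hypothetical assembly of the Section 3.2 warning text.
def build_warning(x, y, uav_crop, scene_image, glm4v):
    # glm4v() stands in for the real GLM-4V call; its signature is assumed.
    uav_desc = glm4v(uav_crop,
                     prompt="Describe this UAV's appearance.")  # assumed prompt
    scene_desc = glm4v(scene_image,
                       prompt="This view shows [sky description]. "
                              "Distant elements: [ ]. Near elements: [ ].")
    return (f"Detected Unmanned Aerial Vehicle at coordinates ({x}, {y}) "
            f"relative to image center. UAV Description: {uav_desc}. "
            f"Scene Context: {scene_desc}.")
```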
Table 2: Prompt Templates for Environmental Context
| Template | Purpose |
|---|---|
| “This view shows [sky description]. Distant elements: [ ]. Near elements: [ ].” | Standardizes background reporting |
| “The scene depicts [terrain type]. The sky appears [color/texture]. Critical landmarks: [ ].” | Enhances positional awareness |
4. Experimental Validation
4.1 Detection & Tracking Performance
Testing used a custom UAV dataset under Linux/PyTorch. YOLOv10 outperformed predecessors (Table 3):
Table 3: UAV Detection Algorithm Comparison
| Algorithm | mAP (%) | FPS | ΔmAP (vs. YOLOv10) | ΔFPS (vs. YOLOv10) |
|---|---|---|---|---|
| YOLOv10 | 97.6 | 68.96 | – | – |
| YOLOv8 | 88.3 | 59.5 | −9.3 | −9.46 |
| YOLOv7 | 82.1 | 50.8 | −15.5 | −18.16 |
| YOLOv4 | 83.9 | 28.0 | −13.7 | −40.96 |
Tracking stability was maintained for UAVs at 1 km distance moving at 23 m/s (30 FPS output).
4.2 Semantic Efficiency
- Text Accuracy: Structured prompts improved scene relevance by >80% vs. unstructured generation (Fig. 7).
- Bandwidth Reduction: Semantic text used 95% less bandwidth than image transmission (worked example below).
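As a rough worked example (payload sizes assumed for illustration, not measured): if a compressed 1080p frame occupies about 20 KB and its semantic description about 1 KB, the text payload is $1/20 = 0.05$ of the image payload, matching the 95% reduction reported above; Section 5 formalizes this ratio as $\Gamma$.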
4.3 Field Tests
Field tests used DJI Phantom 4 Pro and Mavic Air 2 UAVs as targets in campus environments:
- Day/Night operation: Successful detection at 1 km range.
- Latency: <100 ms from detection to warning generation.
5. Mathematical Analysis of Efficiency
The system’s bandwidth savings are quantifiable. Let $I$ represent raw image data, $T_s$ semantic text, and $B_d$ detection metadata (boxes, scores). The transmission cost ratio is:
$$\Gamma = \frac{\operatorname{size}(T_s)}{\operatorname{size}(I)} \approx 0.05 \quad \text{(Low-BW Mode)}, \qquad \Gamma = \frac{\operatorname{size}(B_d)}{\operatorname{size}(I)} \approx 0.15 \quad \text{(High-BW Mode)}$$
The decision to switch modes depends on the available bandwidth $BW_{\text{avail}}$:
$$\text{Mode} = \begin{cases}\text{Semantic} & \text{if } BW_{\text{avail}} < \Theta \\ \text{Structured} & \text{otherwise}\end{cases}$$
where $\Theta$ is a threshold (empirically set to 5 Mbps).
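A minimal sketch of this switching rule follows; how $BW_{\text{avail}}$ is measured and what the transport layer does with the payload are deployment details assumed here.

```python
# Bandwidth-driven mode switch from Section 5 (sketch).
THETA_MBPS = 5.0  # empirical threshold from the text

def select_mode(bw_avail_mbps: float) -> str:
    """Semantic text below the threshold, structured data otherwise."""
    return "semantic" if bw_avail_mbps < THETA_MBPS else "structured"

def transmit(result, bw_avail_mbps, send):
    # send() is an assumed transport callback; result fields are illustrative.
    if select_mode(bw_avail_mbps) == "semantic":
        send(result["warning_text"])             # text only (~5% of image size)
    else:
        send({"boxes": result["boxes"],          # bounding boxes + scores
              "scores": result["scores"]})       # (~15% of image size)
```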
6. Conclusion
This work presents a real-time anti-UAV early warning system that combines YOLOv10 for high-speed UAV detection with GLM-4V for bandwidth-efficient semantic analysis. Key innovations include:
- 97.6% mAP accuracy under extreme conditions (1 km range, 23 m/s UAV speed).
- Dynamic data transmission reducing bandwidth by 95% in constrained environments.
- Structured prompt templates ensuring actionable, context-aware warnings.
The system’s “visual detection → semantic analysis → dynamic transmission” pipeline offers a scalable solution for safeguarding critical infrastructure and border regions against unauthorized UAV activity. Future work will integrate radar and radio frequency sensors for multi-modal fusion.