Unmanned Aerial Vehicles (UAVs) have developed rapidly in recent years and are now widely used in civilian, industrial, surveying, and many other sectors, where they open new pathways for improving productivity and efficiency. However, their proliferation has outpaced the regulatory systems meant to govern them. Unauthorized UAV flights can easily leak personal privacy and may intrude into no-fly zones, creating public safety hazards and directly threatening social governance and public security. Fast and accurate detection of rogue low-altitude UAVs, together with timely risk identification, has therefore become a critical research topic in low-altitude security. Small and medium-sized cities in particular are constrained by hardware deployment costs and computing resources, and urgently need a UAV detection solution that balances detection accuracy, inference efficiency, and deployment cost. With breakthroughs in deep learning for computer vision, vision-based object detection has become the mainstream technical route for UAV detection because of its low cost, fast detection, and strong adaptability.

In vision-based UAV detection, combining deep learning with computer vision techniques has become the mainstream solution, and many studies have explored the task with different architectures. For instance, some work employs RetinaNet with multi-scale feature extraction to handle size variations, using Res2Net as the backbone to extract features from multiple receptive fields and designing a hybrid feature pyramid structure. Others combine super-resolution with Faster R-CNN to improve recall, enlarging the image by a factor of two before detection. Background subtraction combined with an enhanced YOLOv5s has also been attempted, though it relies heavily on static scenes and performs poorly in complex environments. Despite these advances, challenges remain: small UAVs are difficult to identify precisely, and detections are easily disturbed by complex backgrounds. To address these challenges, we propose AU-YOLO, a low-altitude UAV detection algorithm designed to detect small UAVs and swarms effectively in real-world scenarios.
Proposed Method
We base our model on YOLOv11 and introduce two key modules: the Multi-Scale Edge Selection (MSES) module and the Dynamic Multi-head Temporal-Spatial Attention (DMTSA) module. The overall architecture of AU-YOLO is organized as follows. The backbone extracts multi-scale features; in certain layers we replace the original C3k2 bottlenecks with our MSES module to enhance edge information. The neck incorporates DMTSA modules to refine high-level features with dynamic activation and temporal attention. The head is unchanged from YOLOv11 and produces the final detections.
MSES Module
UAVs in images often occupy few pixels and exhibit weak edge characteristics. The original Bottleneck in C3k2 uses a fixed combination of 1×1 and 3×3 convolutions, with no specialized design for the edge structures of UAVs. This yields single-dimensional feature extraction and severe loss of edge details, which degrades detection accuracy. To address this, we design the MSES module, inspired by the Dual Domain Selection Mechanism (DSM) originally proposed for image restoration. The MSES module consists of an Edge Feature Extraction (EFE) sub-component and a DSM sub-component. EFE uses multi-scale convolutions to capture edges at different scales, while DSM performs spatial and frequency selection to enhance the high-frequency edge features of UAVs.
The DSM mechanism comprises two parts: the Spatial Selection Module (SSM) and the Frequency Selection Module (FSM). Given an input feature map $$F_0 \in \mathbb{R}^{H \times W \times C}$$, SSM first performs channel-wise average pooling and a convolution to generate a spatial degradation location feature $$F_1 \in \mathbb{R}^{H \times W \times 1}$$. Then, $$F_1$$ is tiled to $$F_{\text{Tile}} \in \mathbb{R}^{H \times W \times C}$$, and a depthwise convolution on $$F_0$$ produces channel-level degradation features $$F_{\text{channel}}$$. The final spatial selection feature is obtained as:
$$F_g = F_{\text{channel}} \odot F_{\text{Tile}}$$
where $$\odot$$ denotes element-wise multiplication. This operation enhances the response of degraded regions (where UAVs may appear) while suppressing the background. FSM then refines the feature further by removing low-frequency components and amplifying high-frequency details. It applies global average pooling to each channel to generate low-frequency features $$F_{\text{low}} \in \mathbb{R}^{1 \times 1 \times C}$$, broadcasts them to $$F_{\text{Broad}} \in \mathbb{R}^{H \times W \times C}$$, and combines the result with a high-frequency separation operation similar to EFE to produce the final enhanced feature $$F_{\text{sh}}$$.
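The two selection steps can be illustrated with a minimal pure-Python sketch. It is not the MSES implementation: the learned convolution in SSM and the depthwise convolution that produces $$F_{\text{channel}}$$ are replaced by identity stand-ins, the high-frequency part is assumed to be the residual around the per-channel mean, and the names `spatial_selection` and `frequency_split` are hypothetical. Features are stored channel-first for convenience.

```python
from typing import List, Tuple

Channel = List[List[float]]  # one H x W map
Feature = List[Channel]      # C channels (channel-first layout, for convenience)


def spatial_selection(f0: Feature) -> Feature:
    """Sketch of F_g = F_channel * F_Tile with identity stand-ins for the convs."""
    c, h, w = len(f0), len(f0[0]), len(f0[0][0])
    # Channel-wise average pooling gives the single-channel map F_1 (H x W).
    f1 = [[sum(f0[k][i][j] for k in range(c)) / c for j in range(w)]
          for i in range(h)]
    # Tiling F_1 over C channels is implicit: every channel reuses f1.
    # ASSUMPTION: F_channel is taken to be f0 itself (no depthwise conv).
    return [[[f0[k][i][j] * f1[i][j] for j in range(w)] for i in range(h)]
            for k in range(c)]


def frequency_split(f: Feature) -> Tuple[List[float], Feature]:
    """Return (F_low, high): per-channel global mean and the residual around it."""
    low = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in f]
    # ASSUMPTION: high-frequency content = feature minus broadcast low-frequency mean.
    high = [[[v - low[k] for v in row] for row in ch] for k, ch in enumerate(f)]
    return low, high


if __name__ == "__main__":
    f0 = [[[1.0, 2.0], [3.0, 4.0]], [[3.0, 2.0], [1.0, 0.0]]]
    print(spatial_selection(f0))
    print(frequency_split(f0))
```

In this toy setting the high-frequency residual of each channel sums to zero, which is one way to see that the global mean carries only the lowest-frequency content.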
The MSES module is designed to seamlessly replace the Bottleneck in C3k2 without changing the backbone structure. It retains the residual design to maintain feature continuity and uses only simple convolutions and pooling, keeping the model lightweight. The following table summarizes the configuration of MSES compared to the original Bottleneck.
| Component | Bottleneck (Original) | MSES |
|---|---|---|
| Convolution layers | 1×1 + 3×3 | Multi-scale depthwise conv + 1×1 |
| Edge enhancement | None | EFE + DSM (spatial & frequency) |
| Residual connection | Yes | Yes |
| Parameter efficiency | Standard | Lightweight (depthwise) |
DMTSA Module
The original C2PSA module in YOLOv11 suffers from two problems when applied to UAV detection. First, it uses static activation functions, which fail to adapt to difficult imaging conditions (e.g., backlighting or overcast skies) where UAVs exhibit weak features against strong background noise. Second, the original PSA is a single-frame spatial attention that cannot capture the motion trend of fast-moving UAVs, so attention can fail on such targets. To solve these problems, we propose the DMTSA module, which integrates Dynamic Tanh activation, a joint temporal attention mechanism combining TSSA and InfLLM-V2, and a lightweight Mona adapter.
Dynamic Tanh. We introduce Dynamic Tanh (DyT), an improved version of Tanh that adaptively adjusts activation based on input statistics:
$$\text{DyT}(x) = \tanh\left(\alpha \cdot \frac{x - \mu}{\sigma}\right) + \beta$$
where $$\alpha$$ and $$\beta$$ are learnable parameters controlling scaling and shifting, and $$\mu$$ and $$\sigma$$ are the mean and standard deviation of the input feature tensor. Because the input is standardized before the nonlinearity, weak responses under adverse conditions are amplified rather than left near zero. We place this activation in the main path of each residual structure in the DMTSA module.
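As a hedged illustration of the DyT formula above, the following sketch evaluates it on a flat list of values. The statistics are taken over the whole input, the small `eps` added to $$\sigma$$ is an assumption for numerical safety, and $$\alpha$$ and $$\beta$$ are plain floats here rather than learned parameters.

```python
import math
from typing import List


def dyt(x: List[float], alpha: float = 1.0, beta: float = 0.0,
        eps: float = 1e-6) -> List[float]:
    """DyT(x) = tanh(alpha * (x - mu) / sigma) + beta over a flat feature list."""
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)  # population variance
    sigma = math.sqrt(var) + eps                  # ASSUMPTION: eps avoids division by zero
    return [math.tanh(alpha * (v - mu) / sigma) + beta for v in x]


if __name__ == "__main__":
    strong = dyt([1.0, 2.0, 3.0])
    weak = dyt([0.01, 0.02, 0.03])  # same pattern at 1/100 of the magnitude
    print(strong)
    print(weak)
```

The two inputs produce nearly identical outputs, which shows how standardizing by $$\mu$$ and $$\sigma$$ lets a weak response pattern reach the same activation range as a strong one.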
STSSA (Sparse Temporal Token Statistics Self-Attention). To address the motion-trend problem, we combine TSSA and InfLLM-V2. TSSA abandons the standard Query-Key-Value computation and instead computes second-order moment statistics within groups of tokens. This statistic characterizes the intensity of the feature distribution within each group and avoids the $$\mathcal{O}(n^2)$$ complexity of pairwise attention. However, pure TSSA still operates in a global, static manner. We therefore integrate the “sparse screening” idea of InfLLM-V2: during the second-moment calculation, a parameter-free adaptive threshold based on feature similarity selects only the tokens whose similarity to the UAV target features exceeds the threshold, and the remaining redundant tokens undergo simple average pooling. This reduces the cost of the second-moment computation from $$\mathcal{O}(n)$$ to $$\mathcal{O}(k)$$, where $$k$$ is the number of effective tokens, and keeps the attention focused on motion-relevant features.
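The screening step can be sketched as follows. This is a simplified illustration, not the STSSA implementation: cosine similarity to a single reference vector stands in for "similarity to the target features", the mean similarity is assumed as the parameter-free threshold, and the function name `sparse_second_moment` is hypothetical.

```python
import math
from typing import List, Tuple

Token = List[float]


def cosine(a: Token, b: Token) -> float:
    """Cosine similarity, defined as 0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def sparse_second_moment(tokens: List[Token],
                         target: Token) -> Tuple[Token, Token, int]:
    """Return (second moment of kept tokens, mean of the rest, k)."""
    sims = [cosine(t, target) for t in tokens]
    threshold = sum(sims) / len(sims)  # ASSUMPTION: mean similarity as threshold
    kept = [t for t, s in zip(tokens, sims) if s > threshold]
    rest = [t for t, s in zip(tokens, sims) if s <= threshold]
    dim = len(target)
    # Per-channel second-order moment, computed only over the k kept tokens.
    moment = ([sum(t[d] ** 2 for t in kept) / len(kept) for d in range(dim)]
              if kept else [0.0] * dim)
    # Redundant tokens are collapsed by simple average pooling.
    pooled = ([sum(t[d] for t in rest) / len(rest) for d in range(dim)]
              if rest else [0.0] * dim)
    return moment, pooled, len(kept)


if __name__ == "__main__":
    tokens = [[1.0, 0.0], [2.0, 0.0], [0.0, 1.0], [0.0, 3.0]]
    print(sparse_second_moment(tokens, target=[1.0, 0.0]))
```

Only the `k` kept tokens enter the second-moment sum, which is where the reduction from $$n$$ to $$k$$ terms comes from in this sketch.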
Mona Adapter. TSSA tends to capture global common features but lacks sensitivity to local discriminative details. Mona (Multi-cognitive Visual Adapter) is a lightweight adapter that calibrates input features via scaled layer normalization:
$$x_{\text{norm}} = s_1 \cdot \|x_0\|_{\text{LN}} + s_2 \cdot x_0$$
where $$x_0$$ is the original feature, $$\|\cdot\|_{\text{LN}}$$ denotes layer normalization, and $$s_1, s_2$$ are learnable weights. After calibration, Mona applies multi-scale depthwise separable convolutions for detail extraction and then aggregates and upsamples features. The DMTSA module integrates these three components into a unified structure that replaces C2PSA. The following table outlines the components and their benefits.
| Component | Function | Benefit for UAV Detection |
|---|---|---|
| Dynamic Tanh | Adaptive activation | Enhances weak features in adverse weather |
| STSSA (TSSA + InfLLM-V2) | Sparse temporal statistics attention | Captures motion trends, reduces complexity |
| Mona Adapter | Local detail calibration and enhancement | Preserves local discriminative features |
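The Mona calibration step $$x_{\text{norm}} = s_1 \cdot \|x_0\|_{\text{LN}} + s_2 \cdot x_0$$ can be sketched on a single feature vector. The layer normalization below omits any affine parameters and uses an assumed `eps`; $$s_1$$ and $$s_2$$ are passed as plain floats rather than learned weights.

```python
import math
from typing import List


def layer_norm(x: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize a vector to zero mean and (approximately) unit variance."""
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    return [(v - mu) / math.sqrt(var + eps) for v in x]


def mona_calibrate(x0: List[float], s1: float, s2: float) -> List[float]:
    """x_norm = s1 * LN(x0) + s2 * x0 (scaled layer norm plus scaled identity)."""
    normed = layer_norm(x0)
    return [s1 * n + s2 * v for n, v in zip(normed, x0)]


if __name__ == "__main__":
    x0 = [1.0, 2.0, 3.0]
    print(mona_calibrate(x0, s1=0.0, s2=1.0))  # identity path only
    print(mona_calibrate(x0, s1=1.0, s2=0.0))  # normalized path only
```

The two weights let the adapter interpolate between passing the feature through unchanged and fully normalizing it.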
Experiments and Results
Experimental Setup
We conducted all experiments on a single NVIDIA RTX 4060 GPU with Windows 11, Python 3.10, and CUDA 11.8. The training and evaluation hyperparameters are listed in the table below.
| Parameter | Value |
|---|---|
| Image size | 640 × 640 |
| Epochs | 400 |
| Optimizer | SGD |
| Momentum | 0.937 |
| Learning rate schedule | Cosine annealing |
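For reference, a cosine annealing schedule over the 400 epochs listed above can be sketched as below. The initial and final learning rates are not given in the table, so `lr_max` and `lr_min` are hypothetical placeholder values, and the function name `cosine_lr` is illustrative.

```python
import math


def cosine_lr(epoch: int, total_epochs: int = 400,
              lr_max: float = 0.01, lr_min: float = 0.0001) -> float:
    """Standard cosine annealing from lr_max (epoch 0) to lr_min (last epoch).

    ASSUMPTION: lr_max and lr_min are placeholders, not values from the paper.
    """
    progress = epoch / total_epochs
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for e in (0, 100, 200, 300, 400):
        print(e, round(cosine_lr(e), 6))
```

The rate falls slowly at first, fastest around the midpoint, and flattens out again near the end of training.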
Dataset
We used the Anti-UAV-Detection dataset, a commonly used benchmark for UAV detection. The training set contains 5200 images, the validation set 2600 images, and the test set 2200 images. Samples cover a variety of scenarios with small UAVs, complex backgrounds, and challenging lighting conditions.
Evaluation Metrics
We report Precision (P), Recall (R), mean Average Precision at IoU threshold 0.5 (mAP50), number of parameters (Parameters), and computational complexity (GFLOPS). The formulas are defined as:
$$P = \frac{TP}{TP + FP}, \quad R = \frac{TP}{TP + FN}$$
$$\text{mAP50} = \frac{1}{C} \sum_{k=1}^{C} AP_{0.5}^k$$
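where $$TP$$, $$FP$$, and $$FN$$ are the numbers of true positives, false positives, and false negatives, and $$C$$ is the number of classes (here $$C = 1$$, the UAV class).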
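The metric definitions can be sketched directly in code. The box format `(x1, y1, x2, y2)` and the function names are assumptions for illustration; a detection counts as a true positive at mAP50 when its IoU with a ground-truth box is at least 0.5.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed corner format


def precision(tp: int, fp: int) -> float:
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp: int, fn: int) -> float:
    return tp / (tp + fn) if tp + fn else 0.0


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def map50(ap_per_class: Sequence[float]) -> float:
    """Mean of the per-class AP values computed at IoU threshold 0.5."""
    return sum(ap_per_class) / len(ap_per_class)


if __name__ == "__main__":
    print(precision(tp=8, fp=2), recall(tp=8, fn=8))
    print(iou((0, 0, 2, 2), (1, 0, 3, 2)))
    print(map50([0.8583]))  # with a single class, mAP50 equals that class's AP
```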
Ablation Studies
We performed two sets of ablation experiments. The first set studies the contribution of each component within the DMTSA module. We start from the baseline YOLOv11n and progressively add DyT, Mona, and STSSA. Results are shown in the table below.
| Group | DyT | Mona | STSSA | P (%) | R (%) | mAP50 (%) | GFLOPS | Params (M) |
|---|---|---|---|---|---|---|---|---|
| 1 | — | — | — | 92.36 | 75.74 | 84.75 | 6.3 | 2.582 |
| 2 | ✓ | — | — | 92.19 | 75.96 | 84.91 | 6.4 | 2.590 |
| 3 | ✓ | ✓ | — | 92.44 | 76.33 | 85.04 | 6.5 | 2.643 |
| 4 | ✓ | ✓ | ✓ | 92.75 | 76.62 | 85.19 | 6.6 | 2.652 |
The second ablation study evaluates the effect of MSES and DMTSA modules together. We compare baseline, baseline + MSES, baseline + DMTSA, and the full AU-YOLO. Results are as follows.
| Group | MSES | DMTSA | P (%) | R (%) | mAP50 (%) | GFLOPS | Params (M) |
|---|---|---|---|---|---|---|---|
| 1 | — | — | 92.36 | 75.74 | 84.75 | 6.3 | 2.582 |
| 2 | ✓ | — | 92.93 | 76.99 | 84.69 | 6.5 | 2.573 |
| 3 | — | ✓ | 92.75 | 76.62 | 85.19 | 6.6 | 2.652 |
| 4 | ✓ | ✓ | 93.26 | 76.84 | 85.83 | 6.8 | 2.661 |
The results show that the two modules are complementary. MSES alone raises precision and recall (to 92.93% and 76.99%) but slightly lowers mAP50 (84.69% versus 84.75%), while DMTSA alone raises mAP50 to 85.19%. The full model achieves the highest mAP50 of 85.83% with only a modest increase in GFLOPS (6.3 to 6.8) and parameters (2.582 M to 2.661 M).
Complex Scenario Evaluation
To verify performance in challenging real-world conditions, we selected two subsets from the test set: a “UAV swarm” scenario (300 images containing multiple UAVs) and an “adverse weather” scenario (200 images with snow or low light). Results are compared between YOLOv11n and AU-YOLO.
| Model | Scenario | P (%) | R (%) | mAP50 (%) |
|---|---|---|---|---|
| YOLOv11n | UAV swarm | 92.16 | 75.48 | 84.56 |
| YOLOv11n | Adverse weather | 89.23 | 70.49 | 76.65 |
| AU-YOLO | UAV swarm | 92.92 | 76.32 | 85.68 |
| AU-YOLO | Adverse weather | 92.57 | 69.47 | 77.47 |
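AU-YOLO outperforms the baseline on all three metrics in the swarm scenario. In adverse weather, precision improves by more than 3 percentage points (89.23% to 92.57%) and mAP50 by 0.82 points, although recall drops by about 1 point (70.49% to 69.47%). The precision gain is consistent with the intended role of the DMTSA module in handling weak features.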
Comparison with State-of-the-Art Models
We compared AU-YOLO with several lightweight object detectors: Rtmdet-tiny, SSD-MobileNetv2, Mamba-YOLO, Hyper-YOLO, and YOLO-world. All models were trained and tested under identical conditions.
| Model | P (%) | R (%) | mAP50 (%) | GFLOPS | Params (M) |
|---|---|---|---|---|---|
| Rtmdet-tiny | 92.80 | 76.68 | 85.03 | 8.0 | 4.876 |
| SSD-MobileNetv2 | 90.12 | 71.13 | 81.81 | 6.0 | 3.542 |
| Mamba-YOLO | 93.10 | 77.22 | 85.53 | 12.3 | 5.662 |
| Hyper-YOLO | 93.28 | 78.32 | 85.63 | 9.5 | 3.620 |
| YOLO-world | 91.54 | 76.38 | 83.97 | 12.8 | 4.047 |
| AU-YOLO | 93.26 | 76.84 | 85.83 | 6.8 | 2.661 |
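AU-YOLO achieves the highest mAP50 (85.83%) with the fewest parameters (2.661 M) and the second-lowest GFLOPS (6.8, behind only SSD-MobileNetv2 at 6.0). Hyper-YOLO attains slightly higher precision (93.28%) and recall (78.32%), but at 9.5 GFLOPS and 3.620 M parameters. This indicates a favorable trade-off between accuracy and efficiency, making AU-YOLO well suited to resource-constrained deployment.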
Deployment on Embedded Platform
Before deployment, we applied model compression using LAMP pruning and CWD knowledge distillation to reduce size. The results of the lightweighting process are shown below.
| Stage | P (%) | R (%) | mAP50 (%) | GFLOPS | Params (M) |
|---|---|---|---|---|---|
| Baseline (AU-YOLO) | 92.36 | 75.74 | 84.75 | 6.3 | 2.582 |
| After pruning | 91.58 | 74.59 | 83.58 | 2.9 | 1.244 |
| After distillation | 92.06 | 75.15 | 84.53 | 2.9 | 1.244 |
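The relative savings implied by the table above can be checked with a few lines of arithmetic. The figures are copied from the first and last rows of the table; the helper name `relative_reduction` is illustrative.

```python
def relative_reduction(before: float, after: float) -> float:
    """Fraction by which a quantity shrinks going from `before` to `after`."""
    return (before - after) / before


# Values copied from the first and last rows of the compression table.
gflops_cut = relative_reduction(6.3, 2.9)
params_cut = relative_reduction(2.582, 1.244)
map50_drop = 84.75 - 84.53  # in percentage points

if __name__ == "__main__":
    print(f"GFLOPS reduced by {gflops_cut:.1%}")
    print(f"Parameters reduced by {params_cut:.1%}")
    print(f"mAP50 drop: {map50_drop:.2f} points")
```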
Knowledge distillation recovers most of the performance lost to pruning: relative to the unpruned row, mAP50 drops by only 0.22 percentage points while GFLOPS fall by 54% and parameters by 52%. The compressed model was then deployed on a Jetson Orin NX. We exported the PyTorch .pt checkpoint to ONNX (.onnx) format and tested it on video sequences. Results are presented in the table below.
| Platform | P (%) | R (%) | mAP50 (%) | GFLOPS | Params (M) | FPS |
|---|---|---|---|---|---|---|
| RTX 4060 | 92.06 | 75.15 | 84.53 | 2.9 | 1.244 | 76 |
| Jetson Orin NX | 91.12 | 74.44 | 83.68 | 2.9 | 1.244 | 41 |
On the Jetson Orin NX, the compressed AU-YOLO runs at 41 FPS, which is sufficient for real-time detection, with mAP50 still above 83%. This supports its practical value for low-altitude no-fly zone surveillance, where low-cost embedded devices can be deployed without relying on high-performance servers.
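The frame rates reported above translate into per-frame latencies as sketched below. The 30 FPS real-time target is an assumed convention for illustration and is not a figure taken from the experiments.

```python
def frame_time_ms(fps: float) -> float:
    """Average time available per frame, in milliseconds."""
    return 1000.0 / fps


def meets_budget(fps: float, target_fps: float = 30.0) -> bool:
    """ASSUMPTION: 30 FPS is used here as a nominal real-time target."""
    return fps >= target_fps


if __name__ == "__main__":
    for name, fps in (("RTX 4060", 76.0), ("Jetson Orin NX", 41.0)):
        print(name, round(frame_time_ms(fps), 1), "ms/frame", meets_budget(fps))
```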
Conclusion
In this work, we proposed AU-YOLO, an improved model based on YOLOv11 and optimized specifically for low-altitude UAV detection. By introducing the MSES edge feature enhancement module and the DMTSA dynamic temporal-spatial feature refinement module, we improved detection accuracy and robustness in complex scenarios while keeping the model lightweight. Ablation studies validated the contribution of each module. Comparative experiments showed that AU-YOLO outperforms the other lightweight detectors tested in terms of mAP50 and parameter count. Finally, deployment on an embedded platform demonstrated its real-world engineering applicability. Future work will focus on optimizing the model for dynamic video sequences and extending it to more security scenarios.
