AU-YOLO: A Lightweight Multi-Scale Edge Information Fusion Algorithm for Low-Altitude UAV Detection

Unmanned Aerial Vehicles (UAVs) have developed rapidly in recent years and are now widely used in civilian, industrial, surveying, and many other sectors, opening new ways to raise productivity and efficiency. Their proliferation, however, has outpaced regulation. Unauthorized UAV flights can leak personal privacy and may intrude into no-fly zones, creating public safety hazards that directly threaten social governance and public security. Fast and accurate detection of rogue low-altitude UAVs, together with timely risk identification, has therefore become a critical research topic in low-altitude security. Small and medium-sized cities in particular are constrained by hardware deployment costs and computing resources, and urgently need a UAV detection solution that balances detection accuracy, inference efficiency, and deployment cost. With breakthroughs in deep learning for computer vision, vision-based object detection has become the mainstream technical route for UAV detection because of its low cost, fast detection, and strong adaptability.

In visual UAV detection, combining deep learning with computer vision techniques has become the mainstream solution, and many studies have explored various deep learning architectures. For instance, some work employs RetinaNet with multi-scale feature extraction to handle size variations, using Res2Net as the backbone to extract features from multiple receptive fields and designing a hybrid feature pyramid structure. Others combine super-resolution with Faster R-CNN to improve recall, enlarging the image by a factor of two before detection. Background subtraction combined with an enhanced YOLOv5s has also been attempted, though it relies heavily on static scenes and performs poorly in complex environments. Despite these advances, challenges remain: small UAVs are difficult to identify precisely, and detection is easily disturbed by complex backgrounds. To address these challenges, we propose AU-YOLO, a low-altitude UAV detection algorithm designed to detect small UAVs and UAV swarms in real-world scenarios.

Proposed Method

We base our model on YOLOv11 and introduce two key modules: the Multi-Scale Edge Selection (MSES) module and the Dynamic Multi-head Temporal-Spatial Attention (DMTSA) module. The overall architecture of AU-YOLO is organized as follows. The backbone extracts multi-scale features; in selected layers we replace the original Bottleneck blocks inside C3k2 with our MSES module to enhance edge information. The neck incorporates DMTSA modules to refine high-level features with dynamic activation and temporal attention. The head is unchanged from YOLOv11 and performs the final detection.

MSES Module

UAVs in images often exhibit low pixel density and weak edge characteristics. The original Bottleneck in C3k2 uses a fixed combination of 1×1 and 3×3 convolutions, with no design specific to UAV edge structures. As a result, feature extraction is limited to a single scale and edge details are severely lost, which degrades detection accuracy. To address this, we design the MSES module, inspired by the Dual Domain Selection Mechanism (DSM) originally proposed for image restoration. The MSES module consists of an Edge Feature Extraction (EFE) sub-component and a DSM sub-component. EFE uses multi-scale convolutions to capture edges at different scales, while DSM performs spatial and frequency selection to enhance the high-frequency edge features of UAVs.

The DSM mechanism comprises two parts: the Spatial Selection Module (SSM) and the Frequency Selection Module (FSM). Given an input feature map $$F_0 \in \mathbb{R}^{H \times W \times C}$$, SSM first performs channel-wise average pooling and a convolution to generate a spatial degradation location feature $$F_1 \in \mathbb{R}^{H \times W \times 1}$$. Then, $$F_1$$ is tiled to $$F_{\text{Tile}} \in \mathbb{R}^{H \times W \times C}$$, and a depthwise convolution on $$F_0$$ produces channel-level degradation features $$F_{\text{channel}}$$. The final spatial selection feature is obtained as:

$$F_g = F_{\text{channel}} \odot F_{\text{Tile}}$$

where $$\odot$$ denotes element-wise multiplication. This operation enhances the response of degraded regions (where UAVs may appear) while suppressing the background. FSM then refines the feature further by removing low-frequency components and amplifying high-frequency details. It performs channel-wise global average pooling to generate low-frequency features $$F_{\text{low}} \in \mathbb{R}^{1 \times 1 \times C}$$, broadcasts them to $$F_{\text{Broad}} \in \mathbb{R}^{H \times W \times C}$$, and combines the result with a high-frequency separation operation similar to EFE to produce the final enhanced feature $$F_{\text{sh}}$$.
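The SSM and FSM steps described above can be sketched in NumPy to make the tensor shapes concrete. This is only an illustrative sketch under stated assumptions: the 3×3 kernel size, the random placeholder weights `k_spatial` and `k_channel`, and the helper names are not taken from the paper, and the final combination that yields $$F_{\text{sh}}$$ is not specified in the text, so only the low/high-frequency split is shown.

```python
import numpy as np

def depthwise_conv3x3(x, kernels):
    """Depthwise 3x3 convolution with zero padding.
    x: (H, W, C); kernels: (3, 3, C), one filter per channel."""
    h, w, _ = x.shape
    padded = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros_like(x)
    for dy in range(3):
        for dx in range(3):
            out += padded[dy:dy + h, dx:dx + w, :] * kernels[dy, dx, :]
    return out

def spatial_selection(f0, k_spatial, k_channel):
    """SSM sketch: F_g = F_channel * F_Tile (element-wise)."""
    # Channel-wise average pooling: (H, W, C) -> (H, W, 1)
    pooled = f0.mean(axis=2, keepdims=True)
    # A convolution on the pooled map gives the location feature F_1
    f1 = depthwise_conv3x3(pooled, k_spatial)
    # Tile F_1 along the channel axis: (H, W, 1) -> (H, W, C)
    f_tile = np.tile(f1, (1, 1, f0.shape[2]))
    # Depthwise convolution on F_0 gives the channel-level feature
    f_channel = depthwise_conv3x3(f0, k_channel)
    return f_channel * f_tile

def frequency_split(f):
    """FSM-style split: the global average acts as the low-frequency part,
    the residual as the high-frequency part."""
    f_low = f.mean(axis=(0, 1), keepdims=True)   # (1, 1, C)
    f_broad = np.broadcast_to(f_low, f.shape)    # (H, W, C)
    f_high = f - f_broad
    return f_broad, f_high

# Usage with random placeholder weights (assumption, not trained values)
rng = np.random.default_rng(0)
f0 = rng.normal(size=(8, 8, 4))
k_spatial = rng.normal(size=(3, 3, 1))
k_channel = rng.normal(size=(3, 3, 4))
f_g = spatial_selection(f0, k_spatial, k_channel)
f_broad, f_high = frequency_split(f_g)
```

By construction the high-frequency part has zero spatial mean per channel, which is the sense in which the low-frequency component has been removed.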

The MSES module is designed to seamlessly replace the Bottleneck in C3k2 without changing the backbone structure. It retains the residual design to maintain feature continuity and uses only simple convolutions and pooling, keeping the model lightweight. The following table summarizes the configuration of MSES compared to the original Bottleneck.

Table 1: Comparison of Bottleneck and MSES module configurations

| Component | Bottleneck (Original) | MSES |
|---|---|---|
| Convolution layers | 1×1 + 3×3 | Multi-scale depthwise conv + 1×1 |
| Edge enhancement | None | EFE + DSM (spatial & frequency) |
| Residual connection | Yes | Yes |
| Parameter efficiency | Standard | Lightweight (depthwise) |

DMTSA Module

The original C2PSA module in YOLOv11 has two problems when applied to UAV detection. First, it uses static activation functions, which cannot adapt to difficult imaging conditions (e.g., backlight or overcast skies) in which UAVs exhibit weak features against strong background noise. Second, the original PSA is a single-frame spatial attention mechanism that cannot capture the motion trend of fast-moving UAVs, which causes attention to fail. To solve these problems, we propose the DMTSA module, which integrates Dynamic Tanh activation, a joint temporal attention mechanism combining TSSA and InfLLM-V2, and a lightweight Mona adapter.

Dynamic Tanh. We introduce Dynamic Tanh (DyT), an improved version of Tanh that adaptively adjusts activation based on input statistics:

$$\text{DyT}(x) = \tanh\left(\alpha \cdot \frac{x - \mu}{\sigma}\right) + \beta$$

where $$\alpha$$ and $$\beta$$ are learnable parameters controlling scaling and shifting, and $$\mu$$ and $$\sigma$$ are the mean and standard deviation of the input feature tensor. Because the input is standardized before the Tanh, weak responses under adverse conditions are rescaled relative to their own spread, so DyT dynamically amplifies relevant features. We place this activation in the main path of each residual structure in the DMTSA module.
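The DyT formula above can be written as a short NumPy function. This is a minimal sketch rather than the authors' implementation: it assumes the statistics are taken over the whole tensor, treats $$\alpha$$ and $$\beta$$ as plain scalars instead of trained parameters, and adds a small `eps` for numerical safety that does not appear in the formula.

```python
import numpy as np

def dyt(x, alpha=1.0, beta=0.0, eps=1e-6):
    """DyT(x) = tanh(alpha * (x - mu) / sigma) + beta.
    mu and sigma are computed over the whole input tensor (assumption)."""
    mu = x.mean()
    sigma = x.std()
    return np.tanh(alpha * (x - mu) / (sigma + eps)) + beta

# Weak responses with a tiny spread still use most of the Tanh range
weak = np.array([0.01, 0.02, 0.03])
y_weak = dyt(weak)

# The same pattern at 10x the magnitude gives almost the same output
y_strong = dyt(10.0 * weak)
```

The two outputs nearly coincide, which shows how standardization makes the activation insensitive to the overall magnitude of the features.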

STSSA (Sparse Temporal Token Statistics Self-Attention). To address the motion trend problem, we combine TSSA and InfLLM-V2. TSSA abandons the standard Query-Key-Value computation and instead computes second-order moment statistics within groups of tokens. These statistics characterize the intensity of the feature distribution within each group and avoid the $$\mathcal{O}(n^2)$$ complexity of pairwise attention, so the cost is linear in the number of tokens $$n$$. However, pure TSSA still operates in a global, static manner. We therefore integrate the “sparse screening” idea of InfLLM-V2: during the second-moment calculation, a parameter-free adaptive threshold based on feature similarity selects only the tokens whose similarity to the UAV target features exceeds the threshold. The remaining redundant tokens undergo simple average pooling. This reduces the complexity from $$\mathcal{O}(n)$$ to $$\mathcal{O}(k)$$, where $$k$$ is the number of effective tokens, and keeps the attention focused on motion-relevant features.
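The screening step described above might look roughly like the following NumPy sketch. Several choices here are assumptions made only for illustration: cosine similarity as the similarity measure, the mean similarity as the parameter-free threshold, a single token group, and the hypothetical `target` vector standing in for the UAV target features. The text does not specify any of these details.

```python
import numpy as np

def sparse_second_moment(tokens, target):
    """tokens: (n, d) token features; target: (d,) reference feature.
    Returns the per-channel second moment of the selected tokens,
    the average-pooled remainder, and the boolean selection mask."""
    t_unit = tokens / (np.linalg.norm(tokens, axis=1, keepdims=True) + 1e-8)
    q_unit = target / (np.linalg.norm(target) + 1e-8)
    sim = t_unit @ q_unit                 # cosine similarity per token
    threshold = sim.mean()                # parameter-free threshold (assumption)
    keep = sim > threshold
    if not keep.any():                    # degenerate case: keep every token
        keep[:] = True
    selected = tokens[keep]               # the k effective tokens
    second_moment = (selected ** 2).mean(axis=0)
    rest = tokens[~keep]
    pooled_rest = rest.mean(axis=0) if len(rest) else np.zeros(tokens.shape[1])
    return second_moment, pooled_rest, keep

# Toy example: 3 tokens aligned with the target, 5 pointing away from it
target = np.array([1.0, 0.0])
tokens = np.array([[2.0, 0.0], [1.0, 0.0], [3.0, 0.0]] + [[-1.0, 0.0]] * 5)
moment, pooled, keep = sparse_second_moment(tokens, target)
```

Only the three aligned tokens enter the second-moment computation, so the statistic costs work proportional to $$k = 3$$ rather than $$n = 8$$.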

Mona Adapter. TSSA tends to capture global common features but lacks sensitivity to local discriminative details. Mona (Multi-cognitive Visual Adapter) is a lightweight adapter that calibrates input features via scaled layer normalization:

$$x_{\text{norm}} = s_1 \cdot \|x_0\|_{\text{LN}} + s_2 \cdot x_0$$

where $$x_0$$ is the original feature, $$\|\cdot\|_{\text{LN}}$$ denotes layer normalization, and $$s_1, s_2$$ are learnable weights. After calibration, Mona applies multi-scale depthwise separable convolutions for detail extraction and then aggregates and upsamples features. The DMTSA module integrates these three components into a unified structure that replaces C2PSA. The following table outlines the components and their benefits.

Table 2: Components of DMTSA module

| Component | Function | Benefit for UAV Detection |
|---|---|---|
| Dynamic Tanh | Adaptive activation | Enhances weak features in adverse weather |
| STSSA (TSSA + InfLLM-V2) | Sparse temporal statistics attention | Captures motion trends, reduces complexity |
| Mona Adapter | Local detail calibration and enhancement | Preserves local discriminative features |
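The Mona calibration step $$x_{\text{norm}} = s_1 \cdot \|x_0\|_{\text{LN}} + s_2 \cdot x_0$$ can be illustrated with a few lines of NumPy. This sketch assumes a layer normalization over the last (channel) axis without affine parameters and uses scalar values for $$s_1$$ and $$s_2$$. The subsequent multi-scale depthwise convolutions of the adapter are not shown.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize over the last axis (no learnable affine terms here)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def mona_calibrate(x0, s1, s2):
    """x_norm = s1 * LN(x0) + s2 * x0."""
    return s1 * layer_norm(x0) + s2 * x0

rng = np.random.default_rng(1)
x0 = rng.normal(loc=3.0, scale=2.0, size=(5, 16))   # 5 tokens, 16 channels

identity = mona_calibrate(x0, s1=0.0, s2=1.0)    # passes x0 through unchanged
normalized = mona_calibrate(x0, s1=1.0, s2=0.0)  # pure layer normalization
blended = mona_calibrate(x0, s1=0.5, s2=0.5)     # mix of both paths
```

The two weights let training interpolate between the raw feature and its normalized version, which is the calibration role described in the text.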

Experiments and Results

Experimental Setup

We conducted all experiments on a single NVIDIA RTX 4060 GPU with Windows 11, Python 3.10, and CUDA 11.8. The training and evaluation hyperparameters are listed in the table below.

Table 3: Experimental hyperparameters

| Parameter | Value |
|---|---|
| Image size | 640 × 640 |
| Epochs | 400 |
| Optimizer | SGD |
| Momentum | 0.937 |
| Learning rate schedule | Cosine annealing |
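For readers unfamiliar with the cosine annealing schedule listed in Table 3, the standard formula is sketched below. The initial and final learning rates are not reported in the text, so `lr_max` and `lr_min` are hypothetical placeholder values. Only the 400-epoch horizon comes from the table.

```python
import math

def cosine_lr(epoch, total_epochs=400, lr_max=0.01, lr_min=0.0001):
    """Cosine annealing: decay smoothly from lr_max to lr_min.
    lr_max and lr_min are placeholder values (assumption)."""
    progress = epoch / total_epochs
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))

lr_start = cosine_lr(0)     # equals lr_max
lr_mid = cosine_lr(200)     # halfway between lr_max and lr_min
lr_end = cosine_lr(400)     # equals lr_min
```

The schedule decays slowly at first, fastest around the midpoint, and slowly again near the end of training.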

Dataset

We used the Anti-UAV-Detection dataset, a commonly used benchmark for UAV detection. The training set contains 5200 images, the validation set 2600 images, and the test set 2200 images. Samples cover a variety of scenarios with small UAVs, complex backgrounds, and challenging lighting conditions.

Evaluation Metrics

We report Precision (P), Recall (R), mean Average Precision at IoU threshold 0.5 (mAP50), number of parameters (Parameters), and computational complexity (GFLOPS). The formulas are defined as:

$$P = \frac{TP}{TP + FP}, \quad R = \frac{TP}{TP + FN}$$

$$\text{mAP50} = \frac{1}{C} \sum_{k=1}^{C} AP_{0.5}^k$$

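where $$C$$ is the number of classes (here $$C = 1$$, since UAV is the only class), $$AP_{0.5}^k$$ is the average precision of class $$k$$ at an IoU threshold of 0.5, and $$TP$$, $$FP$$, and $$FN$$ denote true positives, false positives, and false negatives.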
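The metric definitions above translate directly into code. The counts used in this small example are hypothetical and serve only to show the arithmetic. They are not results from the paper.

```python
def precision(tp, fp):
    """P = TP / (TP + FP); returns 0.0 when there are no detections."""
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    """R = TP / (TP + FN); returns 0.0 when there are no ground-truth objects."""
    return tp / (tp + fn) if (tp + fn) else 0.0

def map50(ap_per_class):
    """mAP50 = mean of the per-class AP values at IoU 0.5."""
    return sum(ap_per_class) / len(ap_per_class)

# Hypothetical counts: 80 correct detections, 10 false alarms, 20 misses
p = precision(tp=80, fp=10)    # 80 / 90
r = recall(tp=80, fn=20)       # 80 / 100

# With a single class, mAP50 is simply that class's AP
single_class_map = map50([0.85])
```

Because the dataset has one class, mAP50 here coincides with the AP of the UAV class.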

Ablation Studies

We performed two sets of ablation experiments. The first set studies the contribution of each component within the DMTSA module. We start from the baseline YOLOv11n and progressively add DyT, Mona, and STSSA. Results are shown in the table below.

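Table 4: Ablation study on DMTSA components

| Group | DyT | Mona | STSSA | P (%) | R (%) | mAP50 (%) | GFLOPS | Params (M) |
|---|---|---|---|---|---|---|---|---|
| 1 | | | | 92.36 | 75.74 | 84.75 | 6.3 | 2.582 |
| 2 | ✓ | | | 92.19 | 75.96 | 84.91 | 6.4 | 2.590 |
| 3 | ✓ | ✓ | | 92.44 | 76.33 | 85.04 | 6.5 | 2.643 |
| 4 | ✓ | ✓ | ✓ | 92.75 | 76.62 | 85.19 | 6.6 | 2.652 |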

The second ablation study evaluates the effect of MSES and DMTSA modules together. We compare baseline, baseline + MSES, baseline + DMTSA, and the full AU-YOLO. Results are as follows.

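Table 5: Ablation study of MSES and DMTSA modules

| Group | MSES | DMTSA | P (%) | R (%) | mAP50 (%) | GFLOPS | Params (M) |
|---|---|---|---|---|---|---|---|
| 1 | | | 92.36 | 75.74 | 84.75 | 6.3 | 2.582 |
| 2 | ✓ | | 92.93 | 76.99 | 84.69 | 6.5 | 2.573 |
| 3 | | ✓ | 92.75 | 76.62 | 85.19 | 6.6 | 2.652 |
| 4 | ✓ | ✓ | 93.26 | 76.84 | 85.83 | 6.8 | 2.661 |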

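The results show that the two modules contribute differently. MSES alone raises precision and recall (to 92.93% and 76.99%) and slightly reduces the parameter count, but mAP50 dips marginally from 84.75% to 84.69%. DMTSA alone brings a clear mAP50 gain to 85.19%. The full model achieves the highest mAP50 of 85.83%, 1.08 percentage points above the baseline, with only a modest increase in GFLOPS (6.3 to 6.8) and parameters (2.582 M to 2.661 M).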

Complex Scenario Evaluation

To verify performance in challenging real-world conditions, we selected two subsets from the test set: a “UAV swarm” scenario (300 images with multiple UAV drones) and an “adverse weather” scenario (200 images with snow, low light). Results are compared between YOLOv11n and AU-YOLO.

Table 6: Performance on complex scenarios

| Model | Scenario | P (%) | R (%) | mAP50 (%) |
|---|---|---|---|---|
| YOLOv11n | UAV swarm | 92.16 | 75.48 | 84.56 |
| YOLOv11n | Adverse weather | 89.23 | 70.49 | 76.65 |
| AU-YOLO | UAV swarm | 92.92 | 76.32 | 85.68 |
| AU-YOLO | Adverse weather | 92.57 | 69.47 | 77.47 |

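AU-YOLO outperforms the baseline on precision and mAP50 in both scenarios. The gain is largest in adverse weather, where precision improves by 3.34 percentage points (89.23% to 92.57%), which supports the effectiveness of the DMTSA module in handling weak features. Recall in adverse weather, however, drops by 1.02 percentage points (70.49% to 69.47%), so the mAP50 gain there is a more modest 0.82 points.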

Comparison with State-of-the-Art Models

We compared AU-YOLO with several lightweight object detectors: Rtmdet-tiny, SSD-MobileNetv2, Mamba-YOLO, Hyper-YOLO, and YOLO-world. All models were trained and tested under identical conditions.

Table 7: Comparison with lightweight object detectors

| Model | P (%) | R (%) | mAP50 (%) | GFLOPS | Params (M) |
|---|---|---|---|---|---|
| Rtmdet-tiny | 92.80 | 76.68 | 85.03 | 8.0 | 4.876 |
| SSD-MobileNetv2 | 90.12 | 71.13 | 81.81 | 6.0 | 3.542 |
| Mamba-YOLO | 93.10 | 77.22 | 85.53 | 12.3 | 5.662 |
| Hyper-YOLO | 93.28 | 78.32 | 85.63 | 9.5 | 3.620 |
| YOLO-world | 91.54 | 76.38 | 83.97 | 12.8 | 4.047 |
| AU-YOLO | 93.26 | 76.84 | 85.83 | 6.8 | 2.661 |

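AU-YOLO achieves the highest mAP50 (85.83%) with the fewest parameters (2.661 M) and the second-lowest GFLOPS (6.8, behind only SSD-MobileNetv2 at 6.0). Hyper-YOLO reaches slightly higher precision and recall (93.28% and 78.32%), but at 9.5 GFLOPS and 3.620 M parameters. AU-YOLO therefore offers a favorable trade-off between accuracy and efficiency, making it well suited to resource-constrained deployment.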
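One way to make the accuracy-efficiency trade-off explicit is a Pareto check over the Table 7 values. The sketch below treats mAP50 as a quantity to maximize and GFLOPS and parameter count as quantities to minimize. The choice of these three axes is an illustrative assumption, not part of the paper's evaluation protocol.

```python
# (mAP50 %, GFLOPS, Params M) taken from Table 7
models = {
    "Rtmdet-tiny": (85.03, 8.0, 4.876),
    "SSD-MobileNetv2": (81.81, 6.0, 3.542),
    "Mamba-YOLO": (85.53, 12.3, 5.662),
    "Hyper-YOLO": (85.63, 9.5, 3.620),
    "YOLO-world": (83.97, 12.8, 4.047),
    "AU-YOLO": (85.83, 6.8, 2.661),
}

def dominates(a, b):
    """a dominates b if it is no worse on every axis and better on at least one."""
    no_worse = a[0] >= b[0] and a[1] <= b[1] and a[2] <= b[2]
    better = a[0] > b[0] or a[1] < b[1] or a[2] < b[2]
    return no_worse and better

# Models that no other model dominates form the Pareto front
pareto_front = [
    name for name, metrics in models.items()
    if not any(dominates(other, metrics)
               for other_name, other in models.items() if other_name != name)
]
```

On these three axes, only SSD-MobileNetv2 (lowest GFLOPS) and AU-YOLO remain undominated. The other four detectors are each matched or beaten by AU-YOLO on all three.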

Deployment on Embedded Platform

Before deployment, we applied model compression using LAMP pruning and CWD knowledge distillation to reduce size. The results of the lightweighting process are shown below.

Table 8: Lightweighting results using pruning and distillation

| Stage | P (%) | R (%) | mAP50 (%) | GFLOPS | Params (M) |
|---|---|---|---|---|---|
| Baseline (AU-YOLO) | 92.36 | 75.74 | 84.75 | 6.3 | 2.582 |
| After pruning | 91.58 | 74.59 | 83.58 | 2.9 | 1.244 |
| After distillation | 92.06 | 75.15 | 84.53 | 2.9 | 1.244 |

Knowledge distillation recovers most of the performance lost to pruning: mAP50 drops by only 0.22 percentage points relative to the unpruned model, while GFLOPS fall by 54% and parameters by 52%. The compressed model was then deployed on a Jetson Orin NX. We converted the .pt file to .onnx format and tested on video sequences. Results are presented in the table below.

Table 9: Deployment results on different platforms

| Platform | P (%) | R (%) | mAP50 (%) | GFLOPS | Params (M) | FPS |
|---|---|---|---|---|---|---|
| RTX 4060 | 92.06 | 75.15 | 84.53 | 2.9 | 1.244 | 76 |
| Jetson Orin NX | 91.12 | 74.44 | 83.68 | 2.9 | 1.244 | 41 |

On the Jetson Orin NX, the compressed AU-YOLO runs at 41 FPS, which meets real-time detection requirements, with mAP50 still above 83%. This confirms its practical value for low-altitude no-fly zone surveillance, where low-cost embedded devices can be deployed without relying on high-performance servers.
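The compression ratios and frame times quoted in this section follow from the values in Tables 8 and 9. The short calculation below reproduces them. The helper name `reduction_pct` is ours, and the per-frame latency is simply the reciprocal of the reported FPS.

```python
def reduction_pct(before, after):
    """Relative reduction in percent."""
    return 100.0 * (before - after) / before

# Table 8: values before pruning vs. after distillation
gflops_cut = reduction_pct(6.3, 2.9)       # about 54%
params_cut = reduction_pct(2.582, 1.244)   # about 52%
map50_drop = 84.75 - 84.53                 # percentage points

# Table 9: per-frame latency in milliseconds from reported FPS
latency_rtx_ms = 1000.0 / 76
latency_jetson_ms = 1000.0 / 41
```

At 41 FPS each frame takes about 24.4 ms on the Jetson Orin NX, compared with about 13.2 ms on the RTX 4060.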

Conclusion

In this work, we proposed AU-YOLO, an improved model based on YOLOv11 and optimized specifically for low-altitude UAV detection. By introducing the MSES edge feature enhancement module and the DMTSA dynamic temporal-spatial feature refinement module, we improved detection accuracy and robustness in complex scenarios while keeping the model lightweight. Ablation studies validated the contribution of each module. Comparative experiments showed that AU-YOLO outperforms other state-of-the-art lightweight detectors in mAP50 and parameter efficiency. Finally, deployment on an embedded platform demonstrated its real-world engineering applicability. Future work will focus on optimizing the model for dynamic video sequences and extending it to more security scenarios.
