The rapid advancement of drone technology has introduced significant safety and security challenges, making effective anti-drone detection a critical task. Traditional detection methods like radar, radio frequency, and acoustic sensing are often hampered by high cost, environmental sensitivity, and limited effectiveness against small, fast-moving, or low-signature drones. Computer vision-based detection offers a promising alternative, but it faces substantial hurdles in real-world anti-drone scenarios, including complex backgrounds, dramatic lighting variations, and the inherently small size of distant UAV targets, which frequently lead to missed detections, false alarms, and unsatisfactory accuracy.
To address these challenges, this paper proposes PWM-YOLOv11, a lightweight and accurate object detection algorithm tailored for anti-drone missions. Built upon the efficient YOLOv11n architecture, our model introduces three key innovations: a Partial Series Multi-scale Fusion (PSMF) module to enhance feature extraction for blurred or small targets, a Wavelet Pooling (WaveletPool) convolution to suppress background noise while preserving crucial details, and a Modified Bidirectional Feature Pyramid Network (MBiFPN) for efficient multi-scale feature fusion. Extensive experiments on the TIB-UAV dataset demonstrate that PWM-YOLOv11n achieves a superior balance between accuracy and efficiency, making it highly suitable for deployment in real-time anti-drone systems.

1. Introduction
The proliferation of unmanned aerial vehicles (UAVs) has revolutionized fields from agriculture to surveillance. However, the misuse of drones for unauthorized surveillance, contraband delivery, or disrupting secure airspace has escalated, necessitating robust anti-drone countermeasures. Reliable detection is the foundational step in any anti-drone system. While traditional sensor-based methods exist, vision-based detection using deep learning provides a passive, cost-effective, and information-rich solution. Yet, deploying such models in practical anti-drone settings is non-trivial. Drones often appear as tiny, low-resolution objects against cluttered skies or complex terrains. Their high speed can cause motion blur, and changing illumination can alter their appearance drastically. These factors severely degrade the performance of standard object detectors, which may suffer from low precision, high false-negative rates, and excessive computational demands unsuitable for edge devices.
Recent years have seen numerous attempts to adapt object detectors like the YOLO series for anti-drone tasks. Common improvements involve integrating attention mechanisms, designing specialized feature fusion necks, or employing advanced loss functions. While these methods offer gains, they often struggle to simultaneously achieve high accuracy on challenging small targets and maintain a lightweight profile for real-time processing. Our work directly tackles this trade-off. We present PWM-YOLOv11, an optimized model that significantly enhances detection capability for drones under adverse conditions while reducing parameter count. The core of our approach lies in re-engineering critical components of the network to be more attuned to the specific demands of anti-drone target detection.
2. Methodology: The PWM-YOLOv11 Architecture
The YOLOv11 framework provides a strong baseline with its efficient CSPDarknet-like backbone and decoupled head. However, for the demanding task of anti-drone detection, we identify areas for targeted improvement. Our proposed PWM-YOLOv11n model modifies the backbone, neck, and feature processing stages as illustrated below. The overarching goal is to sharpen the model’s focus on discriminative drone features, improve its multi-scale reasoning, and streamline its computational graph.
2.1 Partial Series Multi-scale Fusion (PSMF) Module
Detecting distant drones often results in low-resolution targets where edge details are blurred. The original C3k2 module in YOLOv11, while efficient, may lose fine-grained details through its bottleneck structure and aggressive channel reduction. To mitigate this, we propose the PSMF module as a direct replacement. The PSMF module abandons the standard bottleneck and instead employs a series of partial convolutions with different kernel sizes operating on separate channel groups.
Let an input feature map be denoted as $F_{in} \in \mathbb{R}^{H \times W \times C}$. The PSMF module first splits the channels into four groups. Three groups are processed in parallel by partial convolutions with kernels of size $3\times3$, $5\times5$, and $7\times7$, respectively, to capture features at different receptive fields. The fourth group remains unchanged to preserve the original information stream. The outputs of these parallel paths are concatenated:
$$ F_{cat} = \text{Concat}(\text{Conv}_{3\times3}(F_{in}^1), \text{Conv}_{5\times5}(F_{in}^2), \text{Conv}_{7\times7}(F_{in}^3), F_{in}^4) $$
where $F_{in}^1, F_{in}^2, F_{in}^3, F_{in}^4$ are the channel-wise splits of $F_{in}$.
A subsequent $1\times1$ convolution is applied to $F_{cat}$ to fuse the multi-scale information and adjust channel dimensions:
$$ F_{fused} = \text{Conv}_{1\times1}(F_{cat}) $$
Finally, a residual connection adds the processed features back to the original input, ensuring no loss of foundational information:
$$ F_{out} = F_{fused} + F_{in} $$
This design enriches the feature representation with multi-scale contextual details crucial for identifying small, blurry drones in complex anti-drone scenarios, without a significant parameter increase.
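For concreteness, the following PyTorch sketch shows one possible realization of the PSMF computation described above. The equal four-way channel split follows the equations; the class interface and the omission of normalization and activation layers are our assumptions, and the paper's exact implementation may differ.

```python
import torch
import torch.nn as nn

class PSMF(nn.Module):
    """Partial Series Multi-scale Fusion (sketch).

    Splits channels into four groups: three are convolved with 3x3, 5x5,
    and 7x7 kernels to capture multi-scale context; the fourth passes
    through unchanged. A 1x1 conv fuses the concatenation, and a residual
    connection preserves the original input stream.
    """

    def __init__(self, channels: int):
        super().__init__()
        assert channels % 4 == 0, "channel count must be divisible by 4"
        g = channels // 4
        self.conv3 = nn.Conv2d(g, g, 3, padding=1)
        self.conv5 = nn.Conv2d(g, g, 5, padding=2)
        self.conv7 = nn.Conv2d(g, g, 7, padding=3)
        self.fuse = nn.Conv2d(channels, channels, 1)  # 1x1 fusion conv

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x1, x2, x3, x4 = torch.chunk(x, 4, dim=1)      # channel-wise split
        cat = torch.cat([self.conv3(x1), self.conv5(x2),
                         self.conv7(x3), x4], dim=1)   # F_cat
        return self.fuse(cat) + x                      # F_out = F_fused + F_in
```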
2.2 Wavelet Pooling (WaveletPool) Convolution
Standard pooling operations (e.g., max-pooling) achieve downsampling but inevitably discard high-frequency details like edges and textures—information vital for distinguishing small drones from background clutter. To perform smarter downsampling that retains critical frequency components, we integrate Wavelet Pooling into the backbone. Inspired by the 2D Discrete Wavelet Transform (DWT), this layer decomposes a feature map into four sub-bands (LL, LH, HL, HH) representing different frequency orientations.
Formally, for an input feature map $X$, the 2D DWT applies combinations of a 1D low-pass filter $l$ and a 1D high-pass filter $h$ along the rows and columns, yielding four 2D filters $h_{ll}$, $h_{lh}$, $h_{hl}$, and $h_{hh}$ and four sub-sampled outputs:
$$ \text{LL} = (h_{ll} * X)\downarrow_2, \quad \text{LH} = (h_{lh} * X)\downarrow_2 $$
$$ \text{HL} = (h_{hl} * X)\downarrow_2, \quad \text{HH} = (h_{hh} * X)\downarrow_2 $$
where $*$ denotes convolution and $\downarrow_2$ denotes downsampling by a factor of 2. In our WaveletPool layer, these operations are implemented with fixed, non-learnable filters. The LL component carries the approximate (low-frequency) information, while LH, HL, and HH contain horizontal, vertical, and diagonal detail (high-frequency) information, respectively.
By concatenating these sub-bands, the layer expands the channel dimension by a factor of four while halving each spatial dimension. This process preserves fine details that would be lost in traditional pooling, enhancing the model’s ability to describe drone shapes and edges amidst noise. This is particularly beneficial in anti-drone applications where targets are subtle and backgrounds are noisy.
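A single level of this decomposition can be sketched as a stride-2 grouped convolution with fixed filters, as below. The choice of the Haar wavelet is an assumption for illustration (the paper does not name the wavelet basis), and the output interleaves the four sub-bands per input channel.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HaarWaveletPool(nn.Module):
    """Wavelet pooling sketch using fixed (non-learnable) Haar filters.

    One level of a 2D Haar DWT, implemented as a stride-2 grouped
    convolution. Output has 4x the input channels (LL, LH, HL, HH
    sub-bands) at half the spatial resolution in each dimension.
    """

    def __init__(self):
        super().__init__()
        l = torch.tensor([1.0, 1.0]) / 2 ** 0.5   # 1D low-pass Haar filter
        h = torch.tensor([1.0, -1.0]) / 2 ** 0.5  # 1D high-pass Haar filter
        kernels = torch.stack([
            torch.outer(l, l),  # LL: approximation (low-frequency)
            torch.outer(l, h),  # LH: detail sub-band
            torch.outer(h, l),  # HL: detail sub-band
            torch.outer(h, h),  # HH: diagonal detail
        ]).unsqueeze(1)          # shape (4, 1, 2, 2)
        self.register_buffer("kernels", kernels)  # fixed, never trained

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, height, width = x.shape
        k = self.kernels.repeat(c, 1, 1, 1)       # one filter bank per channel
        # Grouped conv: sub-bands of channel i land at output channels 4i..4i+3
        return F.conv2d(x, k, stride=2, groups=c)  # (b, 4c, H/2, W/2)
```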
2.3 Modified Bidirectional Feature Pyramid Network (MBiFPN)
Effective fusion of features from different network depths is paramount for detecting objects at various scales. The path-aggregation neck (PAFPN) used in YOLOv8/v11 is effective but can be computationally heavy. Our MBiFPN structure builds upon the efficient BiFPN design, which introduces learnable weights for feature fusion and cross-scale bidirectional connections. We make two critical modifications for the anti-drone task.
First, we prune one downsampling stage and its subsequent PSMF module from the backbone. This maintains higher spatial resolution in the deeper layers, providing richer feature maps for small drone detection. Let the original feature pyramid levels be P2, P3, P4, P5 (from shallow to deep). After pruning, we work with P2, P3, and P4, where P4 becomes the new deepest level, retaining finer spatial detail than the standard P5 output while still providing strong semantic features.
Second, we construct the MBiFPN using these pruned features. It employs efficient cross-scale connections and weighted fusion. The fusion of two features $P_i$ and $P_j$ at a node is computed as:
$$ P^{out} = \text{Conv}\left( \frac{w_1 \cdot P_i + w_2 \cdot P_j}{w_1 + w_2 + \epsilon} \right) $$
where $w_1$ and $w_2$ are learnable weights per feature, and $\epsilon$ is a small constant for numerical stability. The bidirectional flow allows deep semantic information to refine shallow features and detailed shallow information to enhance deep features. This lightweight yet powerful fusion mechanism ensures that both the high-resolution details from early layers and the strong semantic context from later layers are optimally combined for accurate drone localization and classification across scales.
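A minimal sketch of one weighted-fusion node, following the equation above, is shown below. We assume the two inputs have already been resampled to a common resolution, and we apply the ReLU constraint the original BiFPN uses to keep the fusion weights non-negative; the exact conv configuration is our assumption.

```python
import torch
import torch.nn as nn

class WeightedFusion(nn.Module):
    """Fast normalized fusion of two feature maps (BiFPN-style sketch)."""

    def __init__(self, channels: int, eps: float = 1e-4):
        super().__init__()
        self.w = nn.Parameter(torch.ones(2))  # learnable w1, w2
        self.eps = eps                        # numerical stability constant
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, p_i: torch.Tensor, p_j: torch.Tensor) -> torch.Tensor:
        # Inputs must share shape: resample P_i / P_j beforehand.
        w = torch.relu(self.w)  # constrain weights to be non-negative
        fused = (w[0] * p_i + w[1] * p_j) / (w[0] + w[1] + self.eps)
        return self.conv(fused)
```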
3. Experiments and Analysis
3.1 Dataset and Implementation Details
We evaluate our model on the TIB-UAV dataset, a composite of public and self-collected images depicting drones in diverse environments like open skies, urban settings, and forested areas, under various lighting and weather conditions. The dataset contains 5,018 images, split into training, validation, and test sets with an 8:1:1 ratio. All models are trained from scratch for 300 epochs with an input size of $640 \times 640$, a batch size of 16, and the SGD optimizer with an initial learning rate of 0.01.
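Assuming the model is built on the Ultralytics YOLO codebase, training under the settings above might look like the following sketch. Both YAML file names are hypothetical placeholders for the modified architecture and dataset configurations; only standard Ultralytics training arguments are used.

```python
from ultralytics import YOLO

# 'pwm-yolov11n.yaml' and 'tib-uav.yaml' are hypothetical placeholders.
model = YOLO("pwm-yolov11n.yaml")  # build the architecture from a YAML spec
model.train(
    data="tib-uav.yaml",
    epochs=300,        # training schedule from Section 3.1
    imgsz=640,         # 640x640 input resolution
    batch=16,
    optimizer="SGD",
    lr0=0.01,          # initial learning rate
    pretrained=False,  # train from scratch
)
```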
3.2 Evaluation Metrics
We use standard metrics to assess performance: Precision (P), Recall (R), mean Average Precision at IoU = 0.5 (mAP@50), mAP averaged over IoU thresholds from 0.5 to 0.95 (mAP@50:95), number of parameters (Params), and Frames Per Second (FPS). Precision and Recall are defined as:
$$ P = \frac{TP}{TP + FP}, \quad R = \frac{TP}{TP + FN} $$
where $TP$, $FP$, and $FN$ are true positives, false positives, and false negatives, respectively. The Average Precision (AP) is the area under the Precision-Recall curve, and mAP is the mean AP across all classes.
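As a reference, the sketch below computes AP as the area under the precision-recall curve using the standard all-point interpolation; the paper does not state which interpolation variant it uses, so that choice is an assumption.

```python
import numpy as np

def average_precision(recall: np.ndarray, precision: np.ndarray) -> float:
    """Area under the precision-recall curve (all-point interpolation).

    `recall` and `precision` are assumed to be evaluated at successive
    confidence thresholds, with recall monotonically increasing.
    """
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([1.0], precision, [0.0]))
    # Precision envelope: make p monotonically non-increasing in recall.
    p = np.maximum.accumulate(p[::-1])[::-1]
    # Integrate only where recall changes.
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))
```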
3.3 Ablation Study
To validate the contribution of each component, we conduct an ablation study starting from the baseline YOLOv11n (denoted v11n). The results are summarized in the table below.
| Model | P (%) | R (%) | Params (M) | mAP@50 (%) | mAP@50:95 (%) |
|---|---|---|---|---|---|
| v11n (Baseline) | 93.1 | 80.4 | 2.6 | 89.6 | 47.5 |
| + PSMF only (v11n-1) | 91.1 | 83.7 | 2.6 | 89.9 | 47.8 |
| + WaveletPool only (v11n-2) | 93.8 | 82.5 | 2.2 | 89.6 | 47.2 |
| + MBiFPN only (v11n-3) | 91.6 | 91.2 | 1.6 | 92.9 | 49.2 |
| + PSMF & WaveletPool (v11n-4) | 92.0 | 82.7 | 2.2 | 89.6 | 47.3 |
| PWM-YOLOv11n (v11n-5) | 93.6 | 94.0 | 1.5 | 95.1 | 50.3 |
The analysis reveals that the MBiFPN structure alone brings the most substantial improvement in Recall and mAP while dramatically reducing parameters, highlighting its effectiveness for lightweight, accurate anti-drone detection. The PSMF module mainly improves Recall, helping to reduce missed detections, while WaveletPool trims parameters at comparable accuracy. The full integration of all three components (PWM-YOLOv11n) yields the best overall performance, achieving 95.1% mAP@50 and 50.3% mAP@50:95, gains of 5.5 and 2.8 percentage points over the baseline, with 42.3% fewer parameters. This demonstrates the synergy of the three designs.
3.4 Comparative Experiments
We compare our PWM-YOLOv11n against other state-of-the-art lightweight YOLO models and specialized anti-drone detectors. The results are presented in the following table.
| Model | P (%) | R (%) | Params (M) | mAP@50 (%) | mAP@50:95 (%) | FPS |
|---|---|---|---|---|---|---|
| YOLOv5n | 90.6 | 82.5 | 2.6 | 88.9 | 46.9 | 128.6 |
| YOLOv8n | 92.1 | 83.9 | 2.6 | 90.7 | 47.8 | 130.0 |
| YOLOv10n | 89.5 | 83.4 | 2.0 | 89.9 | 47.9 | 121.0 |
| YOLOv11n | 93.1 | 80.4 | 2.6 | 89.6 | 47.5 | 136.3 |
| Hyper-YOLO | 92.5 | 83.5 | 3.6 | 90.2 | 47.7 | 127.3 |
| YOLO-DAP | 89.8 | 92.5 | 1.23 | 92.7 | – | 135.2 |
| PWM-YOLOv11n (Ours) | 93.6 | 94.0 | 1.5 | 95.1 | 50.3 | 142.0 |
Our model outperforms all competitors on nearly every metric, achieving the highest Precision, Recall, and mAP scores. Notably, it surpasses the specialized anti-drone detector YOLO-DAP in mAP@50 by 2.4 percentage points while remaining more parameter-efficient than most other YOLO variants. Its inference speed (FPS) is also the highest, confirming that our improvements do not come at the cost of latency. This comprehensive advantage underscores the effectiveness of PWM-YOLOv11n for practical anti-drone systems.
3.5 Qualitative Analysis
Visual comparisons on challenging test samples further illustrate the strengths of our approach. In scenes with complex backgrounds (e.g., drones against white clouds or buildings), PWM-YOLOv11n maintains high confidence scores where other detectors fluctuate. In cases of strong glare or specular highlights, our model successfully detects drones that are missed by YOLOv11n and Hyper-YOLO, thanks to the robust feature representation from PSMF and WaveletPool. For extremely small drones occupying few pixels, PWM-YOLOv11n demonstrates a significantly lower rate of missed detections, a direct benefit of the high-resolution preservation and effective multi-scale fusion of MBiFPN. These visual results align with the quantitative metrics, confirming the model’s robustness in diverse anti-drone operational environments.
4. Conclusion
In this paper, we presented PWM-YOLOv11, a novel lightweight object detection algorithm designed to overcome the specific challenges of vision-based anti-drone systems. By integrating a Partial Series Multi-scale Fusion (PSMF) module, a Wavelet Pooling convolution, and a Modified BiFPN (MBiFPN) neck, the model achieves enhanced capability to capture discriminative features of small, blurry drones while effectively suppressing background interference. Extensive experiments on the TIB-UAV dataset validate our design, showing that PWM-YOLOv11n strikes a state-of-the-art balance among detection accuracy (95.1% mAP@50), model complexity (1.5M parameters), and inference speed. This makes it a highly promising solution for deploying efficient and reliable anti-drone surveillance on edge computing platforms. Future work may involve extending the model to multi-modal anti-drone detection by fusing visual data with thermal or RF signatures for all-weather, 24/7 operational capability.
