In recent years, rapid advances in drone technology have led to the widespread adoption of Unmanned Aerial Vehicles (UAVs) in fields such as agricultural monitoring, power line inspection, and security surveillance. These applications highlight the potential of UAVs to improve operational efficiency and reduce human labor. However, their proliferation also introduces significant security risks, including illegal “black flights” and malicious drone activity, which threaten public safety and critical infrastructure. To address these challenges, anti-drone detection systems have become essential for identifying and tracking unauthorized UAVs in real time. Traditional detection methods, such as radar, radio-frequency analysis, and acoustic sensing, often suffer from limitations such as high cost, environmental dependence, and susceptibility to noise interference. For instance, radar systems struggle with small radar cross-sections and ground clutter, while acoustic methods fail in noisy environments where drone signatures are masked. Consequently, vision-based detection using deep learning has emerged as a promising alternative, offering cost-effectiveness and adaptability to diverse scenarios. In this paper, we propose PWM-YOLOv11, an improved algorithm based on YOLOv11n, designed to detect UAVs in complex environments characterized by background clutter, varying illumination, and small target sizes. Our approach integrates novel modules to address these challenges, striking a balance between accuracy, speed, and model size.

Our work centers on addressing the inherent difficulties in detecting Unmanned Aerial Vehicles, which often appear as small objects with blurred edges due to high-speed motion and environmental factors. Existing object detection algorithms, including variants of YOLO, have shown promise but still face issues like missed detections and reduced precision under challenging conditions. For example, YOLOv11n incorporates advanced feature extraction modules like C3k2 and C2PSA, yet it may lose critical details during dimensionality reduction. To overcome this, we introduce the Partial Series Multi-scale Fusion (PSMF) module, which leverages multi-scale convolutional kernels and residual connections to enhance feature representation without increasing computational overhead. Additionally, we integrate WaveletPool convolution, a technique inspired by wavelet transforms, to preserve high-frequency details and suppress background noise during down-sampling. This is particularly important for drone technology applications where maintaining edge information is crucial for accurate detection. Furthermore, we design the Multi-scale Bidirectional Feature Pyramid Network (MBiFPN), a lightweight feature fusion structure that optimizes information flow across different scales, improving the detection of small UAVs while reducing model parameters. Through extensive experiments on the TIB-UAV dataset, which combines public and custom data, we demonstrate that PWM-YOLOv11 achieves superior performance compared to state-of-the-art methods, with significant improvements in metrics such as mean Average Precision (mAP) and recall, while maintaining real-time processing capabilities.
In the following sections, we delve into the details of our methodology, starting with an overview of the YOLOv11 architecture and its limitations in the context of anti-drone detection. We then describe the design and implementation of the PSMF module, WaveletPool convolution, and MBiFPN structure, supported by mathematical formulations and architectural diagrams. Subsequently, we present experimental results, including ablation studies and comparisons with other models, to validate the effectiveness of our approach. The integration of these components not only advances drone technology by enabling more reliable UAV detection but also contributes to the broader field of object detection in aerial imagery. By focusing on lightweight and efficient solutions, our work aligns with the growing demand for deployable anti-drone systems in real-world scenarios, where resources are often constrained. As drone technology continues to evolve, the need for robust detection algorithms will only intensify, making our contributions timely and relevant.
Related Work
Object detection in the domain of Unmanned Aerial Vehicles has been extensively studied, with numerous approaches leveraging deep learning techniques. Early methods relied on two-stage detectors like Faster R-CNN, which achieve high accuracy but are computationally intensive, limiting their use in real-time applications. In contrast, one-stage detectors such as YOLO and SSD offer a better trade-off between speed and accuracy, making them suitable for drone detection tasks. For instance, YOLOv5 and its variants have been adapted for UAV detection by incorporating attention mechanisms and enhanced feature fusion networks. However, these models often struggle with small targets and complex backgrounds, leading to reduced performance in practical anti-drone scenarios. Recent advancements in drone technology have prompted the development of specialized algorithms, such as Hyper-YOLO and YOLO-DAP, which focus on improving detection precision through modular innovations. Hyper-YOLO integrates hyper-networks to dynamically adjust parameters, while YOLO-DAP employs depthwise separable convolutions and optimized loss functions. Despite these improvements, issues like model redundancy and sensitivity to environmental factors persist. Our work builds upon YOLOv11, which introduces elements like C3k2 modules and depthwise convolutions in the head section, but we further enhance it with lightweight and efficient components tailored for Unmanned Aerial Vehicle detection.
Another key area of research involves multi-scale feature handling, which is critical for detecting UAVs at varying distances. Techniques like Feature Pyramid Networks (FPN) and their bidirectional variants (BiFPN) have been widely adopted to fuse features from different layers, enabling better representation of objects at multiple scales. However, standard FPN structures may introduce computational bottlenecks, especially in resource-constrained environments. To address this, we propose MBiFPN, which simplifies the network by pruning unnecessary layers and incorporating adaptive weight learning for feature fusion. This approach is inspired by the success of BiFPN in other domains but optimized specifically for the challenges of drone technology. Additionally, pooling operations play a vital role in down-sampling, but traditional methods like max pooling can lead to information loss. Wavelet-based pooling, as used in our WaveletPool module, offers an alternative by decomposing features into frequency sub-bands, preserving details that are essential for small object detection. This innovation aligns with the ongoing efforts to enhance the robustness of vision-based systems in anti-drone applications, where Unmanned Aerial Vehicles often exhibit rapid movements and varying appearances.
Methodology
Our proposed PWM-YOLOv11 algorithm is built upon the YOLOv11n architecture, which consists of three main components: Backbone, Neck, and Head. The Backbone is responsible for feature extraction, the Neck performs multi-scale feature fusion, and the Head generates the final detection outputs. We introduce three key modifications to address the limitations of the base model in detecting Unmanned Aerial Vehicles: the PSMF module, WaveletPool convolution, and MBiFPN structure. These innovations collectively improve feature representation, reduce computational cost, and enhance detection accuracy for small and blurry targets commonly encountered in drone technology applications.
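For orientation, the minimal sketch below shows how the three components slot into this backbone, neck, and head pipeline. It is an illustrative skeleton rather than our full implementation: the stages are injected as placeholders, and the concrete PSMF, WaveletPool, and MBiFPN modules are sketched in the subsections that follow.

```python
import torch
import torch.nn as nn

class Detector(nn.Module):
    """Generic backbone -> neck -> head composition (structural sketch only).

    In our variant the backbone's C3k2 blocks are replaced by PSMF and its
    strided down-sampling by WaveletPool, while the neck is the MBiFPN
    operating on the P2, P3, and P4 feature maps.
    """

    def __init__(self, backbone: nn.Module, neck: nn.Module, head: nn.Module):
        super().__init__()
        self.backbone = backbone
        self.neck = neck
        self.head = head

    def forward(self, x: torch.Tensor):
        p2, p3, p4 = self.backbone(x)   # multi-scale feature maps
        fused = self.neck(p2, p3, p4)   # bidirectional multi-scale fusion
        return self.head(fused)         # class scores and box regressions
```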
Partial Series Multi-scale Fusion (PSMF) Module
The PSMF module replaces the standard C3k2 modules in YOLOv11n to better handle multi-scale features and preserve edge details. The C3k2 module uses multiple Bottleneck structures with dimensionality reduction, which can lead to the loss of fine-grained information. In contrast, the PSMF module employs partial convolutions with kernels of sizes 3×3, 5×5, and 7×7, applied to a subset of the input channels. This allows the model to capture features at various scales without excessively compressing the channel dimension. The outputs of these convolutions are concatenated with the unchanged channels, followed by a 1×1 convolution to fuse the multi-scale information. A residual connection combines the input and processed features, ensuring that original details are retained while enriched contextual information is introduced. Mathematically, for an input feature map $X \in \mathbb{R}^{H \times W \times C}$, the PSMF operation can be described as follows:
Let $X_{\text{part}} \in \mathbb{R}^{H \times W \times C/2}$ be the portion of channels processed by the partial convolutions, with each convolution mapping $X_{\text{part}}$ to $C/4$ output channels. The outputs of the three branches are:
$$Y_3 = \text{Conv}_{3\times3}(X_{\text{part}}), \quad Y_5 = \text{Conv}_{5\times5}(X_{\text{part}}), \quad Y_7 = \text{Conv}_{7\times7}(X_{\text{part}}).$$
These are concatenated along the channel dimension: $Y_{\text{cat}} = [Y_3, Y_5, Y_7] \in \mathbb{R}^{H \times W \times 3C/4}$. The unchanged channels $X_{\text{rest}} \in \mathbb{R}^{H \times W \times C/2}$ are then combined with $Y_{\text{cat}}$ to form $Z = [X_{\text{rest}}, Y_{\text{cat}}] \in \mathbb{R}^{H \times W \times 5C/4}$. A 1×1 convolution is applied to reduce the channels back to C: $Z_{\text{fused}} = \text{Conv}_{1\times1}(Z)$. Finally, the residual connection produces the output: $O = X + Z_{\text{fused}}$. This design enhances the model’s ability to detect Unmanned Aerial Vehicles with varying sizes and resolutions, which is crucial for advancing drone technology in dynamic environments.
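A minimal PyTorch sketch of this computation is given below for illustration. It follows the shapes stated above (a C/2 processed split and C/4 output channels per branch); the class name is illustrative, and normalization and activation layers are omitted here and left as implementation details.

```python
import torch
import torch.nn as nn

class PSMF(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        assert channels % 4 == 0, "channels must be divisible by 4"
        half, quarter = channels // 2, channels // 4
        self.half = half
        # Three parallel branches over the processed half of the channels,
        # each producing C/4 channels at a different receptive field.
        self.branch3 = nn.Conv2d(half, quarter, 3, padding=1)
        self.branch5 = nn.Conv2d(half, quarter, 5, padding=2)
        self.branch7 = nn.Conv2d(half, quarter, 7, padding=3)
        # 1x1 fusion: C/2 untouched channels + 3*C/4 branch channels -> C.
        self.fuse = nn.Conv2d(half + 3 * quarter, channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x_part, x_rest = torch.split(x, self.half, dim=1)
        y = torch.cat(
            [x_rest, self.branch3(x_part), self.branch5(x_part), self.branch7(x_part)],
            dim=1,
        )
        return x + self.fuse(y)  # residual connection preserves original detail

if __name__ == "__main__":
    out = PSMF(64)(torch.randn(1, 64, 80, 80))
    print(out.shape)  # torch.Size([1, 64, 80, 80])
```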
WaveletPool Convolution
To address the issue of detail loss during down-sampling, we integrate WaveletPool convolution into the Backbone network. Traditional pooling operations, such as max pooling, often discard high-frequency components that are essential for detecting small UAVs. WaveletPool, based on the 2D discrete wavelet transform, decomposes the input feature map into four sub-bands: LL (approximation), LH (horizontal details), HL (vertical details), and HH (diagonal details). This decomposition preserves spatial information while reducing resolution, making it ideal for applications in drone technology where target details are sparse. The wavelet transform equations are defined as:
$$W_{\Phi}[j+1, k] = \sum_n h_{\Phi}[n-2k] \cdot W_{\Phi}[j, n],$$
$$W_{\Psi}[j+1, k] = \sum_n h_{\Psi}[n-2k] \cdot W_{\Phi}[j, n],$$
where $W_{\Phi}$ and $W_{\Psi}$ represent the approximation and detail coefficients, respectively, $h_{\Phi}$ and $h_{\Psi}$ are the scaling and wavelet vectors, $j$ denotes the resolution level, and $k$ is the translation parameter. The inverse transform reconstructs the features as:
$$W_{\Phi}[j, n] = \sum_k h_{\Phi}[n-2k] \cdot W_{\Phi}[j+1, k] + \sum_k h_{\Psi}[n-2k] \cdot W_{\Psi}[j+1, k].$$
In our implementation, we use Haar wavelets for efficiency. The LL sub-band is retained for further processing, while the detail sub-bands are used to enhance feature representation. This approach significantly reduces background noise and improves the model’s focus on relevant Unmanned Aerial Vehicle features, leading to higher detection accuracy in cluttered scenes.
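For illustration, the sketch below implements a single-level Haar decomposition consistent with the equations above. How the detail sub-bands are folded back in (here, concatenating all four sub-bands and mixing them with a 1×1 convolution) is a simplified assumption rather than the exact layout of the full module, and the class and function names are illustrative.

```python
import torch
import torch.nn as nn

def haar_dwt2d(x: torch.Tensor):
    """Single-level 2D Haar transform; H and W must be even."""
    a = x[..., 0::2, 0::2]  # top-left of each 2x2 block
    b = x[..., 0::2, 1::2]  # top-right
    c = x[..., 1::2, 0::2]  # bottom-left
    d = x[..., 1::2, 1::2]  # bottom-right
    ll = (a + b + c + d) / 2  # LL: approximation
    lh = (a + b - c - d) / 2  # LH: horizontal details
    hl = (a - b + c - d) / 2  # HL: vertical details
    hh = (a - b - c + d) / 2  # HH: diagonal details
    return ll, lh, hl, hh

class WaveletPool(nn.Module):
    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        # Mix the four sub-bands (4*C channels at half resolution)
        # into the desired output width.
        self.mix = nn.Conv2d(4 * in_channels, out_channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        ll, lh, hl, hh = haar_dwt2d(x)
        return self.mix(torch.cat([ll, lh, hl, hh], dim=1))

if __name__ == "__main__":
    print(WaveletPool(64, 128)(torch.randn(1, 64, 80, 80)).shape)  # [1, 128, 40, 40]
```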
Multi-scale Bidirectional Feature Pyramid Network (MBiFPN)
The MBiFPN structure is designed to optimize feature fusion in the Neck section, enabling efficient multi-scale detection of UAVs. Based on the BiFPN architecture, MBiFPN incorporates learnable weights to adaptively fuse features from different levels, giving more importance to relevant scales. However, we simplify the network by pruning the last down-sampling layer and associated PSMF modules in the Backbone, which reduces the receptive field and maintains higher spatial resolution for small targets. This pruning step decreases the model size while preserving detailed features essential for detecting Unmanned Aerial Vehicles. The MBiFPN operates on feature maps P2, P3, and P4 from the Backbone, where P2 has the highest resolution. The fusion process involves bidirectional connections with concatenation and convolution operations. For example, the output at level P3 is computed as:
$$P3_{\text{out}} = \text{Conv}(\text{Concat}(P3_{\text{in}}, \text{Upsample}(P4_{\text{in}}), \text{Downsample}(P2_{\text{in}}))),$$
where Upsample and Downsample denote interpolation and pooling operations, respectively. The learnable weights are applied during fusion to emphasize important features. This structure enhances the gradient flow and feature reuse, improving the detection of small UAVs without adding computational overhead. The lightweight nature of MBiFPN makes it suitable for real-time anti-drone systems, where efficiency is paramount in drone technology deployments.
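The sketch below illustrates one such fusion node at the P3 level. The BiFPN-style fast-normalized weights, nearest-neighbour upsampling, max-pooling down-sampling, and channel widths are illustrative assumptions; the full MBiFPN applies this pattern bidirectionally across P2, P3, and P4.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionNode(nn.Module):
    def __init__(self, channels: int, num_inputs: int = 3, eps: float = 1e-4):
        super().__init__()
        self.weights = nn.Parameter(torch.ones(num_inputs))
        self.eps = eps
        self.conv = nn.Conv2d(num_inputs * channels, channels, 3, padding=1)

    def forward(self, p3_in, p4_in, p2_in):
        # Bring every input to the P3 resolution.
        p4_up = F.interpolate(p4_in, size=p3_in.shape[-2:], mode="nearest")
        p2_down = F.adaptive_max_pool2d(p2_in, p3_in.shape[-2:])
        # Fast normalized fusion weights (non-negative, sum to ~1).
        w = F.relu(self.weights)
        w = w / (w.sum() + self.eps)
        scaled = [w[0] * p3_in, w[1] * p4_up, w[2] * p2_down]
        return self.conv(torch.cat(scaled, dim=1))

if __name__ == "__main__":
    node = FusionNode(channels=64)
    p2, p3, p4 = (torch.randn(1, 64, s, s) for s in (160, 80, 40))
    print(node(p3, p4, p2).shape)  # torch.Size([1, 64, 80, 80])
```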
Experiments and Results
We evaluate PWM-YOLOv11 on the TIB-UAV dataset, which comprises 5018 images from public and custom sources, covering diverse scenarios such as open skies, urban areas, and forests. The dataset is split into training, validation, and test sets in an 8:1:1 ratio. Our experiments are conducted on a system with an NVIDIA RTX A4000 GPU, using PyTorch 2.0.1 and CUDA 12.3. The input image size is set to 640×640, and we train the model for 300 epochs with a batch size of 16. The SGD optimizer is employed with an initial learning rate of 0.01. Performance metrics include Precision (P), Recall (R), mAP50, mAP50:90, the number of parameters (Params), and frames per second (FPS). The accuracy metrics are defined as:
$$P = \frac{TP}{TP + FP}, \quad R = \frac{TP}{TP + FN},$$
$$AP = \int_0^1 P(R) dR, \quad mAP = \frac{1}{n} \sum_{i=1}^n AP_i,$$
where TP, FP, and FN denote true positives, false positives, and false negatives, respectively, and n is the number of classes.
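For clarity, the short NumPy sketch below shows how these formulas apply to a ranked list of detections at a fixed IoU threshold; it is illustrative only, the helper function and toy inputs are hypothetical, and it is not the evaluation script used to produce the reported results.

```python
import numpy as np

def average_precision(scores: np.ndarray, is_tp: np.ndarray, num_gt: int) -> float:
    order = np.argsort(-scores)              # rank detections by confidence
    tp = np.cumsum(is_tp[order])
    fp = np.cumsum(~is_tp[order])
    recall = tp / max(num_gt, 1)             # R = TP / (TP + FN), num_gt = TP + FN
    precision = tp / np.maximum(tp + fp, 1)  # P = TP / (TP + FP)
    # Enforce a monotonically decreasing precision envelope, then integrate P(R).
    precision = np.maximum.accumulate(precision[::-1])[::-1]
    r = np.concatenate(([0.0], recall, [recall[-1]]))
    p = np.concatenate(([precision[0]], precision, [0.0]))
    return float(np.sum((r[1:] - r[:-1]) * p[1:]))

if __name__ == "__main__":
    scores = np.array([0.9, 0.8, 0.7, 0.6, 0.3])
    is_tp = np.array([True, True, False, True, False])
    print(average_precision(scores, is_tp, num_gt=4))  # 0.6875
```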
Ablation Study
We conduct ablation experiments to assess the individual contribution of each module in PWM-YOLOv11. The baseline model is YOLOv11n, and we incrementally add the PSMF module, WaveletPool convolution, and MBiFPN structure. The results are summarized in Table 1, showing that each component improves the model in terms of accuracy and model size. For instance, replacing C3k2 with PSMF (v11n-1) increases recall by 3.3% and mAP50 by 0.3% without changing the parameter count. Adding WaveletPool (v11n-2) boosts precision by 0.7% and recall by 2.1% while reducing parameters by 0.4M. The MBiFPN structure (v11n-3) achieves the largest gain, with mAP50 improving by 3.3% and parameters decreasing by 1.0M. The full model (v11n-5) combines all modules, resulting in a 5.5% improvement in mAP50 and a 42.3% reduction in parameters compared to the baseline. These findings underscore the effectiveness of each proposed component for UAV detection.
| Model | P (%) | R (%) | Params (M) | mAP50 (%) | mAP50:90 (%) |
|---|---|---|---|---|---|
| v11n (Baseline) | 93.1 | 80.4 | 2.6 | 89.6 | 47.5 |
| v11n-1 (PSMF) | 91.1 | 83.7 | 2.6 | 89.9 | 47.8 |
| v11n-2 (WaveletPool) | 93.8 | 82.5 | 2.2 | 89.6 | 47.2 |
| v11n-3 (MBiFPN) | 91.6 | 91.2 | 1.6 | 92.9 | 49.2 |
| v11n-4 (PSMF + WaveletPool) | 92.0 | 82.7 | 2.2 | 89.6 | 47.3 |
| v11n-5 (Full Model) | 93.6 | 94.0 | 1.5 | 95.1 | 50.3 |
Comparative Analysis
We compare PWM-YOLOv11 with several state-of-the-art models, including Hyper-YOLO, YOLOv5n, YOLOv6n, YOLOv8n, YOLOv10n, YOLOv11n, YOLOv12n, and YOLO-DAP, under identical training conditions. The results in Table 2 demonstrate that our model achieves the highest mAP50 (95.1%) and mAP50:90 (50.3%), along with superior recall (94.0%) and competitive precision (93.6%). Notably, PWM-YOLOv11 has only 1.5M parameters, 42.3% fewer than YOLOv11n, and runs at 142.0 FPS, indicating real-time capability. In contrast, models such as YOLOv10n and YOLO-DAP fall short on key metrics despite their lightweight designs. These results highlight the advantage of our integrated approach in balancing accuracy and efficiency for UAV detection in anti-drone applications.
| Model | P (%) | R (%) | Params (M) | mAP50 (%) | mAP50:90 (%) | FPS |
|---|---|---|---|---|---|---|
| Hyper-YOLO | 92.5 | 83.5 | 3.6 | 90.2 | 47.7 | 127.3 |
| YOLOv5n | 90.6 | 82.5 | 2.6 | 88.9 | 46.9 | 128.6 |
| YOLOv6n | 91.2 | 81.3 | 4.2 | 88.2 | 46.6 | 117.5 |
| YOLOv8n | 92.1 | 83.9 | 2.6 | 90.7 | 47.8 | 130.0 |
| YOLOv10n | 89.5 | 83.4 | 2.0 | 89.9 | 47.9 | 121.0 |
| YOLOv11n | 93.1 | 80.4 | 2.6 | 89.6 | 47.5 | 136.3 |
| YOLOv12n | 89.4 | 80.4 | 2.5 | 87.2 | 45.6 | 126.6 |
| YOLO-DAP | 89.8 | 92.5 | 1.23 | 92.7 | – | 135.2 |
| PWM-YOLOv11n | 93.6 | 94.0 | 1.5 | 95.1 | 50.3 | 142.0 |
Visualization and Discussion
To qualitatively evaluate the performance, we visualize detection results on sample images from the TIB-UAV dataset, comparing PWM-YOLOv11 with YOLOv11n and Hyper-YOLO. In scenes with complex backgrounds, such as drones against white clouds or buildings, our model demonstrates higher detection confidence and accuracy, with improvements of up to 9% over YOLOv11n. Under varying illumination conditions, where reflections and shadows obscure UAV features, PWM-YOLOv11 successfully identifies targets that are missed by other models. For small-sized drones, our algorithm maintains robust detection due to the enhanced feature preservation in the PSMF and WaveletPool modules. These visual assessments confirm the practical benefits of our approach in real-world drone technology applications, where environmental factors often degrade performance. The integration of frequency-based processing and adaptive fusion allows PWM-YOLOv11 to excel in challenging scenarios, making it a valuable tool for anti-drone systems.
Conclusion
In this paper, we present PWM-YOLOv11, a lightweight object detection algorithm designed for anti-Unmanned Aerial Vehicle tasks. By incorporating the PSMF module, WaveletPool convolution, and MBiFPN structure, we address key challenges in UAV detection, such as small target size, motion blur, and background clutter. Experimental results on the TIB-UAV dataset show significant improvements in accuracy and efficiency, with a 5.5% increase in mAP50 and a 42.3% reduction in parameters compared to YOLOv11n. The algorithm’s ability to maintain real-time performance while achieving high precision makes it suitable for deployment in resource-constrained environments. As drone technology continues to evolve, our work provides a foundation for future research in lightweight and robust detection systems. Potential directions include extending the approach to multi-modal data fusion and adapting it for other aerial objects. Overall, PWM-YOLOv11 represents a step forward in securing airspace against unauthorized UAVs, contributing to the advancement of drone technology and its safe integration into society.
