PWM-YOLO: A Lightweight Object Detection Algorithm for Anti-UAV Targets

The rapid advancement of drone technology, marked by increasing intelligence and automation, has delivered substantial benefits across diverse fields such as agricultural inspection, power line patrol, and security monitoring. However, this proliferation has concurrently introduced significant security challenges. Incidents involving illegal “rogue flights” and the malicious use of drones for unlawful activities are becoming more frequent, posing tangible threats to public safety and security. Consequently, the timely identification and tracking of unauthorized or intruding drones through robust anti-UAV detection technology has emerged as a critical task. Current mainstream detection techniques, including radar, radio frequency (RF) spectrum analysis, and acoustic sensing, are often hampered by limitations such as high cost, significant environmental dependencies, and susceptibility to interference. For instance, radar struggles with the small radar cross-section and high velocity of drones amidst ground clutter. RF detection fails against drones not emitting communication signals, while acoustic methods are easily confounded by complex background noise masking the subtle acoustic signatures of drone motors. Therefore, achieving rapid and precise target detection remains a pressing and significant challenge within the anti-UAV domain.

To address the complexities inherent in anti-UAV target detection, researchers have proposed various innovative algorithms. For example, some have integrated coordinate attention mechanisms and adaptive feature fusion networks into the YOLOX framework to enhance the saliency of drone targets and the expressive power of key features, thereby improving both accuracy and robustness against occlusion. Others have built upon the YOLOv5s architecture by incorporating a Slim-Neck design using depthwise separable convolutions for a lightweight feature pyramid and employed the Alpha-CIoU loss function to bolster feature extraction capability while maintaining computational efficiency. Further improvements include refining the feature extraction network with a Dilated Residual Attention module and adopting the PIoU loss function to accelerate model convergence and boost detection accuracy. Another approach based on YOLOv7 introduced an ELAN module to suppress noise during feature fusion and an enhanced HEAD module to reduce the missed detection rate for small drones.

Although these existing methods demonstrate improvements over their baseline models, they continue to face considerable difficulties in real-world anti-UAV scenarios characterized by small target sizes, complex backgrounds, and variable lighting conditions. These challenges often manifest as missed detections, false alarms, and suboptimal precision. To meet the dual requirements of speed and accuracy in practical anti-UAV applications, this paper introduces an improved lightweight target detection algorithm named PWM-YOLOv11, based on the YOLOv11n architecture. Our contributions are threefold: (1) We design a Partial Series Multi-scale Fusion (PSMF) module to enhance the model’s capability to extract multi-scale contextual semantic information, effectively mitigating the issue of missed detections for small, distant drones suffering from blurred edge details. (2) We introduce a Wavelet Pooling (WaveletPool) convolution to suppress background noise interference while preserving crucial target features, thereby improving both detection accuracy and inference speed. (3) We propose a lightweight Multi-scale Bidirectional Feature Pyramid Network (MBiFPN) structure to optimize information flow and enhance multi-scale feature fusion efficiency, leading to superior overall detection performance. Extensive experiments on the TIB-UAV drone dataset show that PWM-YOLOv11 achieves a mAP@50 of 95.1% and a mAP50:95 of 50.3%, representing gains of 5.5% and 2.8% over YOLOv11n, respectively, while simultaneously reducing the parameter count by 42.3%. The improved model achieves an excellent balance between detection accuracy, efficiency, and model size, making it highly suitable for real-world anti-UAV target detection tasks.

Network Architecture of PWM-YOLOv11

YOLOv11 represents an innovative leap forward from YOLOv8. Its core structure remains divided into three parts: the Backbone, the Neck, and the Head. The Backbone introduces novel modules like C3k2 and C2PSA to replace or augment previous designs for stronger feature extraction. The Neck section is largely similar to YOLOv8, maintaining its feature fusion mechanisms. The Head incorporates depthwise separable convolutions to improve detection efficiency and accuracy. While YOLOv11’s optimizations significantly enhance feature extraction and fusion, it still faces the dual challenges of maintaining high precision and achieving lightweight design in demanding anti-UAV target detection scenarios. To tackle these issues, this paper proposes three key improvements to the YOLOv11n model, resulting in the novel PWM-YOLOv11 architecture.

The Partial Series Multi-scale Fusion (PSMF) Module

Detecting drones in flight is often complicated by motion blur and the loss of fine edge details due to their dynamic states, which negatively impacts detection accuracy. To address this, we propose the PSMF module to replace the C3k2 modules within the YOLOv11n model. Although the C3k2 module improves feature extraction efficiency through multiple bottleneck structures, its internal dimensionality reduction operations may prevent sufficient extraction of nuanced details. Therefore, we abandon the standard bottleneck design. The PSMF module is constructed using partial convolutions with different kernel sizes combined with a residual connection. Unlike standard convolution, partial convolution performs multi-scale feature extraction only on a subset of the input channels, leaving the remaining channels unchanged. The features from these different scale operations are then concatenated as output, improving computational efficiency.

Specifically, the input features are processed sequentially by partial convolutions with kernel sizes of 3×3, 5×5, and 7×7 to capture features at different receptive fields. The convoluted feature maps are then concatenated with the bypassed (unconvoluted) feature maps. A 1×1 convolution is subsequently introduced to fuse these multi-scale detail features effectively. Finally, a residual structure adds the original input features to the processed output features. This design ensures the retention of original information while incorporating new multi-scale contextual information, significantly enhancing the model’s ability to recognize fine details and understand scene context for robust performance in complex anti-UAV environments.
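The PSMF structure described above can be sketched in PyTorch as follows. This is a minimal illustration rather than the authors' implementation: the channel-split ratio (one quarter here), the omission of normalization and activation layers, and the exact ordering of operations are all assumptions.

```python
import torch
import torch.nn as nn

class PartialConv(nn.Module):
    """Convolve only the first 1/n_div of the channels; pass the rest through
    unchanged, then concatenate (the partial-convolution idea in the text)."""
    def __init__(self, channels: int, kernel_size: int, n_div: int = 4):
        super().__init__()
        self.conv_ch = channels // n_div
        self.conv = nn.Conv2d(self.conv_ch, self.conv_ch, kernel_size,
                              padding=kernel_size // 2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x1, x2 = x[:, :self.conv_ch], x[:, self.conv_ch:]
        return torch.cat([self.conv(x1), x2], dim=1)

class PSMF(nn.Module):
    """Sketch of Partial Series Multi-scale Fusion: sequential 3x3 / 5x5 / 7x7
    partial convolutions, a 1x1 fusion convolution, and a residual add."""
    def __init__(self, channels: int):
        super().__init__()
        self.pconvs = nn.Sequential(
            PartialConv(channels, 3),
            PartialConv(channels, 5),
            PartialConv(channels, 7),
        )
        self.fuse = nn.Conv2d(channels, channels, 1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # residual connection retains the original information
        return x + self.fuse(self.pconvs(x))
```

Because each partial convolution touches only a quarter of the channels, the multi-scale branch adds far fewer FLOPs than full 3×3/5×5/7×7 convolutions over all channels would.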

The Wavelet Pooling (WaveletPool) Convolution

To address issues like motion artifacts caused by fast-flying drones and the loss of critical information during downsampling in traditional pooling layers, we incorporate a Wavelet Pooling convolution. Conventional pooling operations (e.g., max or average pooling) often discard important spatial details. In contrast, the WaveletPool module leverages a frequency separation mechanism. Its core idea is to use fixed directional filters to decompose the feature map via convolution, expanding the channels into four frequency sub-bands during downsampling. This design preserves edge information to a greater extent and enhances the backbone network’s capacity to represent multi-frequency features.

Inspired by wavelet transform theory, this module extracts and reconstructs image features through different frequency components. The 2D discrete wavelet transform can be formulated as follows for approximation and detail coefficients:

$$
W_{\Phi}[j+1, k] = \sum_{n} h_{\Phi}[n-2k] W_{\Phi}[j, n]
$$

$$
W_{\Psi}[j+1, k] = \sum_{n} h_{\Psi}[n-2k] W_{\Phi}[j, n]
$$

$$
W_{\Phi}[j, n] = \sum_{k} h_{\Phi}[n-2k] W_{\Phi}[j+1, k] + \sum_{k} h_{\Psi}[n-2k] W_{\Psi}[j+1, k]
$$

Here, $\Phi$ denotes the scaling (approximation) function, $\Psi$ denotes the wavelet (detail) function, $W_{\Phi}$ and $W_{\Psi}$ represent the approximation and detail coefficients respectively, and $h_{\Phi}[n-2k]$ and $h_{\Psi}[n-2k]$ are the time-reversed scaling and wavelet vectors, where $[n]$ indexes samples and $[j]$ denotes the resolution level.
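A one-dimensional Haar example illustrates the analysis and synthesis equations above, including perfect reconstruction. The Haar basis and its filter taps are an illustrative choice; the text does not fix a particular wavelet.

```python
import numpy as np

# Haar analysis filters (illustrative choice of wavelet basis)
# h_phi = [1, 1] / sqrt(2)  (scaling / low-pass)
# h_psi = [1, -1] / sqrt(2) (wavelet / high-pass)

def haar_analysis(x):
    """One decomposition level: filter and downsample by 2.
    Returns W_Phi[j+1, k] (approximation) and W_Psi[j+1, k] (detail),
    both computed from the approximation W_Phi[j, n] at the finer level."""
    x = np.asarray(x, dtype=float)
    approx = (x[0::2] + x[1::2]) / np.sqrt(2.0)
    detail = (x[0::2] - x[1::2]) / np.sqrt(2.0)
    return approx, detail

def haar_synthesis(approx, detail):
    """Inverse step: upsample and filter, recovering W_Phi[j, n] exactly."""
    x = np.empty(2 * len(approx))
    x[0::2] = (approx + detail) / np.sqrt(2.0)
    x[1::2] = (approx - detail) / np.sqrt(2.0)
    return x
```

Running the analysis followed by the synthesis reproduces the input signal, mirroring the reconstruction equation.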

The WaveletPool operation decomposes the input into four sub-bands: LL (approximation, containing main contours), LH (horizontal details), HL (vertical details), and HH (diagonal details/fine texture). These feature maps are passed to subsequent layers for further processing and refinement. Through multi-level wavelet decomposition and feature fusion, the network generates multi-scale feature maps rich in frequency details, providing more representative features for the final anti-UAV detection task.
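A minimal PyTorch sketch of such a pooling layer, assuming 2D Haar filters applied as a fixed depthwise strided convolution; the actual WaveletPool filter coefficients and normalization may differ.

```python
import torch
import torch.nn as nn

class HaarWaveletPool(nn.Module):
    """Sketch of wavelet pooling: fixed 2D Haar filters decompose each input
    channel into LL/LH/HL/HH sub-bands while downsampling by 2, so the
    channel count expands fourfold as described in the text."""
    def __init__(self, channels: int):
        super().__init__()
        ll = torch.tensor([[0.5, 0.5], [0.5, 0.5]])     # approximation
        lh = torch.tensor([[0.5, 0.5], [-0.5, -0.5]])   # horizontal details
        hl = torch.tensor([[0.5, -0.5], [0.5, -0.5]])   # vertical details
        hh = torch.tensor([[0.5, -0.5], [-0.5, 0.5]])   # diagonal details
        filters = torch.stack([ll, lh, hl, hh]).unsqueeze(1)  # (4, 1, 2, 2)
        # one set of four fixed filters per channel, applied as a grouped conv
        self.register_buffer("weight", filters.repeat(channels, 1, 1, 1))
        self.channels = channels

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return nn.functional.conv2d(x, self.weight, stride=2,
                                    groups=self.channels)
```

Because the filters are fixed, the layer adds no learnable parameters; the frequency separation happens purely through the filter structure.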

The Multi-scale Bidirectional Feature Pyramid Network (MBiFPN)

To capture a drone's complete form and keep it from leaving the camera's field of view, surveillance systems often cover a wide shooting area. As a result, drones occupy only a small pixel area within the image and carry insufficient detail, which leads to missed and false detections. While YOLOv8's PAFPN structure adaptively fuses multi-scale features, its incorporation of pyramid attention mechanisms increases model complexity and computational load.

To overcome these limitations and enhance multi-scale feature extraction, we propose an optimized MBiFPN structure based on the efficient BiFPN design. In our MBiFPN, features P2-P4 are extracted from the backbone network, and we retain concatenation for cross-scale connections. Crucially, we prune the last downsampling layer and a PSMF module from the backbone. Since shallow features possess higher resolution and richer spatial detail, this pruning shrinks the backbone's receptive field and preserves the spatial resolution needed to retain small-target details, improving both detection speed and accuracy for anti-UAV tasks.

Furthermore, the neck's MBiFPN inherits BiFPN's strength in multi-level feature fusion across resolutions, using learnable weights for adaptive fusion. This enables more accurate information integration, sharpens the model's perception of target details, and improves overall detection precision. Combining the pruned shallow features from the backbone with the deeply fused multi-scale features from the MBiFPN compensates for detail loss in deep features and augments semantic information in shallow features, achieving an excellent balance between model lightweighting, speed, and accuracy for anti-UAV systems.
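The learnable weighted fusion that MBiFPN inherits from BiFPN can be sketched as a fast-normalized fusion node, following the BiFPN paper's formulation; how this node is wired into the MBiFPN is not specified here and the epsilon value is an assumption.

```python
import torch
import torch.nn as nn

class WeightedFusion(nn.Module):
    """BiFPN-style fast normalized fusion: each input feature map receives a
    learnable non-negative weight, and the weights are normalized so the
    fused output is an adaptive convex-like combination of the inputs."""
    def __init__(self, n_inputs: int, eps: float = 1e-4):
        super().__init__()
        self.w = nn.Parameter(torch.ones(n_inputs))
        self.eps = eps

    def forward(self, feats):
        w = torch.relu(self.w)            # keep fusion weights non-negative
        w = w / (w.sum() + self.eps)      # fast normalization (no softmax)
        return sum(wi * f for wi, f in zip(w, feats))
```

The ReLU-plus-normalization scheme is cheaper than a softmax over the weights while still letting the network learn how much each resolution contributes.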

Experiments and Analysis

Dataset and Experimental Setup

To validate the performance of the PWM-YOLOv11 algorithm for anti-UAV target detection, we conducted experiments on the TIB-UAV dataset. This dataset is a composite of the public TIB-Net dataset and a custom-collected UAV dataset. The TIB-Net portion consists of images captured by ground-based fixed cameras approximately 500 meters from aerial drones, covering various times and weather conditions across scenes like open skies, forests, and urban areas. To enhance model generalization and supplement scenarios missing from the public data, we constructed a custom UAV dataset by collecting numerous drone images from the web, also covering common flight environments like skies, forests, and buildings. The combined TIB-UAV dataset contains 5,018 images, which we split into training, validation, and test sets in an 8:1:1 ratio.

Our experimental environment uses Windows 10, an NVIDIA RTX A4000 GPU, and 16GB of RAM. The deep learning framework is PyTorch 2.0.1 with Python 3.8.10 and CUDA 12.3. The input image resolution is set to 640×640. We trained the models for 300 epochs with a batch size of 16, using the SGD optimizer with an initial learning rate of 0.01.

Evaluation Metrics

We employ several standard metrics to evaluate model performance: Precision (P), Recall (R), mean Average Precision at an IoU threshold of 0.5 (mAP@50 or mAP50), mean Average Precision over IoU thresholds from 0.5 to 0.95 (mAP50:95), number of parameters (Params), and Frames Per Second (FPS). Precision reflects the reliability of positive predictions, Recall indicates the model’s ability to find all positive instances, and mAP measures overall detection performance across classes. Parameters indicate model complexity and storage requirements. The formulas are as follows:

$$
P = \frac{TP}{TP + FP}
$$

$$
R = \frac{TP}{TP + FN}
$$

$$
AP = \int_{0}^{1} P(R) dR
$$

$$
mAP = \frac{1}{n} \sum_{i=1}^{n} AP_i
$$

where $TP$ denotes true positives, $FP$ false positives, $FN$ false negatives, $AP$ is the area under the Precision-Recall curve, and $n$ is the number of classes.
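As a concrete illustration of these metrics, the sketch below computes AP with all-point interpolation over the precision-recall curve. This is one common convention; the exact interpolation scheme behind the reported results is an assumption.

```python
import numpy as np

def average_precision(recall, precision):
    """Area under the precision-recall curve (all-point interpolation)."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    # replace precision with its monotonically decreasing envelope
    p = np.maximum.accumulate(p[::-1])[::-1]
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

# illustrative counts (hypothetical numbers, not from the paper)
tp, fp, fn = 80, 10, 20
P = tp / (tp + fp)   # precision
R = tp / (tp + fn)   # recall
```

mAP then averages the per-class AP values; with a single drone class, mAP equals that class's AP.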

Ablation Study

To quantitatively evaluate the contribution of each proposed component to the anti-UAV detection performance, we conducted an ablation study using YOLOv11n as the baseline. The results are summarized in the table below.

| Model | P (%) | R (%) | Params (M) | mAP50 (%) | mAP50:95 (%) |
|---|---|---|---|---|---|
| YOLOv11n (Baseline) | 93.1 | 80.4 | 2.6 | 89.6 | 47.5 |
| + PSMF only | 91.1 | 83.7 | 2.6 | 89.9 | 47.8 |
| + WaveletPool only | 93.8 | 82.5 | 2.2 | 89.6 | 47.2 |
| + MBiFPN only | 91.6 | 91.2 | 1.6 | 92.9 | 49.2 |
| + PSMF + WaveletPool | 92.0 | 82.7 | 2.2 | 89.6 | 47.3 |
| PWM-YOLOv11n (Ours) | 93.6 | 94.0 | 1.5 | 95.1 | 50.3 |

Replacing C3k2 modules with the PSMF module (row 2) improved Recall and mAP50 without adding parameters, validating its effectiveness in capturing multi-scale details. Integrating WaveletPool convolution (row 3) improved Precision and Recall, demonstrating its utility in preserving features and suppressing noise. Employing the MBiFPN structure alone (row 4) caused a significant reduction in parameters (1.0M) alongside substantial gains in Recall, mAP50, and mAP50:95, confirming its success in lightweight, efficient feature fusion. The combination of PSMF and WaveletPool (row 5) maintained accuracy while reducing parameters, showing their compatible design. Our full PWM-YOLOv11n model (row 6), integrating all three components, achieved the best performance across all key metrics—significantly outperforming the baseline in Precision (+0.5%), Recall (+13.6%), mAP50 (+5.5%), and mAP50:95 (+2.8%)—while also having the smallest parameter count (1.5M, reduced by 1.1M). This underscores the powerful synergistic effect of our proposed modules for advanced anti-UAV detection.

Comparative Experiments

To further validate the superiority of our PWM-YOLOv11n algorithm, we compared it against several state-of-the-art and classic one-stage object detection models under identical training settings. The results are presented in the following table.

| Model | P (%) | R (%) | Params (M) | mAP50 (%) | mAP50:95 (%) | FPS |
|---|---|---|---|---|---|---|
| Hyper-YOLO | 92.5 | 83.5 | 3.6 | 90.2 | 47.7 | 127.3 |
| YOLOv5n | 90.6 | 82.5 | 2.6 | 88.9 | 46.9 | 128.6 |
| YOLOv6n | 91.2 | 81.3 | 4.2 | 88.2 | 46.6 | 117.5 |
| YOLOv8n | 92.1 | 83.9 | 2.6 | 90.7 | 47.8 | 130.0 |
| YOLOv10n | 89.5 | 83.4 | 2.0 | 89.9 | 47.9 | 121.0 |
| YOLOv11n | 93.1 | 80.4 | 2.6 | 89.6 | 47.5 | 136.3 |
| YOLOv12n | 89.4 | 80.4 | 2.5 | 87.2 | 45.6 | 126.6 |
| YOLO-DAP* | 89.8 | 92.5 | 1.23 | 92.7 | N/A | 135.2 |
| PWM-YOLOv11n (Ours) | 93.6 | 94.0 | 1.5 | 95.1 | 50.3 | 142.0 |

*YOLO-DAP results are from its respective publication; mAP50:95 was not reported.

Apart from our method, only YOLOv8n, Hyper-YOLO, and YOLO-DAP achieved an mAP50 above 90%, yet all three are outperformed by PWM-YOLOv11n across most metrics. Compared to the lightweight YOLOv10n, our model uses 0.5M fewer parameters while delivering superior results in Precision, Recall, mAP50, mAP50:95, and FPS. Against its direct baseline, YOLOv11n, our model improves every reported metric, including a 0.5% gain in Precision, while being significantly more lightweight. Although YOLO-DAP has slightly fewer parameters, PWM-YOLOv11n surpasses it in Precision, Recall, mAP50, and FPS. These results comprehensively demonstrate that our proposed algorithm achieves an outstanding balance between high detection accuracy, model lightweighting, and real-time inference speed, making it a highly effective solution for practical anti-UAV systems.

Visualization Analysis

To qualitatively highlight the detection effectiveness of our improved algorithm under challenging anti-UAV scenarios—complex backgrounds, lighting variations, and small target sizes—we provide visual comparisons between YOLOv11n, Hyper-YOLO, and our PWM-YOLOv11n. In scenes with cluttered backgrounds where drones blend with similar-colored clouds or buildings, all three models can detect the target, but our PWM-YOLOv11n consistently predicts with higher confidence scores, demonstrating its superior feature discrimination. In cases with strong lighting or glare causing reflections on the drone, both YOLOv11n and Hyper-YOLO occasionally fail to detect the target (missed detection), whereas our algorithm robustly identifies it. For distant drones appearing as very small objects in the image, our model shows more accurate and confident detections compared to the others, which sometimes exhibit lower confidence or miss the target entirely. These visual results corroborate the quantitative findings, proving that the PWM-YOLOv11n algorithm performs more reliably and accurately in difficult real-world conditions encountered in anti-UAV operations.

Conclusion

This paper addresses the persistent challenges in anti-UAV target detection, such as complex backgrounds, variable lighting, and small target size, which lead to missed detections, false alarms, and limited accuracy in existing methods. To overcome these issues, we propose PWM-YOLOv11n, an improved lightweight target detection algorithm based on YOLOv11n. The core innovations include: (1) the Partial Series Multi-scale Fusion (PSMF) module, which enhances multi-scale contextual feature extraction to better handle blurred edges; (2) the Wavelet Pooling (WaveletPool) convolution, which suppresses background noise while preserving critical target features; and (3) the Multi-scale Bidirectional Feature Pyramid Network (MBiFPN), which achieves efficient, lightweight multi-scale feature fusion. Comprehensive experiments on the TIB-UAV dataset demonstrate that our PWM-YOLOv11n algorithm significantly outperforms the baseline, improving Precision, Recall, mAP50, and mAP50:95 by 0.5%, 13.6%, 5.5%, and 2.8%, respectively, while reducing the parameter count by 1.1M (42.3%). The model achieves an excellent balance between high accuracy, low computational cost, and fast inference, validating its effectiveness and advanced capabilities for practical anti-UAV target detection applications.
