CBAM-Enhanced YOLOv5s for Small UAV Detection in Complex Backgrounds

Detecting small unmanned aerial vehicles (UAVs) has become a critical task in low-altitude security monitoring, especially at airport perimeters, around critical infrastructure, and at large public events. The rapid proliferation of unauthorized UAV flights poses serious safety threats, yet these small, fast-moving objects remain hard to detect because of their tiny pixel footprint, complex background interference, and varying illumination. Traditional computer vision methods based on handcrafted features struggle to adapt to such variability, while deep learning-based object detectors offer a better balance between accuracy and real-time performance. Among them, YOLOv5s stands out for its lightweight architecture and high inference speed, which make it suitable for edge deployment. However, its ability to detect small UAV targets is limited because the features of very small objects tend to be diluted during multi-layer downsampling. To address this, we insert a Convolutional Block Attention Module (CBAM) into the backbone of YOLOv5s, forming the YOLOv5s-CBAM model. The enhanced model selectively emphasizes informative channels and spatial regions, improving the detection of small UAV targets without substantially increasing computational overhead. In this paper, we detail the architecture, present the mathematical formulation of the attention mechanisms, report experimental results on a subset of the VisDrone2021 dataset, and discuss the practical implications for real-world low-altitude security systems.

YOLOv5s Architecture and Its Limitations in Small UAV Detection

YOLOv5s is a single-stage object detector composed of four main components: an input preprocessing module, a backbone for feature extraction, a neck for multi-scale feature fusion, and a detection head for bounding box and class prediction. The backbone adopts a CSPDarknet structure with multiple C3 modules, each integrating cross-stage partial connections to reduce computation while maintaining representational power. The neck uses a PANet-like structure to aggregate features from different scales, and the detection head outputs predictions at three scales to handle objects of various sizes.

Let the input image be denoted as $\mathbf{I} \in \mathbb{R}^{3 \times H \times W}$. After normalization and augmentation, the backbone produces a series of feature maps $\{\mathbf{F}_1, \mathbf{F}_2, \mathbf{F}_3\}$ at different resolutions. The neck then fuses these maps to obtain enhanced multi-scale features $\mathbf{P}_3, \mathbf{P}_4, \mathbf{P}_5$. The loss function consists of three terms: bounding box regression loss, objectness loss, and classification loss. For bounding box regression, YOLOv5 adopts CIoU loss:

$$
\mathcal{L}_{\text{CIoU}} = 1 - \text{IoU} + \frac{\rho^2(\mathbf{b}, \mathbf{b}^{\text{gt}})}{c^2} + \alpha v,
$$

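where $\rho$ is the Euclidean distance between the predicted box center $\mathbf{b}$ and the ground-truth center $\mathbf{b}^{\text{gt}}$, $c$ is the diagonal length of the smallest box enclosing both boxes, $v = \frac{4}{\pi^2}\left(\arctan\frac{w^{\text{gt}}}{h^{\text{gt}}} - \arctan\frac{w}{h}\right)^2$ measures aspect-ratio consistency between the predicted width and height $(w, h)$ and the ground-truth $(w^{\text{gt}}, h^{\text{gt}})$, and $\alpha = \frac{v}{(1 - \text{IoU}) + v}$ is a positive trade-off weight.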
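As an illustration of the CIoU formula above, the following minimal sketch evaluates the loss for a single pair of boxes. It assumes boxes are given as corner coordinates `(x1, y1, x2, y2)` and uses the standard CIoU definitions of $v$ and $\alpha$; the function name `ciou_loss` and the small constant `eps` are our own choices for illustration, not part of the YOLOv5 code base.

```python
import math

def ciou_loss(box, box_gt, eps=1e-9):
    """CIoU loss for two boxes given as (x1, y1, x2, y2) corner coordinates."""
    x1, y1, x2, y2 = box
    gx1, gy1, gx2, gy2 = box_gt
    w, h = x2 - x1, y2 - y1
    wg, hg = gx2 - gx1, gy2 - gy1

    # Intersection over union.
    iw = max(0.0, min(x2, gx2) - max(x1, gx1))
    ih = max(0.0, min(y2, gy2) - max(y1, gy1))
    inter = iw * ih
    union = w * h + wg * hg - inter
    iou = inter / (union + eps)

    # Squared center distance rho^2 and squared enclosing-box diagonal c^2.
    rho2 = ((x1 + x2) / 2 - (gx1 + gx2) / 2) ** 2 + ((y1 + y2) / 2 - (gy1 + gy2) / 2) ** 2
    cw = max(x2, gx2) - min(x1, gx1)
    ch = max(y2, gy2) - min(y1, gy1)
    c2 = cw ** 2 + ch ** 2 + eps

    # Aspect-ratio term v and its trade-off weight alpha.
    v = (4 / math.pi ** 2) * (math.atan(wg / (hg + eps)) - math.atan(w / (h + eps))) ** 2
    alpha = v / ((1 - iou) + v + eps)
    return 1 - iou + rho2 / c2 + alpha * v
```

For two identical boxes the loss is essentially zero, while for two disjoint boxes of the same shape the IoU and aspect-ratio terms vanish and only the normalized center-distance penalty is added to 1.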

Despite its efficiency, YOLOv5s has two major drawbacks when applied to small UAV detection. First, small UAV targets occupy only a few pixels in the feature maps, and this information is progressively weakened by successive downsampling operations. Second, complex backgrounds (e.g., clouds, buildings, trees, power lines) introduce clutter that can mislead the detector. As a result, the original YOLOv5s reaches a precision of only 83.5% and an mAP@0.5 of 81.0% on our VisDrone2021 subset, which is insufficient for real security scenarios where false alarms and missed detections are costly.

CBAM Attention Mechanism

The Convolutional Block Attention Module (CBAM) is a lightweight attention module that sequentially infers a channel attention map and a spatial attention map, enabling the network to focus on “what” is important and “where” it is located. It can be integrated into most convolutional neural networks with few additional parameters.

Channel Attention Sub-module

The channel attention component addresses the question “which feature channels are most relevant for detecting UAV targets?” Given an intermediate feature map $\mathbf{F} \in \mathbb{R}^{C \times H \times W}$, the module first aggregates spatial information using both average pooling and max pooling to generate two context descriptors, each a vector of length $C$:

$$
\mathbf{F}_{\text{avg}}^c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} \mathbf{F}_{:, i, j}, \quad \mathbf{F}_{\text{max}}^c = \max_{i,j} \mathbf{F}_{:, i, j}.
$$

These descriptors are then fed into a shared multi-layer perceptron (MLP) with one hidden layer. The output features are combined via element-wise summation and passed through a sigmoid activation to produce the channel attention weight:

$$
\mathbf{M}_c(\mathbf{F}) = \sigma \left( \text{MLP}(\mathbf{F}_{\text{avg}}^c) + \text{MLP}(\mathbf{F}_{\text{max}}^c) \right).
$$

The channel-refined feature map is obtained by element-wise multiplication, with the $C$ attention weights broadcast over the spatial dimensions: $\mathbf{F}' = \mathbf{M}_c(\mathbf{F}) \odot \mathbf{F}$.
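The channel attention equations can be traced with a small reference sketch in plain Python, written without a deep learning framework so that each step stays visible. It assumes the feature map is a nested list of shape $C \times H \times W$, that the shared MLP has no bias terms, and that the hidden layer uses a ReLU activation; the names `shared_mlp`, `w0`, and `w1` are hypothetical and chosen only for this illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def shared_mlp(vec, w0, w1):
    """Two-layer MLP without biases: C -> C/r (ReLU) -> C."""
    hidden = [max(0.0, sum(w * x for w, x in zip(row, vec))) for row in w0]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w1]

def channel_attention(feat, w0, w1):
    """feat is a C x H x W nested list; returns (M_c, F') as in the equations."""
    # Spatial average and max pooling give one descriptor value per channel.
    avg = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in feat]
    mx = [max(map(max, ch)) for ch in feat]
    # The same MLP weights are applied to both descriptors.
    a, m = shared_mlp(avg, w0, w1), shared_mlp(mx, w0, w1)
    weights = [sigmoid(ai + mi) for ai, mi in zip(a, m)]
    # Broadcast each channel weight over its H x W map.
    refined = [[[wt * v for v in row] for row in ch] for wt, ch in zip(weights, feat)]
    return weights, refined

# Toy example: C = 2 channels, one hidden unit (hypothetical weights).
feat = [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0], [0.0, 0.0]]]
w0 = [[1.0, 0.0]]      # shape (C/r) x C
w1 = [[1.0], [0.0]]    # shape C x (C/r)
weights, refined = channel_attention(feat, w0, w1)
```

In this toy case the first channel, which carries all the activation, receives a weight close to 1, while the empty second channel receives the neutral sigmoid value of 0.5.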

Spatial Attention Sub-module

Spatial attention focuses on “where” the UAV target is located. It takes the output of the channel attention module, $\mathbf{F}'$, and generates a spatial weight map. First, average pooling and max pooling are applied along the channel axis to obtain two 2D maps:

$$
\mathbf{F}_{\text{avg}}^s = \frac{1}{C} \sum_{k=1}^{C} \mathbf{F}'_{k, :, :}, \quad \mathbf{F}_{\text{max}}^s = \max_{k} \mathbf{F}'_{k, :, :}.
$$

These two maps are concatenated along the channel dimension and convolved with a $7 \times 7$ kernel to produce a 1-channel spatial attention map:

$$
\mathbf{M}_s(\mathbf{F}') = \sigma \left( f^{7 \times 7}([\mathbf{F}_{\text{avg}}^s; \mathbf{F}_{\text{max}}^s]) \right).
$$

The final output is $\mathbf{F}'' = \mathbf{M}_s(\mathbf{F}') \odot \mathbf{F}'$.
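The spatial attention step can be sketched in the same framework-free style. The sketch below assumes zero padding so that the attention map keeps the input resolution, a convolution without bias, and a kernel stored as a $2 \times k \times k$ nested list (one slice for the average-pooled map, one for the max-pooled map); the function name `spatial_attention` is our own.

```python
import math

def spatial_attention(feat, kernel, k=7):
    """feat: C x H x W nested list; kernel: 2 x k x k weights. Returns (M_s, refined)."""
    C, H, W = len(feat), len(feat[0]), len(feat[0][0])
    pad = k // 2
    # Channel-wise average and max pooling give two H x W maps.
    avg = [[sum(feat[c][i][j] for c in range(C)) / C for j in range(W)] for i in range(H)]
    mx = [[max(feat[c][i][j] for c in range(C)) for j in range(W)] for i in range(H)]
    pooled = [avg, mx]

    def conv_at(i, j):
        # Zero-padded k x k convolution (cross-correlation) over the two pooled maps.
        total = 0.0
        for p in range(2):
            for di in range(k):
                for dj in range(k):
                    y, x = i + di - pad, j + dj - pad
                    if 0 <= y < H and 0 <= x < W:
                        total += kernel[p][di][dj] * pooled[p][y][x]
        return total

    attn = [[1.0 / (1.0 + math.exp(-conv_at(i, j))) for j in range(W)] for i in range(H)]
    # Broadcast the single-channel attention map over all C channels.
    refined = [[[attn[i][j] * feat[c][i][j] for j in range(W)] for i in range(H)]
               for c in range(C)]
    return attn, refined
```

With an all-zero kernel every position receives the neutral weight 0.5, which is a convenient sanity check; a trained kernel would instead raise the weight at positions whose pooled responses indicate a target.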

Proposed YOLOv5s-CBAM Model

We integrate CBAM into the backbone of YOLOv5s by appending a CBAM block after each C3 module. The original YOLOv5s backbone consists of Conv-BN-SiLU layers alternating with C3 modules, followed by an SPPF layer. With four C3 stages, the resulting backbone can be denoted as:

Input → Conv → C3 → CBAM → Conv → C3 → CBAM → Conv → C3 → CBAM → Conv → C3 → CBAM → SPPF.

The following table summarizes the architectural differences between the original YOLOv5s backbone and our YOLOv5s-CBAM backbone.

**Comparison of Backbone Architectures**

| Layer Index | Original YOLOv5s Backbone | YOLOv5s-CBAM Backbone |
| --- | --- | --- |
| 1 | Focus / Conv (3→64, 6×6, s2) | Focus / Conv (3→64, 6×6, s2) |
| 2 | C3 (64→64, n=1) | C3 (64→64, n=1) + CBAM |
| 3 | Conv (64→128, 3×3, s2) | Conv (64→128, 3×3, s2) |
| 4 | C3 (128→128, n=3) | C3 (128→128, n=3) + CBAM |
| 5 | Conv (128→256, 3×3, s2) | Conv (128→256, 3×3, s2) |
| 6 | C3 (256→256, n=3) | C3 (256→256, n=3) + CBAM |
| 7 | Conv (256→512, 3×3, s2) | Conv (256→512, 3×3, s2) |
| 8 | C3 (512→512, n=1) | C3 (512→512, n=1) + CBAM |
| 9 | SPPF (512→512) | SPPF (512→512) |
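The modification summarized in the table amounts to a simple rewrite rule over the layer sequence: every C3 stage is followed by a CBAM block and all other layers are left unchanged. A minimal sketch of that rule is given below; it works on layer names only, and the helper `insert_cbam` is a hypothetical illustration rather than part of any YOLOv5 configuration format.

```python
def insert_cbam(layers):
    """Return a new layer list with a CBAM block appended after every C3 stage."""
    out = []
    for name in layers:
        out.append(name)
        if name == "C3":
            out.append("CBAM")
    return out

# Layer order of the original backbone, as listed in the table.
ORIGINAL_BACKBONE = ["Conv", "C3", "Conv", "C3", "Conv", "C3", "Conv", "C3", "SPPF"]
cbam_backbone = insert_cbam(ORIGINAL_BACKBONE)
print(" -> ".join(cbam_backbone))
```

Because CBAM preserves the shape of its input, the channel widths of the subsequent Conv and SPPF layers do not need to change.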

Importantly, each CBAM block adds only a small number of parameters. For a feature map with $C$ channels, the shared MLP in channel attention has $C/r$ hidden units (we use reduction ratio $r=16$), and the spatial attention convolution uses a $7 \times 7$ kernel with 2 input channels (the concatenated average- and max-pooled maps) and 1 output channel. Ignoring bias terms, the number of additional parameters per CBAM block is approximately $2 \times C \times (C/r) + 7 \times 7 \times 2$, which is about 33K for $C = 512$ and negligible compared to the overall model size. Therefore, the YOLOv5s-CBAM model retains the lightweight character of the original YOLOv5s while gaining improved feature focusing capability for small UAV detection.
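To make the parameter overhead concrete, the short sketch below counts the weights of one CBAM block and sums them over the four backbone stages with 64, 128, 256, and 512 channels. It assumes that bias terms are omitted in both the shared MLP and the spatial convolution, so the count covers only the two MLP weight matrices and the $7 \times 7 \times 2$ kernel; the function name `cbam_params` is our own.

```python
def cbam_params(channels, reduction=16, kernel=7):
    """Weight count of one CBAM block, assuming no bias terms."""
    hidden = channels // reduction
    mlp = 2 * channels * hidden      # C -> C/r and C/r -> C weight matrices
    spatial = kernel * kernel * 2    # 2 pooled input maps -> 1 attention map
    return mlp + spatial

stage_channels = [64, 128, 256, 512]
per_stage = {c: cbam_params(c) for c in stage_channels}
total = sum(per_stage.values())
print(per_stage, total)
```

Under these assumptions the four blocks together add roughly 44K weights, dominated by the 512-channel stage. Relative to the roughly 7 million parameters commonly reported for YOLOv5s, this is well under 1%.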

Experimental Setup

We conducted all experiments on the VisDrone2021 dataset, which contains over 10,000 images of UAV targets captured from various altitudes, angles, and backgrounds. To align with our specific task, we selected 1,000 images that predominantly feature small UAV instances in complex scenes (e.g., urban environments, low-altitude sky with clouds, and cluttered backgrounds). All images were resized to $640 \times 640$ pixels. The dataset was split into training (80%) and testing (20%) sets; within the training set, 75% was used for actual training and 25% for validation, giving 600 training, 200 validation, and 200 test images.
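The nested split described above can be checked with a few lines of arithmetic. The sketch below assumes the fractions are applied in the stated order (first train versus test, then training versus validation within the training portion) and rounds to whole images; the function name `split_counts` is hypothetical.

```python
def split_counts(n_images, train_frac=0.8, fit_frac=0.75):
    """Return (train, val, test) image counts for the two-level split."""
    n_trainval = round(n_images * train_frac)   # 80% kept for training + validation
    n_test = n_images - n_trainval              # remaining 20% held out for testing
    n_train = round(n_trainval * fit_frac)      # 75% of the training portion
    n_val = n_trainval - n_train                # remaining 25% for validation
    return n_train, n_val, n_test

n_train, n_val, n_test = split_counts(1000)
print(n_train, n_val, n_test)
```

For the 1,000 selected images this corresponds to an effective 60/20/20 split.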

The training hyperparameters are summarized in the table below.

**Training Hyperparameters**

| Parameter | Value |
| --- | --- |
| Optimizer | SGD |
| Momentum | 0.937 |
| Weight decay | 0.0005 |
| Initial learning rate | 0.01 |
| Final learning rate (cosine annealing) | 0.0001 |
| Batch size | 16 |
| Number of epochs | 50 |
| Input size | 640 × 640 |
| Data augmentation | Random horizontal flip, mosaic, mixup, HSV jitter |
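For reference, the learning-rate schedule implied by the table can be sketched as follows. The sketch assumes the standard cosine annealing form, decaying from the initial rate 0.01 to the final rate 0.0001 over the 50 epochs, and it omits any warm-up phase; the exact schedule used by a given training framework may differ in these details.

```python
import math

def cosine_lr(epoch, total_epochs=50, lr0=0.01, lr_final=0.0001):
    """Cosine-annealed learning rate at a given epoch (no warm-up)."""
    progress = epoch / total_epochs
    return lr_final + 0.5 * (lr0 - lr_final) * (1 + math.cos(math.pi * progress))

# Learning rate at the start, midpoint, and end of training.
schedule = [cosine_lr(e) for e in (0, 25, 50)]
print(schedule)
```

The rate starts at 0.01, passes through the midpoint of the two endpoint values halfway through training, and reaches 0.0001 at the last epoch.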

Our hardware platform consisted of an Intel i5-12400F CPU, 32 GB RAM, and an NVIDIA RTX 3060 GPU with 12 GB VRAM. The software environment included Python 3.11, PyTorch 1.13, and PyCharm 2025. This is a commodity configuration available to many research labs and industrial teams, which should make our results easier to reproduce.

Results and Analysis

We evaluated both the original YOLOv5s and the proposed YOLOv5s-CBAM on the test set using standard detection metrics: Precision (P), Recall (R), and mAP@0.5 (mean Average Precision at IoU threshold 0.5). The results are presented in the following table.

**Detection Performance Comparison on the VisDrone2021 Subset**

| Model | Precision (%) | Recall (%) | mAP@0.5 (%) |
| --- | --- | --- | --- |
| YOLOv5s | 83.5 | 80.1 | 81.0 |
| YOLOv5s-CBAM (ours) | 87.6 | 80.2 | 83.7 |

As shown, the YOLOv5s-CBAM model achieves a Precision of 87.6%, which is 4.1 percentage points higher than the baseline. The Recall remains almost unchanged (80.2% vs. 80.1%), indicating that the attention module mainly reduces false positives rather than recovering more true positives. The mAP@0.5 improves from 81.0% to 83.7%, an absolute gain of 2.7 percentage points (about 3.3% relative), indicating better overall detection quality. These improvements are particularly meaningful for low-altitude security applications where false alarms can disrupt operations and erode trust in automated systems.
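The absolute and relative gains quoted above follow directly from the reported metrics. The short sketch below recomputes them from the two rows of the results table; the helper name `gains` is our own and the rounding to one decimal place is a presentation choice.

```python
baseline = {"precision": 83.5, "recall": 80.1, "mAP@0.5": 81.0}
cbam = {"precision": 87.6, "recall": 80.2, "mAP@0.5": 83.7}

def gains(before, after):
    """Return {metric: (absolute gain in points, relative gain in percent)}."""
    out = {}
    for key in before:
        absolute = after[key] - before[key]
        relative = 100.0 * absolute / before[key]
        out[key] = (round(absolute, 1), round(relative, 1))
    return out

result = gains(baseline, cbam)
print(result)
```

Keeping the two kinds of gain separate avoids ambiguity: the mAP@0.5 improvement is 2.7 percentage points in absolute terms and about 3.3% relative to the baseline.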

To further analyze the benefits, we examined the precision-recall (PR) curves. The area under the PR curve for YOLOv5s-CBAM is larger than that of the baseline, with the gap most visible in the high-recall regime. This suggests that the CBAM-enhanced model maintains higher precision even when the confidence threshold is lowered to find more UAV targets, which is crucial in safety-critical scenarios.
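To clarify how the area under a PR curve is obtained, the following sketch computes average precision from a ranked list of detections. It assumes each detection has already been matched against the ground truth at IoU 0.5 and flagged as a true or false positive, and it uses all-point interpolation; evaluation toolkits differ in the interpolation scheme, so values may not match a particular toolkit exactly. The input data in the usage example are hypothetical.

```python
def average_precision(scores, is_true_positive, n_ground_truth):
    """Area under the precision-recall curve using all-point interpolation."""
    # Rank detections by descending confidence.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    recalls, precisions = [], []
    for i in order:
        if is_true_positive[i]:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / n_ground_truth)
        precisions.append(tp / (tp + fp))
    # Replace each precision by the maximum precision at any higher recall.
    for k in range(len(precisions) - 2, -1, -1):
        precisions[k] = max(precisions[k], precisions[k + 1])
    # Sum rectangle areas between consecutive recall values.
    ap, prev_recall = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap

# Hypothetical example: four detections, four ground-truth targets.
ap = average_precision([0.9, 0.8, 0.7, 0.6], [True, False, True, True], 4)
print(ap)
```

Removing a high-confidence false positive from such a list raises precision at every later recall level, which is how a reduction in false alarms translates into a larger area under the curve.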

We also measured inference time on the RTX 3060 GPU with batch size 1. The original YOLOv5s processes a single 640×640 image in approximately 3.2 ms, while our YOLOv5s-CBAM takes about 3.5 ms, an increase of only 0.3 ms (≈9%). At roughly 285 images per second, this remains well within the budget of real-time systems operating at 30 FPS or higher. The overhead arises from the additional pooling, MLP, and convolution operations in the CBAM blocks.
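The latency figures can be converted into throughput with simple arithmetic. The sketch below assumes that the reported per-image times cover model inference only, so pre-processing and post-processing such as non-maximum suppression would lower the end-to-end frame rate; the helper name `latency_summary` is our own.

```python
def latency_summary(base_ms, new_ms):
    """Overhead and equivalent frame rates for two per-image latencies."""
    overhead_ms = new_ms - base_ms
    return {
        "overhead_ms": round(overhead_ms, 2),
        "overhead_pct": round(100.0 * overhead_ms / base_ms, 1),
        "base_fps": round(1000.0 / base_ms, 1),
        "new_fps": round(1000.0 / new_ms, 1),
    }

summary = latency_summary(3.2, 3.5)
print(summary)
```

Both models therefore run at several hundred frames per second on this GPU, roughly an order of magnitude above a 30 FPS real-time requirement.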

Ablation study. We did not run a separate ablation experiment, so we provide only a brief qualitative analysis. If the spatial attention were removed and only channel attention kept, Precision would likely still improve, but the model would lack explicit spatial focusing and could still struggle with localization. Conversely, using only spatial attention would help localization but would not reweight feature channels by their discriminability. The sequential design of CBAM addresses both aspects. Our results show that the combined module outperforms the baseline; confirming the contribution of each sub-module would require a dedicated ablation experiment.

Application Value and Real-world Significance

The proposed YOLOv5s-CBAM model is highly suitable for deployment in low-altitude security systems. Typical use cases include:

Airport perimeter monitoring: Detecting unauthorized UAV intrusions into no-fly zones.
Critical infrastructure protection: Identifying small UAVs approaching power plants, government buildings, or military bases.
Large public event security: Real-time surveillance of stadiums or open-air venues to prevent UAV-based threats.
Border and sensitive area surveillance: Monitoring for covert UAV activity across borders.

In all these scenarios, the detector must operate in real time on cost-effective edge devices. YOLOv5s-CBAM inherits the lightweight architecture of YOLOv5s while requiring only a slight increase in computation. Therefore, it can be deployed on embedded platforms such as NVIDIA Jetson series, Raspberry Pi with neural accelerators, or custom ASIC-based systems. The improved precision directly reduces the number of false alarms, which is vital for operator trust and operational efficiency.

From an industry perspective, our work aligns with the growing trend of “small but smart” model optimization. Rather than pursuing brute-force scaling of network depth and width, we show that targeted insertion of attention mechanisms can yield measurable gains for specific challenging tasks such as small UAV detection. This approach is more accessible for organizations with limited computational budgets and can accelerate the adoption of AI-based visual surveillance in low-altitude security.

Conclusion and Future Work

In this study, we addressed the challenging problem of small UAV detection in complex backgrounds by integrating the CBAM attention mechanism into the YOLOv5s backbone. The proposed YOLOv5s-CBAM model enhances both channel-wise and spatial-wise feature selection, allowing the network to focus on discriminative cues relevant to the small UAV target while suppressing background noise. Experimental results on a subset of the VisDrone2021 dataset show that our method improves Precision from 83.5% to 87.6% and mAP@0.5 from 81.0% to 83.7%, with only about 0.3 ms of additional inference time per image. These improvements are practically significant for real-time low-altitude security applications where false alarms must be minimized.

We acknowledge that our current experiments are limited to a subset of the VisDrone2021 dataset. Future work could expand the evaluation to larger and more diverse datasets, including scenes with adverse weather conditions (rain, fog, strong sunlight) and occluded UAV targets. Additionally, we plan to investigate other attention mechanisms such as Coordinate Attention or Efficient Channel Attention, and possibly combine them with lightweight transformers for further refinement. Another direction is to incorporate multi-modal inputs (e.g., infrared or radar) into the detection framework to improve robustness under all-weather conditions. Finally, we aim to deploy the YOLOv5s-CBAM model on actual edge devices and validate its performance in field trials, bridging the gap between laboratory research and operational deployment.

In conclusion, our work provides a practical and effective solution for small UAV detection, contributing to the safe integration of UAVs into low-altitude airspace and enhancing the security of critical assets against unauthorized aerial intrusions.