In recent years, the rapid advancement of unmanned aerial vehicle (UAV) technology has revolutionized fields including military reconnaissance, environmental monitoring, traffic management, and rescue operations, particularly in China, where UAV deployments for surveillance and data collection are increasingly prevalent. However, UAV-captured images present significant challenges for target detection algorithms due to complex backgrounds, wide viewing angles, object occlusion, dense small targets, and indistinct features, further exacerbated by factors such as target similarity and varying lighting conditions. These issues lead to suboptimal detection accuracy in UAV scenarios and motivate specialized algorithms tailored to aerial photography. This paper addresses these challenges by proposing CSP-YOLO (Cross Stage Partial-You Only Look Once), an improved YOLOv8n-based algorithm designed for small target detection in UAV aerial imagery. Our approach integrates several enhancements to boost precision and robustness: the Space-to-Depth Convolution (SPD-Conv) module, a novel C-SimAM (Convolution Simple Attention Module) attention mechanism, a tiny prediction head for dense small targets, and the Wise Intersection over Union (WIoU) loss function. We evaluate CSP-YOLO on the VisDrone2019 dataset, demonstrating superior performance compared to existing methods while maintaining a balance between accuracy and model size. This work contributes to the growing body of research on UAV-based applications, with implications for UAV systems used in China for real-world tasks such as urban monitoring and disaster response.
The field of object detection has seen remarkable progress with deep learning, broadly categorized into two-stage and single-stage algorithms. Two-stage methods, such as R-CNN and Faster R-CNN, generate region proposals before classification, offering high accuracy at the cost of computational complexity that can hinder real-time processing onboard UAVs. In contrast, single-stage algorithms, such as SSD, RetinaNet, and the YOLO series, perform localization and classification directly on the input image, providing a better speed-accuracy trade-off for UAV applications where real-time feedback is crucial. Among these, YOLO variants have gained popularity for their efficiency and effectiveness; YOLOv8n, a lightweight version of YOLOv8, is often employed in resource-constrained environments such as UAV platforms. Despite its advantages, however, YOLOv8n struggles with small target detection in aerial imagery because of feature loss during downsampling, poor recognition of low-resolution features, and imbalanced sample distributions. Previous attempts to improve YOLO for UAV scenarios include incorporating attention mechanisms, modifying neck structures, and adopting advanced loss functions, but these often increase model complexity or fail to adequately address tiny objects. Our CSP-YOLO algorithm builds upon YOLOv8n with targeted modifications that enhance small target detection without substantially inflating the parameter count, making it well suited to UAV deployments in diverse scenarios.

To tackle feature loss in small targets, we design the Space-to-Depth Convolution (SPD-Conv) module, which replaces strided convolutions in the backbone network. Traditional convolutions with strides discard fine-grained information, which is especially detrimental for small objects in UAV imagery. The SPD-Conv module combines a space-to-depth layer and a non-strided convolution layer to preserve spatial detail. Given an input feature map of size \(S \times S \times C_1\), where \(S\) is the spatial dimension and \(C_1\) is the number of channels, the space-to-depth layer partitions the map into four sub-feature maps by sampling every other pixel along each spatial dimension, i.e., \(X_{ij} = X[i::2,\, j::2]\) for \(i, j \in \{0, 1\}\). These sub-maps are concatenated along the channel dimension, yielding a feature map of size \(\frac{S}{2} \times \frac{S}{2} \times 4C_1\), which is then passed through a stride-1 convolution to produce an output of size \(\frac{S}{2} \times \frac{S}{2} \times C_2\), retaining spatial information while learning richer features. In summary, for an input \(X\), the SPD operation yields \(X' = \text{Concat}(X_{00}, X_{01}, X_{10}, X_{11})\), to which a convolution is then applied. This minimizes information loss, which is crucial for detecting small targets in UAV footage where objects may occupy only a few pixels.
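To make the operation concrete, the following is a minimal PyTorch sketch of an SPD-Conv block; the 3×3 kernel, batch normalization, and SiLU activation after the space-to-depth step are assumptions based on typical YOLOv8 convolution blocks rather than details stated above.

```python
import torch
import torch.nn as nn

class SPDConv(nn.Module):
    """Space-to-depth rearrangement followed by a non-strided convolution."""

    def __init__(self, c1: int, c2: int):
        super().__init__()
        # The space-to-depth step multiplies the channel count by 4.
        self.conv = nn.Sequential(
            nn.Conv2d(4 * c1, c2, kernel_size=3, stride=1, padding=1, bias=False),
            nn.BatchNorm2d(c2),
            nn.SiLU(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Sample every other pixel along each spatial dimension to obtain the
        # four sub-maps X_00, X_01, X_10, X_11, then stack them along channels:
        # (B, C1, S, S) -> (B, 4*C1, S/2, S/2). No pixel is discarded.
        x = torch.cat(
            [x[..., 0::2, 0::2], x[..., 0::2, 1::2],
             x[..., 1::2, 0::2], x[..., 1::2, 1::2]], dim=1)
        return self.conv(x)

# Example: a 160x160 map with 64 channels becomes an 80x80 map with 128 channels.
# out = SPDConv(64, 128)(torch.randn(1, 64, 160, 160))  # -> (1, 128, 80, 80)
```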
Another key improvement is the C-SimAM attention mechanism, an enhanced version of the SimAM module, designed to boost low-resolution feature recognition in complex backgrounds. SimAM computes 3D attention weights by optimizing an energy function to identify neuron importance. For an input feature map \(X \in \mathbb{R}^{H \times W \times C}\), the energy \(e_t\) for a neuron \(t\) is defined as:
$$
e_t = \frac{4(\sigma^2 + \lambda)}{(t - \mu)^2 + 2\sigma^2 + 2\lambda}
$$
where \(\mu = \frac{1}{M}\sum_{i=1}^{M} x_i\) and \(\sigma^2 = \frac{1}{M}\sum_{i=1}^{M} (x_i - \mu)^2\) are the mean and variance of all neurons in a channel, \(M = H \times W\), and \(\lambda\) is a regularization coefficient. The attention output is \(X' = \text{Sigmoid}\left(\frac{1}{E}\right) \odot X\), with \(E\) aggregating all \(e_t\) values. However, SimAM may overlook weak features from small targets. To address this, we propose C-SimAM by incorporating a residual connection and a \(1 \times 1\) convolution:
$$
Y = X + \text{Conv}_{1\times1}(X’)
$$
where \(Y\) is the output feature map. This residual enhancement reinforces salient features while retaining subtle details, improving detection accuracy for occluded or distant objects in UAV images. We integrate C-SimAM into the backbone and neck networks after feature concatenation, enabling better multi-scale fusion.
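A minimal PyTorch sketch of C-SimAM under the formulation above is given below; the regularization value \(\lambda = 10^{-4}\) follows the SimAM default and is an assumption, not a value stated in the text.

```python
import torch
import torch.nn as nn

class CSimAM(nn.Module):
    """SimAM attention with a residual 1x1-convolution branch (C-SimAM)."""

    def __init__(self, channels: int, lam: float = 1e-4):
        super().__init__()
        self.lam = lam                      # regularization coefficient lambda
        self.conv1x1 = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Per-channel mean and variance over the M = H*W neurons.
        mu = x.mean(dim=(2, 3), keepdim=True)
        d = (x - mu).pow(2)
        var = d.mean(dim=(2, 3), keepdim=True)
        # Inverse energy, obtained by rearranging the formula above:
        # 1/e_t = (t - mu)^2 / (4 * (var + lam)) + 0.5
        inv_energy = d / (4 * (var + self.lam)) + 0.5
        attended = x * torch.sigmoid(inv_energy)      # X' = Sigmoid(1/E) ⊙ X
        # Residual enhancement: Y = X + Conv_1x1(X')
        return x + self.conv1x1(attended)

# Example: refine a fused neck feature map with 256 channels.
# y = CSimAM(256)(torch.randn(1, 256, 40, 40))  # output has the same shape
```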
For dense small target detection, we redesign the neck structure by adding a tiny prediction head. YOLOv8n originally uses three detection heads for large, medium, and small targets (e.g., \(80 \times 80\), \(40 \times 40\), and \(20 \times 20\) feature maps). To better capture objects smaller than \(4 \times 4\) pixels, which are common in UAV aerial photography, we introduce an additional head operating on a \(160 \times 160\) feature map. This higher-resolution head extracts fine-grained information from shallow feature maps, increasing coverage and accuracy for tiny targets. The modification sharpens focus on dense clusters, such as groups of vehicles or pedestrians in UAV surveillance scenes, without significantly compromising speed.
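To make the resolution trade-off explicit, the short sketch below lists the grid size of each head for the \(640 \times 640\) training input; the strides are the standard YOLOv8 values plus the added stride-4 level.

```python
# Prediction-head grids for a 640x640 input. The added stride-4 head yields a
# 160x160 grid whose cells cover roughly 4x4 pixels, so very small targets are
# no longer forced to share a grid cell with their neighbours.
INPUT_SIZE = 640
HEAD_STRIDES = [4, 8, 16, 32]   # stride 4 is the added tiny-target head

for stride in HEAD_STRIDES:
    grid = INPUT_SIZE // stride
    print(f"stride {stride:2d}: {grid}x{grid} grid, ~{stride}x{stride} px per cell")
```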
To handle imbalanced positive and negative samples, we replace the CIoU loss with Wise-IoU (WIoU) for bounding box regression. WIoU introduces a dynamic gradient allocation mechanism that reduces the impact of low-quality samples. The loss is defined as:
$$
L_{IoU} = 1 - IoU
$$
$$
R_{WIoU} = \exp\left(\frac{(x – x_{gt})^2 + (y – y_{gt})^2}{(W_g^2 + H_g^2)^*}\right)
$$
$$
L_{WIoUv1} = R_{WIoU} L_{IoU}
$$
where \((x, y)\) and \((x_{gt}, y_{gt})\) are the centers of the predicted and ground-truth boxes, and \(W_g\) and \(H_g\) are the width and height of the smallest enclosing box covering both boxes. The superscript \(*\) indicates that the normalization term is detached from the computation graph, so it scales the loss without generating gradients that would hinder convergence. For further refinement, we use \(L_{WIoUv3}\), which adds a non-monotonic focusing coefficient:
$$
\beta = \frac{L_{IoU}^*}{\overline{L_{IoU}}}, \quad r = \frac{\beta}{\delta \alpha^{\beta - \delta}}, \quad L_{WIoUv3} = r \, L_{WIoUv1}
$$
where \(\overline{L_{IoU}}\) is a running mean of \(L_{IoU}\), \(\beta\) measures the outlier degree of a prediction, \(r\) is the resulting gradient gain, and \(\alpha\) and \(\delta\) are hyperparameters set to 1.9 and 3.0, respectively. This loss accelerates convergence and improves localization precision, which is essential for accurate detection in UAV applications where target boundaries can be ambiguous.
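The following is a minimal sketch of the WIoU v3 computation defined above, assuming axis-aligned boxes in \((x_1, y_1, x_2, y_2)\) format; in practice the running mean \(\overline{L_{IoU}}\) would be maintained across batches (e.g., with momentum), which is omitted here for brevity.

```python
import torch

def wiou_v3_loss(pred, target, alpha=1.9, delta=3.0, iou_mean=None):
    """Wise-IoU v3 loss for (N, 4) boxes in (x1, y1, x2, y2) format."""
    eps = 1e-7
    # Plain IoU and L_IoU = 1 - IoU.
    xi1, yi1 = torch.max(pred[:, 0], target[:, 0]), torch.max(pred[:, 1], target[:, 1])
    xi2, yi2 = torch.min(pred[:, 2], target[:, 2]), torch.min(pred[:, 3], target[:, 3])
    inter = (xi2 - xi1).clamp(0) * (yi2 - yi1).clamp(0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    l_iou = 1 - inter / (area_p + area_t - inter + eps)

    # R_WIoU: centre distance normalised by the smallest enclosing box, detached (*).
    cxp, cyp = (pred[:, 0] + pred[:, 2]) / 2, (pred[:, 1] + pred[:, 3]) / 2
    cxt, cyt = (target[:, 0] + target[:, 2]) / 2, (target[:, 1] + target[:, 3]) / 2
    wg = torch.max(pred[:, 2], target[:, 2]) - torch.min(pred[:, 0], target[:, 0])
    hg = torch.max(pred[:, 3], target[:, 3]) - torch.min(pred[:, 1], target[:, 1])
    r_wiou = torch.exp(((cxp - cxt) ** 2 + (cyp - cyt) ** 2)
                       / (wg ** 2 + hg ** 2 + eps).detach())
    l_wiou_v1 = r_wiou * l_iou

    # Non-monotonic focusing: beta is the outlier degree, r the gradient gain.
    if iou_mean is None:                       # fall back to the batch mean
        iou_mean = l_iou.detach().mean()
    beta = l_iou.detach() / (iou_mean + eps)
    r = beta / (delta * alpha ** (beta - delta))
    return (r * l_wiou_v1).mean()
```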
We conduct extensive experiments on the VisDrone2019 dataset, a benchmark for UAV object detection containing 8,629 images annotated with ten classes, such as pedestrians, cars, and buses. Collected over various cities in China, the dataset reflects real-world challenges of UAV imagery, including varied weather conditions and urban clutter. Our experimental setup uses an NVIDIA RTX 3060 GPU with Ubuntu 22.04, Python 3.8, and CUDA 11.7. We train all models for 200 epochs with a batch size of 4, an input size of \(640 \times 640\), the SGD optimizer, an initial learning rate of 0.01, and a weight decay of 0.0005. Evaluation metrics include precision (P), recall (R), mean average precision at an IoU threshold of 0.5 (mAP0.5), mAP averaged over IoU thresholds from 0.5 to 0.95 (mAP0.5:0.95), parameter count (Params), and GFLOPs. Precision and recall are calculated as:
$$
P = \frac{TP}{TP + FP} \times 100\%, \quad R = \frac{TP}{TP + FN} \times 100\%
$$
where TP, FP, and FN denote true positives, false positives, and false negatives. The average precision (AP) and mAP are:
$$
AP = \int_0^1 P(R) \, dR, \quad mAP = \frac{1}{N} \sum_{i=1}^{N} AP_i \times 100\%
$$
with \(N\) as the number of classes. We perform ablation studies to validate each component of CSP-YOLO, as summarized in Table 1.
| No. | Method | P (%) | R (%) | mAP0.5 (%) | mAP0.5:0.95 (%) | Params (M) | GFLOPs (G) |
|---|---|---|---|---|---|---|---|
| 1 | Baseline (YOLOv8n) | 44.0 | 32.5 | 32.9 | 19.1 | 3.0 | 8.1 |
| 2 | + C-SimAM | 44.7 | 33.9 | 33.9 | 19.7 | 3.2 | 9.1 |
| 3 | + SPD-Conv | 47.6 | 35.8 | 36.4 | 21.5 | 3.4 | 12.5 |
| 4 | + Tiny Head | 48.4 | 36.9 | 37.6 | 22.5 | 3.1 | 13.6 |
| 5 | + WIoU | 47.5 | 38.3 | 38.6 | 23.1 | 3.1 | 15.7 |
| 6 | Combined (CSP-YOLO) | 50.2 | 39.5 | 40.5 | 24.3 | 3.3 | 17.1 |
Table 1 shows that CSP-YOLO achieves the best performance, with an mAP0.5 of 40.5% and an mAP0.5:0.95 of 24.3%, relative improvements of approximately 23.1% and 27.2% over YOLOv8n, respectively. The incremental gains from each module highlight their effectiveness: SPD-Conv reduces feature loss, C-SimAM enhances attention to weak features, the tiny head improves small target coverage, and WIoU optimizes regression. Notably, the parameter increase is modest, from 3.0M to 3.3M, keeping the model suitable for UAV platforms with limited computational resources.
We further compare CSP-YOLO with state-of-the-art methods on VisDrone2019, as detailed in Table 2. The algorithms include both single-stage and two-stage detectors, with metrics focusing on accuracy and efficiency.
| No. | Method | mAP0.5 (%) | mAP0.5:0.95 (%) | Params (M) | GFLOPs (G) |
|---|---|---|---|---|---|
| 1 | SSD | 23.7 | 12.9 | 24.50 | 87.90 |
| 2 | RetinaNet | 21.2 | 11.8 | 36.39 | 127.82 |
| 3 | Faster R-CNN | 34.1 | 18.6 | 43.80 | 187.00 |
| 4 | YOLOv3 | 38.3 | 23.1 | 61.50 | 154.90 |
| 5 | YOLOv5s | 32.5 | 17.6 | 7.20 | 16.00 |
| 6 | YOLOv6m | 31.7 | 21.7 | 34.20 | 82.00 |
| 7 | YOLOv7 | 39.6 | 22.6 | 37.20 | 104.80 |
| 8 | YOLOv7-tiny | 35.5 | 18.4 | 6.00 | 13.30 |
| 9 | YOLOv8n | 32.9 | 19.1 | 3.00 | 8.10 |
| 10 | YOLOv8s | 40.1 | 23.8 | 11.10 | 28.70 |
| 11 | Drone-YOLO | 31.0 | 17.5 | 3.05 | 10.70 |
| 12 | YOLOv10s | 38.0 | 22.8 | 8.00 | 24.50 |
| 13 | YOLOv11s | 38.6 | 23.0 | 9.40 | 21.30 |
| 14 | CSP-YOLO (Ours) | 40.5 | 24.3 | 3.30 | 17.10 |
CSP-YOLO outperforms most competitors, achieving the highest mAP0.5 among lightweight models and an mAP0.5:0.95 comparable to larger ones such as YOLOv8s. Its efficiency makes it well suited to UAV applications in China, such as traffic monitoring and border surveillance, where real-time processing and low resource consumption on edge devices are paramount. The improvements in CSP-YOLO directly address these needs by enhancing small target detection without excessive computational overhead.
Visual analysis confirms the superiority of CSP-YOLO in diverse UAV scenarios. We compare detection results on VisDrone2019 images under conditions including public facilities, strong lighting, occlusion, low light, and urban roads. In public facility scenes, YOLOv8n often misses distant small vehicles, while CSP-YOLO detects them thanks to the SPD-Conv module's feature retention. Under strong lighting, CSP-YOLO reduces false positives by leveraging C-SimAM to focus on relevant features. In occlusion scenarios, the tiny prediction head helps identify partially hidden objects, such as pedestrians behind obstacles. Low-light environments challenge many detectors, but CSP-YOLO detects targets as small as \(12 \times 25\) pixels, owing to the enhanced attention mechanism. On urban roads, CSP-YOLO performs well on dense vehicle clusters, a common sight in UAV footage of city centers. These visual outcomes underscore the algorithm's robustness and practicality for real-world UAV deployments.
The integration of these components yields a cohesive framework for small target detection. The backbone network incorporates SPD-Conv at its downsampling stages to preserve spatial detail. The neck network uses a PAN-FPN structure with C-SimAM modules after each feature concatenation to refine multi-scale features. The head network includes four prediction heads (\(160 \times 160\), \(80 \times 80\), \(40 \times 40\), \(20 \times 20\)) to handle targets of varying sizes. Training employs the WIoU loss for bounding box regression, with the total loss combining classification and regression terms. The overall architecture balances depth and width, ensuring efficiency for UAV systems that may operate in remote areas with limited connectivity.
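As a concrete illustration of the training objective, the sketch below composes the total loss; the gain values follow the Ultralytics YOLOv8 defaults (box 7.5, cls 0.5, DFL 1.5) and are assumptions, since the exact weights used for CSP-YOLO are not stated above.

```python
def total_loss(box_loss_wiou, cls_loss, dfl_loss,
               box_gain=7.5, cls_gain=0.5, dfl_gain=1.5):
    """Weighted sum of the regression (WIoU), classification, and DFL terms.

    The gains mirror YOLOv8's default loss weighting; only the box term is
    swapped from CIoU to the WIoU loss sketched earlier.
    """
    return box_gain * box_loss_wiou + cls_gain * cls_loss + dfl_gain * dfl_loss
```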
Beyond its technical contributions, this work has implications for the broader adoption of UAV technology in China. As UAV usage expands in agriculture, disaster management, and security, accurate small target detection becomes critical. For example, in agricultural monitoring, drones need to spot pest infestations or crop diseases early, which appear as small anomalies in images; CSP-YOLO's ability to detect such targets can improve yield predictions and resource allocation. Similarly, in search-and-rescue missions, drones must locate individuals in cluttered environments, where small targets such as human figures are easily overlooked. Our algorithm's enhancements directly address these challenges, offering a reliable tool for UAV operators in China.
Future work could explore further optimizations of CSP-YOLO. One direction is to incorporate adaptive mechanisms for dynamic environments, such as the varying altitudes and weather conditions encountered in UAV flights. Another is to reduce inference time through quantization or pruning, making the algorithm even more suitable for edge deployment. Additionally, evaluation on other UAV datasets, such as UAVDT or DOTA, could validate generalization across different geographic regions and use cases. We also plan to investigate fusion with other sensor modalities, such as LiDAR or thermal imaging, to enhance detection in low-visibility scenarios. These efforts aim to establish CSP-YOLO as a benchmark for UAV-based object detection, particularly in China, where drone technology is evolving rapidly.
In conclusion, we propose CSP-YOLO, an improved YOLOv8n-based algorithm for small target detection in UAV aerial photography. By integrating SPD-Conv, C-SimAM, a tiny prediction head, and the WIoU loss, CSP-YOLO achieves significant performance gains on the VisDrone2019 dataset, with an mAP0.5 of 40.5% and a balanced model size. The algorithm addresses key challenges in UAV imagery, namely feature loss, low-resolution recognition, dense small targets, and sample imbalance, making it highly applicable to UAV-based surveillance, monitoring, and related tasks. Our contributions advance the state of the art in real-time object detection and provide a practical solution for leveraging UAV technology across diverse fields. As China continues to invest in drone infrastructure, tools like CSP-YOLO will play a vital role in unlocking the full potential of aerial data analysis.
