With the rapid development of the low-altitude economy, UAV (unmanned aerial vehicle) technology has become a key enabler for smart-city environmental monitoring. Detecting road garbage from a UAV perspective presents distinct challenges: targets are typically small, backgrounds are complex (shadows, tree occlusion, and varying illumination), and real-time processing is critical for practical deployment. Traditional object detection algorithms often struggle to balance accuracy and speed under such conditions. To address this, we propose an improved model named GDMA-YOLO (Ghost-SPDConv-based Multi-scale Attention YOLO), designed specifically for UAV-based road inspection. Our method integrates lightweight feature extraction with multi-scale attention enhancement to improve the trade-off between detection precision and computational efficiency. Experiments show that GDMA-YOLO outperforms the baseline YOLOv8n model by 7.57 percentage points in mean average precision (mAP@50) while keeping a compact model of only 3.01M parameters. This work provides an efficient and reliable technical solution for UAV-based urban environmental monitoring, supporting smart-city garbage management and automated inspection.
1. Introduction
Rapid urbanization and increasing waste production in China call for intelligent monitoring systems for road garbage. According to official statistics, the amount of municipal solid waste in major Chinese cities exceeded 235 million tons in 2020. Current research on automatic garbage recognition mainly focuses on beach litter, where targets are relatively concentrated. Urban road garbage, by contrast, is more dispersed because of complex road structures and obstacles, which makes automated detection significantly more challenging. UAVs, as core carriers of the low-altitude economy, offer an ideal platform for aerial surveillance thanks to their flexibility, low cost, and high mobility. However, existing object detection algorithms designed for general scenarios often fail to meet the requirements of UAV-based garbage detection: small target sizes, severe background interference, and stringent real-time constraints on embedded hardware.
To overcome these challenges, we propose GDMA-YOLO, a lightweight yet accurate detection framework. Our contributions are threefold. First, we introduce a Ghost-SPDConv hybrid module that reduces computational redundancy by substituting standard convolutions with Ghost convolutions while retaining spatial detail via space-to-depth reordering. Second, we design an adaptive-weight Asymptotic Feature Pyramid Network (AFPN) that dynamically fuses multi-scale features, enhancing the model’s ability to capture small and occluded garbage targets. Third, we adopt an improved Complete IoU (CIoU) loss function to stabilize bounding box regression, especially for tiny objects. Additionally, we develop a complete distribution monitoring pipeline that converts pixel coordinates to geographic coordinates and generates kernel density estimation (KDE) heatmaps for garbage hotspot analysis. Experimental results on our self-collected dataset show that GDMA-YOLO achieves the best performance among the compared models for UAV road garbage detection, with a mAP@50 of 78.36% and a recall of 71.2%.
2. Related Work
Recent advances in UAV-based object detection have focused on lightweight architectures and multi-scale feature fusion. Ghost convolution, proposed by Han et al., generates redundant features through cheap linear transformations, reducing parameters while maintaining representation capacity. Cao et al. introduced GCL-YOLO, which integrates GhostConv into YOLO for UAV small object detection, significantly reducing computational overhead. For multi-scale modeling, Yang et al. proposed AFPN, which uses learnable weights to fuse features from different levels instead of fixed top-down connections. This dynamic fusion strategy is particularly beneficial for UAV scenarios, where target scales vary dramatically. Additionally, Ni et al. improved YOLOv8s by incorporating a multi-scale feature enhancement module in the backbone, achieving better detection of small objects in UAV images. Wang et al. presented SMFF-YOLO with scale-adaptive multi-level fusion, further improving scale robustness. Yan et al. introduced DMF-YOLO with dynamic multi-scale fusion, showing advantages over traditional methods. However, most existing works either focus on generic UAV datasets (e.g., VisDrone) or lack systematic optimization for road garbage detection, which involves extremely small targets and highly cluttered backgrounds. Our GDMA-YOLO addresses this gap by combining lightweight Ghost-SPDConv, adaptive AFPN, and an enhanced CIoU loss, tailored specifically for UAV road inspection.
3. Methodology
3.1 Dataset Construction
We construct a dedicated dataset for UAV road garbage detection. The data consists of two parts: (1) real-world images captured by a DJI Mavic 3 UAV over urban roads, parks, and squares under various lighting conditions (sunny, cloudy, dusk, shadows); (2) supplementary images from public garbage datasets to enhance diversity. After quality filtering and annotation, our dataset contains 2,083 images with 10,272 annotated objects across six categories: plastic bottle, plastic bag, cardboard box, metal can, cigarette butt, and mask. The object count per category is summarized in the following table.
| Category | Plastic Bottle | Plastic Bag | Cardboard Box | Metal Can | Cigarette Butt | Mask | Total |
|---|---|---|---|---|---|---|---|
| Count | 2,897 | 3,142 | 1,872 | 956 | 1,198 | 207 | 10,272 |
The dataset is randomly split into training (1,667 images), validation (208 images), and test (208 images) sets, an approximately 8:1:1 ratio. The bounding box statistics (width, height, aspect ratio, area) reflect typical characteristics of UAV imagery: a high proportion of small objects (area < 32×32 pixels) and elongated aspect ratios due to perspective projection. This diversity supports robust evaluation under real-world conditions.
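The split sizes above can be reproduced with a simple shuffled partition. The following is a minimal sketch, not the authors' script: the function name `split_dataset`, the seed, and the choice to round the validation and test shares and give the remainder to training are all assumptions made for illustration.

```python
import random


def split_dataset(image_ids, val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle image IDs and partition them into train/val/test lists.

    Validation and test sizes are rounded; training receives the remainder
    (an assumed convention that happens to match the counts in the text).
    """
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)  # seeded for reproducibility
    n_val = round(len(ids) * val_frac)
    n_test = round(len(ids) * test_frac)
    val = ids[:n_val]
    test = ids[n_val:n_val + n_test]
    train = ids[n_val + n_test:]
    return train, val, test


# 2,083 images, as reported for the dataset.
train, val, test = split_dataset(range(2083))
print(len(train), len(val), len(test))  # 1667 208 208
```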

3.2 GDMA-YOLO Architecture
The overall GDMA-YOLO architecture is built upon the YOLO framework (Ultralytics YOLOv8n as the baseline). It consists of three main components: a lightweight backbone enhanced with Ghost-SPDConv modules, a neck with adaptive-weight AFPN, and a detection head. Additionally, we replace the original CIoU loss with an improved version to better handle small objects. The following subsections detail each innovation.
3.2.1 Ghost-SPDConv Hybrid Module
To address the computational bottleneck in standard convolution while preserving spatial detail for small objects, we design the Ghost-SPDConv module. The Space-to-Depth Convolution (SPDConv) originally reorders spatial pixels into the depth dimension, converting a high-resolution feature map into a lower-resolution but higher-channel map without loss of information. However, the subsequent standard convolution in SPDConv incurs high parameter cost. Inspired by GhostNet, we replace the heavy convolution in the channel expansion stage with a two-step process: (1) generate a small set of intrinsic feature maps using a few standard convolutions; (2) produce additional ghost feature maps via cheap linear transformations (e.g., depthwise convolution or element-wise addition). The ghost feature maps replicate the redundant responses that would normally be produced by expensive convolutions. Mathematically, let the input feature map be $\mathbf{X} \in \mathbb{R}^{C \times H \times W}$. After SPD transformation, we obtain $\mathbf{X}' \in \mathbb{R}^{4C \times H/2 \times W/2}$. Then, we apply a standard convolution to generate $\mathbf{Y}_{\text{intrinsic}} \in \mathbb{R}^{C' \times H/2 \times W/2}$ with $C'$ channels. The remaining needed channels are obtained by cheap operations $\Phi$ on $\mathbf{Y}_{\text{intrinsic}}$:
$$
\mathbf{Y}_{\text{ghost}}^{(j)} = \Phi_j(\mathbf{Y}_{\text{intrinsic}}), \quad j=1,\dots,m
$$
where each $\Phi_j$ is a linear operation (e.g., 3×3 depthwise convolution). Finally, the outputs are concatenated: $\mathbf{Y} = [\mathbf{Y}_{\text{intrinsic}}, \mathbf{Y}_{\text{ghost}}^{(1)}, \dots, \mathbf{Y}_{\text{ghost}}^{(m)}]$. Because only the intrinsic maps require a full convolution, the parameter count and FLOPs drop to roughly $1/(m+1)$ of those of a standard convolution producing the same number of output channels (a saving of about $m/(m+1)$, neglecting the small cost of the cheap operations), while maintaining similar representational power. In GDMA-YOLO, we insert Ghost-SPDConv modules after the first few backbone stages to downsample and compress features efficiently. This design is particularly suitable for UAV deployments on edge devices with limited memory and power.
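To make the two ingredients concrete, the sketch below shows a space-to-depth rearrangement on nested lists and a parameter count comparing a standard convolution with a Ghost-style replacement. It is an illustration under stated assumptions rather than the module used in GDMA-YOLO: the channel ordering inside `space_to_depth`, the helper names, the neglect of bias terms, and the example channel sizes (64 input channels, 128 output channels) are all hypothetical.

```python
def space_to_depth(x, block=2):
    """Rearrange a C x H x W nested list into (C*block^2) x H/block x W/block.

    Every input value appears exactly once in the output, so no information
    is lost. The channel order chosen here is an assumption.
    """
    c, h, w = len(x), len(x[0]), len(x[0][0])
    assert h % block == 0 and w % block == 0, "H and W must be divisible by block"
    out = []
    for dy in range(block):
        for dx in range(block):
            for ch in range(c):
                out.append([[x[ch][i][j] for j in range(dx, w, block)]
                            for i in range(dy, h, block)])
    return out


def conv_params(c_in, c_out, k):
    """Weights of a standard k x k convolution (bias ignored)."""
    return c_in * c_out * k * k


def ghost_params(c_in, c_out, k, m, d=3):
    """Weights of a Ghost-style layer: one primary convolution producing
    c_out / (m + 1) intrinsic maps, plus m depthwise d x d cheap operations."""
    c_intrinsic = c_out // (m + 1)
    primary = conv_params(c_in, c_intrinsic, k)
    cheap = m * c_intrinsic * d * d  # one d x d filter per ghost map
    return primary + cheap


# One 4x4 channel becomes four 2x2 channels.
x = [[[r * 4 + c for c in range(4)] for r in range(4)]]
y = space_to_depth(x)

# Hypothetical sizes: 64 channels before SPD (256 after), 128 output channels.
standard = conv_params(256, 128, 3)
ghost = ghost_params(256, 128, 3, m=1)
ratio = ghost / standard  # close to 1 / (m + 1) = 0.5
```

With $m = 1$ the ratio comes out slightly above one half, the excess being the depthwise filters of the cheap operation.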
3.2.2 Adaptive-Weight AFPN
Traditional Feature Pyramid Networks (FPN) and PANet assume equal importance for features from different scales during fusion. However, for UAV garbage detection, small objects (e.g., cigarette butts) rely heavily on shallow, high-resolution features with fine spatial details, while larger objects (e.g., cardboard boxes) benefit more from deep, semantically strong features. To dynamically balance these contributions, we adopt the AFPN architecture with learnable weight coefficients. Instead of a fixed topological connection, AFPN assigns a weight vector $\mathbf{w}_l \in \mathbb{R}^K$ to each feature level $l$ when fusing with the other levels $k$ ($k=1,\dots,K$). The fused feature $\mathbf{F}_l^{\text{fused}}$ is computed as:
$$
\mathbf{F}_l^{\text{fused}} = \sum_{k=1}^{K} w_{l,k} \cdot \text{Resize}(\mathbf{F}_k)
$$
where $\text{Resize}$ denotes upsampling or downsampling to match the spatial size of $\mathbf{F}_l$. The weights $w_{l,k}$ are learned via backpropagation, allowing the network to adaptively emphasize features from the scales that are most discriminative for the current input. This mechanism mitigates the common issue of shallow features being overwhelmed by deep semantics in traditional FPN. In our implementation, we replace the original PANet neck of YOLOv8n with an AFPN containing three fusion levels (P3, P4, P5). The learnable weights are initialized uniformly and optimized during training. Experiments show that AFPN improves mAP for small objects by over 4% compared to the fixed-fusion counterpart, validating its effectiveness for UAV scenarios.
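The fusion rule can be sketched on single-channel maps as follows. This is a simplified illustration, not the network layer itself: nearest-neighbour interpolation stands in for $\text{Resize}$ (the source does not specify the resampling method), the weights are plain numbers rather than trained parameters, and the helper names are hypothetical.

```python
def resize_nearest(fmap, out_h, out_w):
    """Nearest-neighbour resize of a 2-D nested list (up- or downsampling)."""
    in_h, in_w = len(fmap), len(fmap[0])
    return [[fmap[i * in_h // out_h][j * in_w // out_w] for j in range(out_w)]
            for i in range(out_h)]


def fuse_level(features, weights, target):
    """Compute F_target = sum_k w_k * Resize(F_k) at the target level's size."""
    out_h, out_w = len(features[target]), len(features[target][0])
    fused = [[0.0] * out_w for _ in range(out_h)]
    for w_k, f_k in zip(weights, features):
        resized = resize_nearest(f_k, out_h, out_w)
        for i in range(out_h):
            for j in range(out_w):
                fused[i][j] += w_k * resized[i][j]
    return fused


# Toy single-channel pyramid: P3 is 4x4, P4 is 2x2, P5 is 1x1.
p3 = [[1.0] * 4 for _ in range(4)]
p4 = [[2.0] * 2 for _ in range(2)]
p5 = [[4.0]]

# Illustrative weights for the P3 output; a trained model would learn these.
fused_p3 = fuse_level([p3, p4, p5], weights=[0.5, 0.25, 0.25], target=0)
```

Raising the first weight makes the fused P3 map follow the shallow, high-resolution input more closely, which is the behaviour the text describes for small targets.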
3.2.3 Improved CIoU Loss
In UAV road garbage detection, the dataset exhibits a long-tail distribution with numerous background regions and a small number of tiny, often occluded garbage targets. Standard IoU loss suffers from vanishing gradients when the predicted box does not overlap with the ground truth, stalling learning for hard examples. To remedy this, we adopt the Complete IoU (CIoU) loss, which incorporates three geometric factors: overlap area (IoU), center distance, and aspect ratio consistency. The CIoU loss is defined as:
$$
\mathcal{L}_{\text{CIoU}} = 1 - \text{IoU} + \frac{\rho^2(\mathbf{b}, \mathbf{b}^{\text{gt}})}{c^2} + \alpha v
$$
where $\mathbf{b}$ and $\mathbf{b}^{\text{gt}}$ are the centers of predicted and ground-truth boxes, $\rho$ is the Euclidean distance, $c$ is the diagonal length of the smallest enclosing box covering both boxes, $v$ measures the consistency of aspect ratios:
$$
v = \frac{4}{\pi^2} \left( \arctan\frac{w^{\text{gt}}}{h^{\text{gt}}} - \arctan\frac{w}{h} \right)^2
$$
and $\alpha$ is a trade-off parameter computed as:
$$
\alpha = \frac{v}{(1 - \text{IoU}) + v}
$$
By penalizing both center deviation and shape mismatch, CIoU provides non-zero gradients even when IoU is zero, ensuring stable regression for tiny targets. In our experiments, replacing the original loss with CIoU improved AP for the smallest category (cigarette butt) by 2.3%, demonstrating its suitability for UAV garbage detection.
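The three equations above translate directly into a scalar computation. The sketch below evaluates the CIoU loss for a single pair of axis-aligned boxes; the `(x1, y1, x2, y2)` box format, the function name, and the small `eps` added to avoid division by zero are assumptions made for this illustration, and the code computes only the loss value, not its gradient.

```python
import math


def ciou_loss(box, box_gt, eps=1e-9):
    """CIoU loss for two boxes given as (x1, y1, x2, y2) with positive size."""
    x1, y1, x2, y2 = box
    gx1, gy1, gx2, gy2 = box_gt
    w, h = x2 - x1, y2 - y1
    wg, hg = gx2 - gx1, gy2 - gy1

    # Overlap term: intersection over union.
    iw = max(0.0, min(x2, gx2) - max(x1, gx1))
    ih = max(0.0, min(y2, gy2) - max(y1, gy1))
    inter = iw * ih
    union = w * h + wg * hg - inter
    iou = inter / (union + eps)

    # Squared distance between box centres (rho^2).
    rho2 = ((x1 + x2) / 2 - (gx1 + gx2) / 2) ** 2 \
         + ((y1 + y2) / 2 - (gy1 + gy2) / 2) ** 2

    # Squared diagonal of the smallest enclosing box (c^2).
    cw = max(x2, gx2) - min(x1, gx1)
    ch = max(y2, gy2) - min(y1, gy1)
    c2 = cw ** 2 + ch ** 2 + eps

    # Aspect-ratio consistency v and trade-off weight alpha.
    v = (4 / math.pi ** 2) * (math.atan(wg / hg) - math.atan(w / h)) ** 2
    alpha = v / ((1 - iou) + v + eps)

    return 1 - iou + rho2 / c2 + alpha * v


# Two disjoint 2x2 boxes: IoU is 0, yet the distance term still contributes.
loss_disjoint = ciou_loss((0, 0, 2, 2), (4, 0, 6, 2))
# Identical boxes give a loss of (numerically) zero.
loss_same = ciou_loss((0, 0, 2, 2), (0, 0, 2, 2))
```

In the disjoint case the loss is $1 + 16/40 = 1.4$: the boxes share their shape, so $v = 0$, and the remaining penalty comes entirely from the center-distance term, which is what keeps the gradient non-zero when the boxes do not overlap.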
3.3 Distribution Monitoring via Coordinate Mapping and KDE
Beyond single-frame detection, we design a complete pipeline to monitor the spatial distribution of garbage for smart-city management. First, we convert pixel coordinates of detected objects to geographic coordinates (latitude, longitude) using UAV GPS/IMU data. Given UAV height $H$, camera field of view (FOV) in both axes, and image resolution $(W, H_{\text{img}})$, the ground displacement $(\Delta x, \Delta y)$ corresponding to a detected point at pixel $(u,v)$ relative to the image center is:
$$
\Delta x = \frac{2H \tan(\text{FOV}_x/2)}{W} \cdot (u - W/2), \quad
\Delta y = \frac{2H \tan(\text{FOV}_y/2)}{H_{\text{img}}} \cdot (v - H_{\text{img}}/2)
$$
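As a worked instance of this displacement formula, the sketch below computes the metric offset of a pixel from the image center. It assumes a nadir-pointing camera over flat ground, which the formula itself implies; the function name and the example flight height, field of view, and resolution are hypothetical values, not the specifications of the UAV used in this work.

```python
import math


def pixel_to_ground(u, v, height_m, fov_x_deg, fov_y_deg, img_w, img_h):
    """Ground offset (dx, dy) in metres of pixel (u, v) from the image centre.

    Assumes a nadir-pointing camera over flat ground. dy follows the image
    convention (positive towards the bottom of the frame).
    """
    # Metres covered by one pixel along each axis.
    m_per_px_x = 2 * height_m * math.tan(math.radians(fov_x_deg) / 2) / img_w
    m_per_px_y = 2 * height_m * math.tan(math.radians(fov_y_deg) / 2) / img_h
    return m_per_px_x * (u - img_w / 2), m_per_px_y * (v - img_h / 2)


# Hypothetical flight: 50 m altitude, 90 x 60 degree FOV, 4000 x 3000 image.
dx, dy = pixel_to_ground(3000, 1500, 50.0, 90.0, 60.0, 4000, 3000)
```

With these numbers the horizontal footprint is about 100 m, so one pixel spans roughly 0.025 m and the point 1,000 pixels right of center lies about 25 m from the nadir point.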
Then, using the UAV’s heading angle $\phi$, we rotate the displacement and add to the UAV’s base coordinates $(L_0, B_0)$ to obtain the object’s estimated longitude $L_i$ and latitude $B_i$. After collecting all detections from a mission, we apply kernel density estimation (KDE) with a Gaussian kernel to create a continuous density field:
$$
\hat{f}(\mathbf{x}) = \frac{1}{n h^2} \sum_{i=1}^{n} \frac{1}{2\pi} \exp\left( -\frac{\|\mathbf{x} - \mathbf{x}_i\|^2}{2h^2} \right)
$$
where $\mathbf{x}_i$ are the projected coordinates of detected garbage, $n$ is the total count, and $h$ is the bandwidth (set to 2 meters empirically). The density values are mapped to a pseudo-color palette (blue-green-yellow-red) to generate a heatmap. This heatmap identifies “hotspots” where garbage is concentrated, such as rest areas or benches, enabling optimized routing for cleaning crews. In our simulation on a park dataset, the spatial correlation between the predicted density field and the ground-truth density field reached 0.92, and the overlap ratio of high-density regions reached 88%, validating the effectiveness of this monitoring approach for UAV applications.
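The heading rotation and the density estimate can be sketched as below. Several conventions are assumptions, since the text does not fix them: the heading is taken as clockwise from true north, the top of the image is taken to point along the heading, metres are converted to degrees with a flat-earth approximation (about 111,320 m per degree of latitude), and the KDE is evaluated on local metric coordinates rather than on raw longitude and latitude. All function names are hypothetical.

```python
import math

M_PER_DEG_LAT = 111_320.0  # approximate metres per degree of latitude (assumption)


def to_lon_lat(dx, dy, heading_deg, lon0, lat0):
    """Rotate an image-frame offset (metres) by the UAV heading and add it
    to the UAV position (lon0, lat0). Flat-earth approximation."""
    phi = math.radians(heading_deg)
    right, forward = dx, -dy  # image v grows downward, so forward is -dy
    east = right * math.cos(phi) + forward * math.sin(phi)
    north = -right * math.sin(phi) + forward * math.cos(phi)
    lat = lat0 + north / M_PER_DEG_LAT
    lon = lon0 + east / (M_PER_DEG_LAT * math.cos(math.radians(lat0)))
    return lon, lat


def kde_density(x, y, points, h=2.0):
    """Gaussian KDE at (x, y) for 2-D points in metres with bandwidth h."""
    n = len(points)
    total = 0.0
    for px, py in points:
        d2 = (x - px) ** 2 + (y - py) ** 2
        total += math.exp(-d2 / (2 * h * h)) / (2 * math.pi)
    return total / (n * h * h)


# A point 111.32 m ahead of a north-facing UAV shifts latitude by about 0.001 deg.
lon, lat = to_lon_lat(0.0, -111.32, 0.0, 120.0, 30.0)

# Density at the location of a single detection: 1 / (2 * pi * h^2).
peak = kde_density(0.0, 0.0, [(0.0, 0.0)], h=2.0)
```

Evaluating `kde_density` on a regular grid over the patrol area and mapping the values to a color palette would yield the heatmap described above.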
4. Experiments and Results
4.1 Implementation Details
All experiments are conducted on a workstation with an NVIDIA GeForce RTX 4060 GPU (8 GB VRAM), Intel Core i7-13700HX CPU, and 16 GB RAM, running Ubuntu 20.04. The deep learning framework is PyTorch 2.0.1 with CUDA 11.8. Input images are resized to 640×640 pixels. Batch size is 16. We use AdamW optimizer with an initial learning rate of 0.001 and cosine annealing scheduler. Total training epochs are 300 with early stopping patience of 50 epochs. Data augmentation includes random flipping, scaling (0.5–1.5), and brightness adjustment (0.8–1.2). The baseline model is YOLOv8n provided by Ultralytics (version 8.0.100). We compare GDMA-YOLO with YOLOv8n and TPH-YOLOv5 under the same training scheme.
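For reference, the learning-rate schedule implied by these settings can be written out explicitly. The sketch uses the textbook cosine annealing formula with the stated initial rate of 0.001 over 300 epochs; the final rate `lr_min` is not given in the text and is assumed to be zero here, and the training framework's own scheduler may differ in details such as warm-up or a non-zero floor.

```python
import math


def cosine_lr(epoch, total_epochs=300, lr0=1e-3, lr_min=0.0):
    """Cosine-annealed learning rate at a given epoch.

    lr_min is an assumed value; the paper states only lr0 and the schedule type.
    """
    cos_term = 1 + math.cos(math.pi * epoch / total_epochs)
    return lr_min + 0.5 * (lr0 - lr_min) * cos_term


# Learning rate at the start, midpoint, and end of training.
schedule = [cosine_lr(e) for e in (0, 150, 300)]
```

Under these assumptions the rate starts at 0.001, passes through half that value at epoch 150, and decays to zero at epoch 300.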
4.2 Quantitative Results
The following table summarizes the performance comparisons on the test set.
| Model | mAP@50 (%) | Parameters (M) | Model Size (MB) | Recall (%) | Precision (%) |
|---|---|---|---|---|---|
| YOLOv8n | 70.79 | 3.01 | 5.96 | 64.2 | 81.19 |
| TPH-YOLOv5 | 67.54 | 3.20 | 3.68 | 62.98 | 80.99 |
| GDMA-YOLO | 78.36 | 3.01 | 6.12 | 71.2 | 88.20 |
GDMA-YOLO achieves a mAP@50 of 78.36%, surpassing YOLOv8n by 7.57 percentage points and TPH-YOLOv5 by 10.82 percentage points. Importantly, this improvement comes with no increase in parameter count (both GDMA-YOLO and YOLOv8n have 3.01M parameters). Recall and precision also improve significantly, reaching 71.2% and 88.2%, respectively. The model size increases slightly to 6.12 MB (vs. 5.96 MB for YOLOv8n) due to the additional learnable weights in AFPN, but this is negligible for deployment.
To further analyze per-class performance, we compute the average precision for each category.
| Model | Plastic Bottle | Plastic Bag | Cardboard Box | Metal Can | Cigarette Butt | Mask |
|---|---|---|---|---|---|---|
| YOLOv8n | 68.3 | 72.1 | 74.5 | 65.8 | 42.1 | 55.9 |
| GDMA-YOLO | 76.2 | 80.4 | 82.1 | 73.6 | 51.3 | 64.5 |
| Improvement | +7.9 | +8.3 | +7.6 | +7.8 | +9.2 | +8.6 |
Notably, the smallest category (cigarette butt) benefits the most from our improvements, with AP increasing by 9.2 percentage points. This confirms that Ghost-SPDConv and AFPN effectively enhance feature representation for tiny objects. The mask category also shows significant gains, indicating robustness to irregular shapes.
4.3 Training Convergence Analysis
The loss curves during training show stable convergence. The total loss (box loss + class loss + DFL loss) decreases rapidly in the first 100 epochs and then flattens. Final box loss stabilizes around 0.18, class loss around 0.35, and distribution focal loss (DFL) around 0.65. Validation loss follows a similar trend and remains consistently lower than training loss, indicating no overfitting. The validation mAP@50 curve reaches 78% after 250 epochs and improves slightly thereafter. Precision and recall converge to 88% and 71%, respectively. These results demonstrate that GDMA-YOLO learns effectively on the UAV dataset without overfitting, making it suitable for real-world deployment.
4.4 Ablation Study
We perform an ablation study to quantify the contribution of each component. Starting from the YOLOv8n baseline, we incrementally add Ghost-SPDConv, AFPN, and the improved CIoU loss. The results are shown in the following table.
| Model Variant | Ghost-SPDConv | AFPN | CIoU | mAP@50 (%) | Params (M) |
|---|---|---|---|---|---|
| Baseline (YOLOv8n) | ✗ | ✗ | ✗ | 70.79 | 3.01 |
| +Ghost-SPDConv | ✓ | ✗ | ✗ | 73.71 | 2.85 |
| +Ghost-SPDConv + AFPN | ✓ | ✓ | ✗ | 76.88 | 3.01 |
| +Ghost-SPDConv + AFPN + CIoU | ✓ | ✓ | ✓ | 78.36 | 3.01 |
Ghost-SPDConv alone improves mAP@50 by 2.92 percentage points while reducing parameters to 2.85M (the lightweight effect). Adding AFPN brings another 3.17 points, with parameters returning to 3.01M. Finally, the CIoU loss contributes a further 1.48 points. The combination yields a total improvement of 7.57 points over the baseline, confirming that all three components are effective and complementary for UAV garbage detection.
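The incremental gains quoted above follow directly from the mAP@50 column of the ablation table. The short check below recomputes them; the variable and function names are illustrative only.

```python
# mAP@50 (%) values copied from the ablation table, in cumulative order.
ablation = [
    ("Baseline (YOLOv8n)", 70.79),
    ("+ Ghost-SPDConv", 73.71),
    ("+ AFPN", 76.88),
    ("+ CIoU", 78.36),
]


def incremental_gains(rows):
    """Difference in mAP@50 between each variant and the one before it."""
    return [round(cur - prev, 2) for (_, prev), (_, cur) in zip(rows, rows[1:])]


gains = incremental_gains(ablation)
total_gain = round(ablation[-1][1] - ablation[0][1], 2)
print(gains, total_gain)  # [2.92, 3.17, 1.48] 7.57
```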
4.5 Qualitative Evaluation
Visual comparisons on challenging test images (low illumination, dense targets, occlusion) show that GDMA-YOLO produces more accurate bounding boxes and fewer false negatives than YOLOv8n. For example, under dusk lighting with heavy tree shadows, GDMA-YOLO successfully detects a partially occluded plastic bottle that YOLOv8n misses. In scenes with multiple overlapping cigarette butts, GDMA-YOLO correctly separates them while the baseline merges them into one detection. These qualitative improvements align with the quantitative gains, validating the robustness of our model in the complex environments typical of UAV inspections.
4.6 Distribution Monitoring Simulation
We test the full monitoring pipeline on a simulated park patrol dataset. GDMA-YOLO detects all garbage instances in 50 consecutive images, and we project them onto a geographic plane. The resulting KDE heatmap clearly highlights two high-density clusters: one near a bench area (where people sit and leave trash) and another around an overflowing trash bin. The spatial correlation coefficient between the predicted density field and the manually annotated ground-truth density (from human inspection) is 0.92. The region overlap ratio (IoU of hotspot regions, defined as more than 2 objects per 10 m²) reaches 88%. This demonstrates that our monitoring method can reliably identify high-priority cleaning zones, enabling efficient resource allocation for UAV-based smart-city management.
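The two agreement measures can be computed on flattened density grids as sketched below. The text does not say which correlation coefficient was used, so Pearson correlation is an assumption here. The threshold of 0.2 corresponds to 2 objects per 10 m² and presumes the densities are expressed in objects per m² (for example, a KDE scaled by the number of detections). The function names and the toy grids are hypothetical.

```python
import math


def pearson(a, b):
    """Pearson correlation between two equal-length lists of grid values."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def hotspot_iou(pred, truth, threshold=0.2):
    """IoU of the grid cells whose density exceeds the hotspot threshold."""
    p = [x > threshold for x in pred]
    t = [y > threshold for y in truth]
    inter = sum(1 for a, b in zip(p, t) if a and b)
    union = sum(1 for a, b in zip(p, t) if a or b)
    return inter / union if union else 1.0


# Toy flattened grids (objects per square metre), for illustration only.
pred_grid = [0.0, 0.1, 0.3, 0.5]
truth_grid = [0.0, 0.3, 0.3, 0.4]

corr = pearson(pred_grid, truth_grid)
overlap = hotspot_iou(pred_grid, truth_grid)  # 2 shared cells out of 3
```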
5. Conclusion
We have presented GDMA-YOLO, a lightweight yet accurate object detection framework designed specifically for UAV-based road garbage recognition and distribution monitoring. By integrating Ghost-SPDConv for efficient feature extraction, adaptive-weight AFPN for multi-scale fusion, and an improved CIoU loss for robust regression, our model achieves a mAP@50 of 78.36% with only 3.01M parameters, surpassing the YOLOv8n baseline by 7.57 percentage points and TPH-YOLOv5 by over 10 points. The model demonstrates stable performance under challenging conditions such as occlusion, low light, and dense target arrangements. Furthermore, we develop a complete end-to-end pipeline that converts detection results into geographic heatmaps via coordinate mapping and kernel density estimation, enabling quantitative analysis of garbage distribution patterns. This system provides a practical solution for UAV-based smart-city environmental monitoring, supporting automated inspection and efficient waste management. Future work will focus on extending the framework to handle extreme weather (fog, rain) through lightweight image enhancement, and on incorporating multi-UAV collaborative perception for large-scale coverage. The proposed methodology serves as a solid foundation for advancing low-altitude economy applications in China.
