In recent years, the rapid expansion of highway networks, particularly in China, has necessitated efficient and accurate methods for road maintenance and inspection. Potholes, a common form of pavement distress, not only compromise road safety but also increase vehicle operating costs and environmental impact. Traditional inspection methods, such as manual surveys and dedicated inspection vehicles, are often time-consuming, costly, and prone to human error, especially in complex or congested road environments. With advancements in unmanned aerial vehicle (UAV) technology, aerial imaging has emerged as a promising solution for large-scale, non-intrusive road condition assessment. Modern UAV systems enable high-resolution image capture from a range of altitudes, providing comprehensive coverage of highway segments. However, detecting potholes in aerial images presents significant challenges, including multi-scale variation, low pixel occupancy of small potholes, and visual similarity to background interferences such as stains, shadows, or repair patches. These factors often lead to missed detections and reduced accuracy in automated inspection systems.
To address these issues, we propose an enhanced deep learning-based algorithm for pothole detection in UAV-captured aerial images. Our approach builds upon the YOLOv11 architecture, a state-of-the-art object detector known for its balance of speed and accuracy. We introduce two key modifications: a Lightweight Enhanced Detection Module (LEDM) in the backbone to improve feature extraction for multi-scale potholes, and an Enhanced Multi-scale Attention Fusion Module (EMSA) in the neck to strengthen feature fusion across scales. These innovations aim to boost detection performance, particularly for weak-feature targets in complex aerial scenes, while maintaining computational efficiency suitable for deployment on resource-constrained devices. This work contributes to the growing field of intelligent transportation systems, offering highway agencies a practical tool for automating pothole inspection with UAV technology.
The remainder of this article is organized as follows: First, we review related work on pavement distress detection and UAV-based applications. Next, we detail the methodology, including the architectural improvements to YOLOv11. We then present experimental results on a custom dataset, with extensive evaluations using metrics such as mean Average Precision (mAP) and recall. Finally, we conclude with a summary of findings and future research directions.

Related Work
Pavement distress detection has evolved from manual inspections to automated computer vision techniques. Early methods relied on handcrafted features, such as texture descriptors or edge detectors, combined with machine learning classifiers like Support Vector Machines (SVMs). However, these approaches often struggled with generalization across diverse road conditions and lighting variations. The advent of deep learning, particularly Convolutional Neural Networks (CNNs), revolutionized the field by enabling end-to-end learning of discriminative features from raw images. Models like Faster R-CNN, SSD, and the YOLO series have been widely adopted for object detection tasks, including road damage identification.
In the context of UAV-based inspection, several studies have leveraged deep learning for pothole and crack detection. For instance, some researchers have modified YOLOv3 or YOLOv5 to handle aerial perspectives, often incorporating attention mechanisms or multi-scale fusion strategies. Others have explored data augmentation techniques, such as Generative Adversarial Networks (GANs), to synthesize training data and improve model robustness. However, challenges persist in detecting small potholes with weak features and distinguishing them from similar-looking artifacts in aerial imagery. Our work builds on these efforts by tailoring YOLOv11 specifically to UAV-captured highway images, focusing on lightweight yet effective module designs that enhance multi-scale detection.
Methodology
Our proposed algorithm modifies the YOLOv11 architecture to better handle pothole detection in aerial images. The original YOLOv11 consists of four parts: Input, Backbone, Neck, and Head. We retain the overall framework but introduce two novel modules: the LEDM in the Backbone and the EMSA in the Neck. These modules improve feature extraction and fusion, respectively, addressing the multi-scale and weak-feature challenges prevalent in UAV aerial imagery.
Backbone Improvement with LEDM Module
The Backbone is responsible for extracting hierarchical features from input images. In standard YOLOv11, the C3K2 module is used for feature extraction, but it may not adequately capture the subtle features of small potholes or effectively suppress background noise. We replace C3K2 with our proposed Lightweight Enhanced Detection Module (LEDM_module), which employs grouped parallel processing and adaptive feature enhancement to boost performance while reducing computational overhead.
The LEDM_module processes an input feature map $$F_{in} \in \mathbb{R}^{H \times W \times C}$$ through a series of operations. First, a 1×1 convolution adjusts the channel dimensions:
$$F_{pre} = \text{Conv}_{1 \times 1}(F_{in}; K_{pre}, b_{pre}) \in \mathbb{R}^{H \times W \times C'}$$
where $$C' = g \cdot c$$, with $$g$$ as the number of groups and $$c$$ as channels per group. This is followed by a split operation that divides $$F_{pre}$$ into $$g$$ sub-feature maps:
$$F_{pre} \rightarrow \{F_{split,1}, F_{split,2}, \dots, F_{split,g}\}$$
Each sub-feature map is then processed by an Adaptive_Module, which consists of two Adaptive_Efficient_Conv sub-modules in parallel. The Adaptive_Efficient_Conv utilizes adaptive average pooling, multi-branch attention, and lightweight convolutions to dynamically enhance pothole features. Specifically, for a sub-feature map $$F_{split,i}$$, adaptive average pooling yields:
$$X_{pool,i} = \text{AdaptiveAvgPool}(F_{split,i})$$
A 1×1 convolution reduces dimensions:
$$X_{conv,i} = \text{Conv}_{1 \times 1}(X_{pool,i}; K_{conv}, b_{conv})$$
The attention weights are computed via Softmax:
$$X_{softmax,i} = \text{Softmax}(X_{conv,i})$$
These weights guide three convolutional branches with kernels of sizes 1×M, M×1, and K×K, producing enhanced features:
$$
\begin{aligned}
X_{branch1,i} &= F_{split,i} \odot \text{Expand}(\text{Conv}_{1 \times M}(X_{softmax,i})) \\
X_{branch2,i} &= F_{split,i} \odot \text{Expand}(\text{Conv}_{M \times 1}(X_{softmax,i})) \\
X_{branch3,i} &= F_{split,i} \odot \text{Expand}(\text{Conv}_{K \times K}(X_{softmax,i}))
\end{aligned}
$$
The outputs are summed:
$$X_{adaptive,i} = X_{branch1,i} + X_{branch2,i} + X_{branch3,i}$$
In the Adaptive_Module, two such Adaptive_Efficient_Conv outputs are concatenated, passed through a 1×1 convolution, and then refined via a depthwise convolution (DWConv) and activation to generate attention-calibrated features. A residual connection adds the original sub-feature map, yielding the enhanced output $$F_{adaptive,i}$$. All group outputs are concatenated and fused with a final 1×1 convolution to produce the LEDM_module output $$F_{LEDM}$$.
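To make the formulation above concrete, the Adaptive_Efficient_Conv step can be sketched in PyTorch. This is a minimal illustration, not the paper's actual implementation: the pooled size, reduction ratio, and the kernel sizes M and K are assumed hyperparameters, and the Expand operation is approximated with nearest-neighbor interpolation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveEfficientConv(nn.Module):
    """Sketch of Adaptive_Efficient_Conv: pooled spatial attention guiding
    1xM, Mx1, and KxK branches (hyperparameter values are assumptions)."""

    def __init__(self, c, pool_size=4, r=2, M=3, K=3):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(pool_size)        # X_pool = AdaptiveAvgPool(F_split)
        self.reduce = nn.Conv2d(c, c // r, 1)              # X_conv = Conv_1x1(X_pool)
        self.b1 = nn.Conv2d(c // r, c, (1, M), padding=(0, M // 2))  # 1xM branch
        self.b2 = nn.Conv2d(c // r, c, (M, 1), padding=(M // 2, 0))  # Mx1 branch
        self.b3 = nn.Conv2d(c // r, c, K, padding=K // 2)            # KxK branch

    def forward(self, x):
        h, w = x.shape[-2:]
        a = self.reduce(self.pool(x))                      # B x c/r x p x p
        a = F.softmax(a.flatten(2), dim=-1).view_as(a)     # X_softmax over spatial positions
        out = torch.zeros_like(x)
        for branch in (self.b1, self.b2, self.b3):
            # Expand(.): broadcast the low-resolution weight map back to H x W
            weights = F.interpolate(branch(a), size=(h, w), mode="nearest")
            out = out + x * weights                        # F_split ⊙ expanded weights
        return out                                         # X_adaptive = sum of branches
```

Two such branches would then be concatenated, fused by a 1×1 convolution and a DWConv, and added back to the input via the residual connection described above.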
The LEDM_module reduces parameters and computation by leveraging grouped convolutions and depthwise operations, making it suitable for deployment on UAV platforms with limited processing power. The adaptive mechanism emphasizes pothole features while suppressing irrelevant background information, improving detection accuracy for both small and large potholes.
Neck Improvement with EMSA Module
The Neck integrates multi-scale features from the Backbone to capture both semantic and detailed information. In YOLOv11, a unidirectional feature pyramid network is used, but it may inefficiently fuse features across scales, leading to information loss for small targets. We introduce the Enhanced Multi-scale Attention Fusion Module (EMSA_module) to replace the original fusion blocks. The EMSA_module employs dynamic attention calibration, grouped spatial refinement, and residual fusion to enhance cross-scale information flow.
Given multi-scale features from the Backbone—shallow $$F_{ledm1}$$, middle $$F_{ledm2}$$, and deep $$F_{backbone}$$—the EMSA_module first aligns their spatial dimensions via upsampling or downsampling operations. The aligned features are concatenated:
$$F_{concat} = \text{Concat}(\text{Adown}(F_{ledm1}), \text{Conv}_{1 \times 1}(F_{ledm2}), \text{Upsample}(F_{backbone}))$$
Next, four parallel depthwise separable convolution branches with kernel sizes 5, 7, 9, and 11 process $$F_{concat}$$ to capture multi-scale spatial contexts:
$$F_{fusion} = \sum_{k \in \{5,7,9,11\}} \text{DWConv}_{k \times k}(F_{concat})$$
This is followed by additional convolution and addition operations to refine the fused features. The output $$F_{EMSA}$$ is then upsampled and concatenated with features from other scales, passed through a GSConv (Grouped Spatial Convolution) module for further refinement, and fed into C3K2 blocks to generate final feature maps for the detection head.
The EMSA_module ensures that small pothole edges and large pothole contours are effectively integrated, reducing the dilution of weak features and confusion with background elements. This is critical for UAV aerial images, where potholes exhibit significant scale variation due to varying flight altitudes and camera angles.
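The multi-kernel stage computing $$F_{fusion}$$ can be sketched as a small PyTorch module. As a simplifying assumption, plain depthwise convolutions (groups equal to channels) stand in for the full depthwise separable blocks; the channel count is illustrative.

```python
import torch
import torch.nn as nn

class MultiKernelDWFusion(nn.Module):
    """Sketch of the EMSA fusion stage:
    F_fusion = sum over k in {5, 7, 9, 11} of DWConv_{kxk}(F_concat)."""

    def __init__(self, c, kernels=(5, 7, 9, 11)):
        super().__init__()
        # groups=c makes each branch depthwise; padding k//2 preserves H x W
        self.branches = nn.ModuleList(
            nn.Conv2d(c, c, k, padding=k // 2, groups=c) for k in kernels
        )

    def forward(self, f_concat):
        # element-wise sum of the four multi-scale spatial contexts
        return sum(b(f_concat) for b in self.branches)
```

Because every branch preserves the spatial resolution, the summed output can feed directly into the subsequent refinement convolutions and residual additions described above.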
Overall Architecture and Formulations
The complete modified YOLOv11 architecture integrates the LEDM and EMSA modules. The input image $$I \in \mathbb{R}^{H_0 \times W_0 \times 3}$$ is resized to 640×640 pixels. The Backbone, enhanced with LEDM_modules, extracts features at multiple scales. Let $$F_{backbone}^{(l)}$$ denote the feature map at layer $$l$$. The Neck, with EMSA_modules, fuses these features to produce outputs $$P_3$$, $$P_4$$, and $$P_5$$ at different resolutions for the detection head. The head uses anchor-free predictions to directly regress bounding boxes and class probabilities.
The loss function combines classification and localization losses. We use the Complete IoU (CIoU) loss for bounding box regression, which accounts for overlap, center distance, and aspect ratio:
$$\mathcal{L}_{CIoU} = 1 - IoU + \frac{\rho^2(b, b^{gt})}{c^2} + \alpha v$$
where $$IoU$$ is the intersection over union, $$\rho$$ is the Euclidean distance between box centers, $$c$$ is the diagonal length of the minimum enclosing box, $$v$$ measures aspect ratio consistency, and $$\alpha$$ is a weighting factor. The total loss is:
$$\mathcal{L}_{total} = \lambda_{cls} \mathcal{L}_{cls} + \lambda_{box} \mathcal{L}_{CIoU}$$
where $$\mathcal{L}_{cls}$$ is the focal loss for classification to handle class imbalance, and $$\lambda_{cls}$$, $$\lambda_{box}$$ are balancing weights.
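A minimal CIoU computation matching the formula above can be written as follows. Boxes are assumed to be in (x1, y1, x2, y2) corner format, and $$v$$ and $$\alpha$$ follow their standard definitions from the CIoU formulation.

```python
import math
import torch

def ciou_loss(pred, target, eps=1e-7):
    """CIoU loss: 1 - IoU + rho^2/c^2 + alpha*v, for (N, 4) boxes as x1y1x2y2."""
    # intersection area and IoU
    ix1 = torch.max(pred[:, 0], target[:, 0]); iy1 = torch.max(pred[:, 1], target[:, 1])
    ix2 = torch.min(pred[:, 2], target[:, 2]); iy2 = torch.min(pred[:, 3], target[:, 3])
    inter = (ix2 - ix1).clamp(min=0) * (iy2 - iy1).clamp(min=0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)
    # rho^2: squared distance between box centers
    rho2 = ((pred[:, 0] + pred[:, 2] - target[:, 0] - target[:, 2]) ** 2
            + (pred[:, 1] + pred[:, 3] - target[:, 1] - target[:, 3]) ** 2) / 4
    # c^2: squared diagonal of the minimum enclosing box
    cw = torch.max(pred[:, 2], target[:, 2]) - torch.min(pred[:, 0], target[:, 0])
    ch = torch.max(pred[:, 3], target[:, 3]) - torch.min(pred[:, 1], target[:, 1])
    c2 = cw ** 2 + ch ** 2 + eps
    # v: aspect-ratio consistency; alpha: its trade-off weight
    wp, hp = pred[:, 2] - pred[:, 0], pred[:, 3] - pred[:, 1]
    wt, ht = target[:, 2] - target[:, 0], target[:, 3] - target[:, 1]
    v = (4 / math.pi ** 2) * (torch.atan(wt / (ht + eps)) - torch.atan(wp / (hp + eps))) ** 2
    alpha = v / (1 - iou + v + eps)
    return 1 - iou + rho2 / c2 + alpha * v
```

For identical boxes every term vanishes and the loss approaches zero; for disjoint boxes the IoU term saturates at 1 and the center-distance term keeps providing gradient, which is the main motivation for CIoU over plain IoU loss.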
To summarize, our method enhances YOLOv11 with lightweight, attention-based modules tailored to pothole detection in UAV imagery. The LEDM_module improves feature extraction efficiency, while the EMSA_module strengthens multi-scale fusion; together they address the key challenges of aerial inspection.
Experiments and Results
We conducted extensive experiments to evaluate the proposed algorithm. This section describes the dataset, experimental setup, evaluation metrics, and results, including ablation studies and comparisons with state-of-the-art models.
Dataset
We collected aerial videos of highways in China using a DJI Mavic 3 UAV flying at altitudes between 40 and 80 meters. The dataset comprises 975 images with 2,323 annotated pothole instances, split into training (80%), validation (10%), and test (10%) sets. It covers diverse scenarios, primarily dry pavement under normal lighting, with potholes ranging from small (∼10 cm diameter) to large (∼50 cm diameter). Annotation distribution is visualized with heatmaps and scatter plots, showing the spatial density and scale diversity of the potholes. The dataset thus reflects real-world conditions encountered by UAV-based highway inspection in China.
Experimental Setup
Experiments were performed on a Linux system with an NVIDIA GeForce RTX 4090 GPU and an Intel Core i7-12700KF CPU, using PyTorch 1.10.1 and CUDA 12.0. The model was trained for 200 epochs with an input size of 640×640, a batch size of 16, momentum of 0.937, weight decay of 0.0005, and an initial learning rate of 0.01 using the SGD optimizer. Data augmentation included random flipping, rotation, and color jittering to improve generalization.
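For reference, the stated hyperparameters can be gathered into a single configuration dictionary. The key names below follow common YOLO-style training configs and are illustrative only; the augmentation magnitudes are assumed values, since the text does not specify them.

```python
# Training hyperparameters from the experimental setup, as one config dict.
# Keys marked "(assumed)" are illustrative defaults, not values from the paper.
train_cfg = dict(
    imgsz=640,             # input resolution 640x640
    epochs=200,
    batch=16,
    optimizer="SGD",
    lr0=0.01,              # initial learning rate
    momentum=0.937,
    weight_decay=0.0005,
    fliplr=0.5,            # random horizontal flip probability (assumed)
    degrees=10.0,          # random rotation range (assumed)
    hsv_h=0.015, hsv_s=0.7, hsv_v=0.4,  # color jitter strengths (assumed)
)
```

Collecting the settings this way makes ablation runs reproducible: each experiment differs only in the module flags, never in the training schedule.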
Evaluation Metrics
We adopted six metrics to assess performance: Precision (P), Recall (R), mAP@0.5, mAP@0.5:0.95, number of parameters (Params), and Giga Floating Point Operations (GFLOPs). Precision and recall are defined as:
$$P = \frac{TP}{TP + FP} \times 100\%, \quad R = \frac{TP}{TP + FN} \times 100\%$$
where TP, FP, and FN denote true positives, false positives, and false negatives, respectively. mAP is the mean average precision over multiple IoU thresholds, computed as:
$$AP = \int_0^1 P(r) \, dr, \quad mAP = \frac{1}{N} \sum_{i=1}^N AP_i \times 100\%$$
where $$P(r)$$ is the precision-recall curve and $$N$$ is the number of classes (here, $$N=1$$ for potholes). Params and GFLOPs measure model complexity and computational cost, which are crucial for deployment on resource-constrained UAV platforms.
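The metric definitions above can be implemented directly. This NumPy sketch uses all-point interpolation of the precision-recall curve, one common convention for approximating the AP integral; the exact interpolation scheme used by the evaluation toolkit is not specified in the text.

```python
import numpy as np

def precision_recall(tp, fp, fn):
    """P = TP/(TP+FP) and R = TP/(TP+FN), both as percentages."""
    return tp / (tp + fp) * 100.0, tp / (tp + fn) * 100.0

def average_precision(recalls, precisions):
    """AP = integral of P(r) dr via all-point interpolation.

    `recalls` must be sorted ascending, with `precisions` aligned to it.
    """
    r = np.concatenate(([0.0], recalls, [1.0]))
    p = np.concatenate(([0.0], precisions, [0.0]))
    # precision envelope: make P(r) non-increasing from right to left
    p = np.maximum.accumulate(p[::-1])[::-1]
    idx = np.where(r[1:] != r[:-1])[0]         # points where recall changes
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))
```

mAP@0.5:0.95 would then average this AP over IoU thresholds from 0.5 to 0.95 in steps of 0.05, matching the COCO-style protocol implied by the metric name.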
Ablation Study
We performed ablation experiments to analyze the contribution of each proposed module. The baseline is YOLOv11n. Results are shown in Table 1.
| Model | LEDM | EMSA | P (%) | R (%) | mAP@0.5 (%) | mAP@0.5:0.95 (%) | Params (M) | GFLOPs |
|---|---|---|---|---|---|---|---|---|
| Baseline | × | × | 86.4 | 69.1 | 81.9 | 52.2 | 2.58 | 6.3 |
| Model A | √ | × | 84.1 | 73.2 | 83.9 | 53.2 | 2.45 | 6.1 |
| Model B | × | √ | 87.2 | 74.5 | 85.9 | 57.4 | 2.59 | 7.5 |
| Our Model | √ | √ | 83.5 | 82.7 | 86.6 | 58.3 | 2.47 | 7.2 |
The baseline achieves 81.9% mAP@0.5 and 52.2% mAP@0.5:0.95 with a recall of 69.1%, indicating room for improvement in detecting positive samples. Adding only LEDM (Model A) increases recall by 4.1 percentage points and reduces Params by 5.0%, demonstrating its effectiveness for lightweight feature extraction. Adding only EMSA (Model B) boosts mAP@0.5:0.95 by 5.2 percentage points but increases GFLOPs due to the multi-scale processing. Our full model, combining both modules, achieves the highest recall (82.7%) and mAP@0.5:0.95 (58.3%), with a 4.3% reduction in Params relative to the baseline. This confirms the synergy between LEDM and EMSA: LEDM improves feature extraction efficiency, while EMSA improves multi-scale fusion, together optimizing performance for UAV-based detection.
Comparison with State-of-the-Art Models
We compared our model with two popular lightweight detectors: YOLOv5n and YOLOv8n. Results are presented in Table 2.
| Model | P (%) | R (%) | mAP@0.5 (%) | mAP@0.5:0.95 (%) | Params (M) | GFLOPs |
|---|---|---|---|---|---|---|
| YOLOv5n | 82.0 | 60.6 | 74.2 | 42.0 | 2.50 | 7.1 |
| YOLOv8n | 80.8 | 60.2 | 72.3 | 41.6 | 3.00 | 8.1 |
| Our Model | 83.5 | 82.7 | 86.6 | 58.3 | 2.47 | 7.2 |
Our model outperforms both YOLOv5n and YOLOv8n across all accuracy metrics. Specifically, it achieves a 22.1 percentage point higher recall than YOLOv5n and a 22.5 percentage point higher recall than YOLOv8n, indicating significantly fewer missed detections. Its mAP@0.5:0.95 is 16.3 and 16.7 percentage points higher than YOLOv5n and YOLOv8n, respectively, demonstrating robust detection across IoU thresholds. In terms of complexity, our model has fewer parameters than YOLOv8n and GFLOPs comparable to YOLOv5n, making it suitable for real-time applications on UAV platforms. These results highlight the effectiveness of our modifications in addressing the challenges of aerial pothole detection.
Visualization and Analysis
We provide qualitative results to illustrate detection performance in different scenarios. In images with small, isolated potholes, the baseline model often misses them or yields low confidence scores, whereas our model identifies them with higher confidence. In complex backgrounds with multiple potholes of varying sizes, our model detects more instances with consistent confidence, reducing both false negatives and false positives. These visualizations align with the quantitative metrics, confirming that our algorithm enhances detection of weak-feature targets in UAV imagery.
To further analyze the impact of our modules, we examine feature maps before and after LEDM and EMSA processing. The LEDM_module produces more focused activations on pothole regions, suppressing irrelevant textures such as road markings or shadows. The EMSA_module integrates features across scales, preserving the details of small potholes while capturing context for large ones. This is crucial for aerial images, where pothole scale can vary dramatically within a single frame due to perspective effects of the UAV camera.
Discussion
The improvements achieved by our model can be attributed to several factors. First, the LEDM_module's grouped parallel processing enables efficient extraction of multi-scale features, which is essential given the diverse sizes of potholes in aerial views. Its adaptive enhancement mechanism dynamically amplifies pothole-related features while minimizing background interference, addressing the low signal-to-noise ratio common in UAV aerial images. Second, the EMSA_module's multi-branch depthwise convolutions and attention calibration enable effective fusion of spatial and semantic information across scales, reducing information loss for small targets. This is particularly important for ensuring that tiny potholes, which may occupy only a few pixels, are not overlooked during feature propagation.
From a practical standpoint, the reduced parameter count and manageable GFLOPs make our model deployable on UAV-mounted edge devices, enabling real-time inspection without relying on cloud computing. This aligns with the growing trend of autonomous UAV-based monitoring in smart transportation systems. However, limitations exist: our dataset primarily consists of dry pavement under normal lighting. Future work should incorporate more varied scenarios, such as wet roads, low-light environments, and different weather conditions, to improve robustness. Additionally, integrating temporal information from video sequences could further improve detection accuracy by leveraging motion cues.
Conclusion
In this article, we presented an improved YOLOv11 algorithm for detecting potholes in UAV-captured aerial images of highways in China. The proposed modifications comprise a Lightweight Enhanced Detection Module (LEDM) in the backbone and an Enhanced Multi-scale Attention Fusion Module (EMSA) in the neck. These modules work synergistically to enhance feature extraction and fusion for multi-scale, weak-feature potholes, addressing key challenges in UAV-based highway inspection. Experimental results on a custom dataset demonstrate significant improvements over the baseline YOLOv11n and other lightweight models, with higher recall and mAP while maintaining low computational complexity. Our work contributes to intelligent road maintenance, offering a reliable tool for automated pothole detection. Future research will focus on expanding the dataset to cover more diverse conditions and on real-time deployment on UAV platforms for large-scale highway monitoring.
The integration of deep learning with UAVs holds great promise for transforming infrastructure inspection. As China continues to invest in its transportation network, efficient and accurate detection systems will become increasingly vital. Our algorithm is a step in this direction, enabling safer and more sustainable road management through advanced UAV platforms.
