With the rapid expansion of global road networks, maintaining infrastructure integrity has become a critical challenge. Highway potholes, as a prevalent form of road damage, not only compromise driving safety but also accelerate vehicle wear and increase operational costs. Traditional inspection methods, such as manual surveys and dedicated inspection vehicles, are often inefficient, costly, and pose safety risks. In contrast, Unmanned Aerial Vehicle (UAV) technology offers a flexible, high-efficiency solution for large-scale road monitoring. UAVs can capture high-resolution aerial imagery without disrupting traffic, enabling rapid and comprehensive assessments. However, detecting potholes in UAV-captured images presents unique challenges, including significant multi-scale variations in pothole sizes, low pixel occupancy of small potholes, and complex background interference from road stains, shadows, and repair marks. These factors often lead to insufficient detection accuracy and high missed detection rates in existing algorithms.
To address these issues, we propose an enhanced YOLOv11-based algorithm specifically designed for pothole detection in UAV aerial imagery. Our approach introduces two novel modules: the Lightweight Enhanced Detection Module (LEDM_module) in the backbone network and the Enhanced Multi-scale Attention Fusion Module (EMSA_module) in the neck. The LEDM_module employs grouped parallel processing and adaptive feature enhancement to dynamically extract multi-scale pothole features while reducing computational redundancy. The EMSA_module integrates dynamic attention calibration, grouped spatial refinement, and residual feature fusion to improve cross-scale information transmission and mitigate feature dilution in small potholes. Extensive experiments on a dedicated dataset demonstrate that our improved model achieves superior detection accuracy and recall, making it suitable for real-world highway inspection tasks using UAV platforms such as the JUYE UAV.

Recent advancements in deep learning have significantly propelled the field of road damage detection. Conventional methods relied on manual feature extraction and machine learning classifiers, which often struggled with generalization in complex scenarios. The advent of convolutional neural networks (CNNs) enabled end-to-end learning, with models like YOLO and Faster R-CNN being widely adopted for object detection tasks. For instance, Maeda et al. utilized generative adversarial networks (GANs) combined with Poisson blending to synthesize realistic damage maps, enhancing model performance across varying data scales. Similarly, Shim et al. integrated GANs with semi-supervised learning and super-resolution techniques to develop a lightweight road damage sensor that achieved accurate identification with minimal annotated data. These studies highlight the potential of data augmentation and lightweight design in addressing data scarcity and computational constraints.
In the context of UAV-based inspection, challenges such as multi-scale object detection and real-time processing have prompted innovations in network architecture. Wan et al. introduced YOLO-LRDD, incorporating Shuffle-ECANet and BiFPN modules to balance accuracy and efficiency, while Zhang et al. embedded multi-level attention mechanisms into YOLOv3 to enhance feature extraction for cracks and potholes. Despite these efforts, detecting potholes in aerial imagery remains problematic due to inherent multi-scale heterogeneity and background clutter. Our work builds upon YOLOv11, a state-of-the-art detector known for its anchor-free design and efficient feature extraction, and introduces tailored modifications to tackle these specific challenges in UAV applications.
YOLOv11, developed by Ultralytics, represents a significant evolution in the YOLO series, emphasizing high performance and real-time capability. Its architecture comprises four main components: Input, Backbone, Neck, and Head. The Backbone includes Conv, C3K2, SPPF, and C2PSA modules, where C2PSA incorporates multi-head attention to capture fine-grained details of weak-feature targets like potholes. The Neck employs a unidirectional feature pyramid network to integrate deep semantic information with shallow spatial details, while the Head utilizes depthwise separable convolutions and an anchor-free approach to reduce computational complexity and improve detection for multi-scale objects. These characteristics make YOLOv11 a suitable baseline for our task, as it provides a robust foundation for handling the diverse scales and complex backgrounds encountered in UAV imagery.
However, the standard YOLOv11 struggles with the specific demands of pothole detection in aerial views. Small potholes, often measuring 10–20 cm in diameter, account for over 60% of cases but occupy minimal pixels in images, leading to weak feature signals that are easily overlooked. Larger potholes exhibit irregular shapes that resemble road cracks or patches, and environmental factors like lighting variations and road stains further complicate differentiation. To overcome these limitations, we redesign the Backbone and Neck sections of YOLOv11. The improved Backbone replaces the original C3K2 modules with LEDM_modules, which leverage grouped parallel processing and adaptive feature enhancement to focus on multi-scale pothole characteristics. The enhanced Neck substitutes the standard feature fusion mechanism with EMSA_modules, which employ multi-branch depthwise convolutions and attention mechanisms to refine feature integration across scales.
The LEDM_module consists of Conv, Split, and Adaptive_Module components. The input feature map is first processed through a 1×1 convolution to adjust channel dimensions, then split into g groups for parallel processing. Each group undergoes adaptive feature enhancement via two Adaptive_Efficient_Conv sub-modules, which apply dynamic weighting to emphasize pothole regions and suppress background noise. The Adaptive_Efficient_Conv uses adaptive average pooling, multi-branch attention fusion, and lightweight convolution to reduce computation while enhancing feature cohesion. The enhanced group features are then concatenated with the original split features and fused through a final 1×1 convolution. This process can be summarized as follows:
Let the input image be denoted as \( I \in \mathbb{R}^{H \times W \times 3} \), where H and W are the height and width, respectively. The initial convolution and batch normalization produce a base feature map \( F_1 \):
$$ F_1 = \text{BN}(\text{Conv}_{3\times3}(I; K_1, b_1)) \in \mathbb{R}^{H_1 \times W_1 \times C_1} $$
Here, \( \text{Conv}_{3\times3}(\cdot) \) represents a 3×3 convolution, and BN(·) denotes batch normalization. The LEDM_module then processes \( F_1 \) as:
$$ F_{\text{ledm\_pre}} = \text{Conv}_{1\times1}(F_1; K_{l1}, b_{l1}) \in \mathbb{R}^{H_1 \times W_1 \times C_2} $$
where \( C_2 = g \cdot c \), with g being the number of groups and c the channels per group. The feature map is split into g groups: \( \{F_{\text{split},1}, F_{\text{split},2}, \ldots, F_{\text{split},g}\} \). Each group is processed by the Adaptive_Module, which outputs enhanced features \( F_{\text{adaptive},i} \). The Adaptive_Module involves:
$$ F_{\text{concat\_adp},i} = \text{Concat}(X_{\text{adaptive1},i}, X_{\text{adaptive2},i}) $$
$$ F_{\text{conv\_adp},i} = \text{Conv}_{1\times1}(F_{\text{concat\_adp},i}; K_{\text{adp}}, b_{\text{adp}}) $$
Dynamic attention weights are generated as:
$$ F_{\text{attn\_weight},i} = \text{Linear}(\text{Activation}(\text{DWConv}(\text{Linear}(F_{\text{conv\_adp},i})))) $$
The calibrated features are obtained by:
$$ F_{\text{calib},i} = F_{\text{conv\_adp},i} \odot F_{\text{attn\_weight},i} $$
and the residual fusion yields:
$$ F_{\text{adaptive},i} = F_{\text{calib},i} + F_{\text{split},i} $$
Finally, all group features are concatenated and fused:
$$ F_{\text{concat}} = \text{Concat}(\{F_{\text{adaptive},i}\}_{i=1}^g, \{F_{\text{split},i}\}_{i=1}^g) $$
$$ F_{\text{ledm}} = \text{BN}(\text{Conv}_{1\times1}(F_{\text{concat}}; K_{l2}, b_{l2})) $$
The Backbone output is further processed through SPPF and C2PSA modules to produce multi-scale features for the Neck.
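The grouped split-enhance-fuse flow defined by the equations above can be sketched in PyTorch as follows. This is a minimal illustration, not the authors' implementation: the channel widths, 3×3 depthwise kernels, SiLU activation, and sigmoid gating for the attention weights are all assumptions where the text leaves details unspecified.

```python
import torch
import torch.nn as nn

class AdaptiveEfficientConv(nn.Module):
    """Sketch of Adaptive_Efficient_Conv: a lightweight depthwise convolution
    gated by weights derived from adaptive average pooling (details assumed)."""
    def __init__(self, c):
        super().__init__()
        self.dw = nn.Conv2d(c, c, 3, padding=1, groups=c)  # lightweight conv
        self.pool = nn.AdaptiveAvgPool2d(1)                # adaptive average pooling
        self.gate = nn.Sequential(nn.Linear(c, c), nn.SiLU(),
                                  nn.Linear(c, c), nn.Sigmoid())

    def forward(self, x):
        w = self.gate(self.pool(x).flatten(1)).view(x.size(0), -1, 1, 1)
        return self.dw(x) * w  # emphasize pothole regions, suppress background

class AdaptiveModule(nn.Module):
    """Per-group enhancement: two Adaptive_Efficient_Conv branches, a 1x1 merge,
    dynamic attention calibration, and a residual back to the split input."""
    def __init__(self, c):
        super().__init__()
        self.branch1 = AdaptiveEfficientConv(c)   # X_adaptive1
        self.branch2 = AdaptiveEfficientConv(c)   # X_adaptive2
        self.merge = nn.Conv2d(2 * c, c, 1)       # F_conv_adp
        # attention: Linear -> DWConv -> activation -> Linear (1x1 convs act as
        # per-pixel Linear layers); sigmoid at the end is an assumption
        self.attn = nn.Sequential(
            nn.Conv2d(c, c, 1),
            nn.Conv2d(c, c, 3, padding=1, groups=c),
            nn.SiLU(),
            nn.Conv2d(c, c, 1),
            nn.Sigmoid())

    def forward(self, s):
        f = self.merge(torch.cat([self.branch1(s), self.branch2(s)], dim=1))
        return f * self.attn(f) + s               # F_calib + F_split

class LEDM(nn.Module):
    """LEDM_module: 1x1 pre-conv -> split into g groups -> per-group
    Adaptive_Module -> concat enhanced + split features -> 1x1 fuse + BN."""
    def __init__(self, c_in, c_out, g=4):
        super().__init__()
        c = c_out // g
        self.g = g
        self.pre = nn.Conv2d(c_in, g * c, 1)
        self.groups = nn.ModuleList(AdaptiveModule(c) for _ in range(g))
        self.fuse = nn.Sequential(nn.Conv2d(2 * g * c, c_out, 1),
                                  nn.BatchNorm2d(c_out))

    def forward(self, x):
        splits = self.pre(x).chunk(self.g, dim=1)                    # F_split,i
        enhanced = [m(s) for m, s in zip(self.groups, splits)]       # F_adaptive,i
        return self.fuse(torch.cat(enhanced + list(splits), dim=1))  # F_ledm
```

Note that the residual in `AdaptiveModule` and the concatenation of both enhanced and raw split features in `LEDM.forward` mirror the residual fusion and concatenation equations above.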
The EMSA_module in the Neck addresses feature dilution and misalignment by integrating dynamic attention calibration, grouped spatial refinement, and residual fusion. It takes multi-scale features from the Backbone (shallow, middle, and deep layers) and aligns them via down-sampling, up-sampling, and convolution operations. The aligned features are concatenated and processed through four parallel depthwise convolution branches with kernel sizes of 5, 7, 9, and 11. The outputs are summed to fuse multi-scale information, and the result is further refined through convolution and residual connections. The process can be expressed as:
Given input features \( F_{\text{ledm1}} \), \( F_{\text{ledm2}} \), and \( F_{\text{backbone}} \), the EMSA_module first aligns and concatenates them:
$$ F_{\text{concat\_emsa}} = \text{Concat}\left( \text{Adown}(F_{\text{ledm1}}), \text{Conv}_{1\times1}(F_{\text{ledm2}}), \text{Upsample}(F_{\text{backbone}}) \right) $$
Multi-branch fusion is performed as:
$$ F_{\text{fusion\_emsa}} = \sum_{i \in \{5,7,9,11\}} \text{DWConv}_{i\times i}(F_{\text{concat\_emsa}}) $$
The final output \( F_{\text{out\_emsa}} \) is obtained through additional convolution and fusion steps. The Neck then uses GSConv and C3K2 modules to further refine the features before passing them to the detection head.
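A corresponding PyTorch sketch of the EMSA_module's align-concat-fuse flow is given below. The strided convolution standing in for Adown, the nearest-neighbor upsampling, and all channel widths are assumptions; only the three-scale alignment, the four depthwise branches with kernels 5/7/9/11, their summation, and the residual refinement follow the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EMSA(nn.Module):
    """Sketch of the EMSA_module: align three scales, concatenate, fuse via
    parallel depthwise branches (k = 5, 7, 9, 11), then refine with a
    residual connection and a 1x1 convolution."""
    def __init__(self, c_shallow, c_mid, c_deep, c_out):
        super().__init__()
        # alignment: down-sample shallow, 1x1 on middle, up-sample + project deep
        self.down = nn.Conv2d(c_shallow, c_mid, 3, stride=2, padding=1)  # stand-in for Adown
        self.mid = nn.Conv2d(c_mid, c_mid, 1)
        self.up_proj = nn.Conv2d(c_deep, c_mid, 1)
        c_cat = 3 * c_mid
        self.branches = nn.ModuleList(
            nn.Conv2d(c_cat, c_cat, k, padding=k // 2, groups=c_cat)
            for k in (5, 7, 9, 11))
        self.refine = nn.Conv2d(c_cat, c_out, 1)

    def forward(self, f_shallow, f_mid, f_deep):
        x = torch.cat([
            self.down(f_shallow),                                   # Adown(F_ledm1)
            self.mid(f_mid),                                        # Conv1x1(F_ledm2)
            self.up_proj(F.interpolate(f_deep, scale_factor=2.0)),  # Upsample(F_backbone)
        ], dim=1)                                                   # F_concat_emsa
        fused = sum(b(x) for b in self.branches)                    # F_fusion_emsa
        return self.refine(fused + x)                               # residual refinement
```

The summation over branches (rather than concatenation) keeps the channel count fixed while still mixing the four receptive-field sizes, which is why `fused + x` is shape-compatible as a residual.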
We conducted experiments on a custom dataset comprising 975 aerial images of highways, captured by a JUYE UAV at altitudes of 40–80 meters. The dataset includes 2,323 annotated pothole instances, of which 1,867 labels fall in the training set. The images exhibit diverse pothole sizes and complex backgrounds, as shown in the label distribution analysis. The dataset was split into training, validation, and test sets in an 8:1:1 ratio. Experiments were performed on a Linux system with an NVIDIA RTX 4090 GPU, using PyTorch 1.10.1 and CUDA 12.0. Training parameters included an input size of 640×640 pixels, momentum of 0.937, weight decay of 0.0005, an initial learning rate of 0.01, and the SGD optimizer for 200 epochs.
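With the Ultralytics training API, a run matching these hyperparameters would look roughly like the configuration sketch below. The dataset YAML path is a placeholder, and the stock `yolo11n.yaml` stands in for the modified architecture, whose custom LEDM/EMSA modules would need to be registered with the framework first.

```python
from ultralytics import YOLO

# Placeholder model/dataset configs; a real run needs a dataset YAML listing
# the 8:1:1 train/val/test splits and a model YAML with the modified layers.
model = YOLO("yolo11n.yaml")
model.train(
    data="potholes.yaml",   # hypothetical dataset config
    epochs=200,
    imgsz=640,
    optimizer="SGD",
    lr0=0.01,               # initial learning rate
    momentum=0.937,
    weight_decay=0.0005,
)
```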
Evaluation metrics included precision (P), recall (R), mAP@0.5, mAP@0.5:0.95, parameters (Params), and GFLOPs. Precision and recall are defined as:
$$ P = \frac{\text{TP}}{\text{TP} + \text{FP}} \times 100\% $$
$$ R = \frac{\text{TP}}{\text{TP} + \text{FN}} \times 100\% $$
where TP, FP, and FN denote true positives, false positives, and false negatives, respectively. The average precision (AP) and mean average precision (mAP) are computed as:
$$ \text{AP} = \int_0^1 P(R) \, dR $$
$$ \text{mAP} = \frac{1}{N} \sum_{i=1}^N \text{AP}_i \times 100\% $$
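These metrics follow directly from detection counts and a precision-recall curve. The sketch below approximates the AP integral with trapezoidal integration, which is one common choice; interpolated variants (as used by COCO-style evaluators) differ slightly.

```python
import numpy as np

def precision_recall(tp, fp, fn):
    """P = TP/(TP+FP) and R = TP/(TP+FN), returned as percentages."""
    return tp / (tp + fp) * 100.0, tp / (tp + fn) * 100.0

def average_precision(recalls, precisions):
    """Area under the precision-recall curve via trapezoidal integration
    (one common approximation of the AP integral)."""
    r = np.asarray(recalls, dtype=float)
    p = np.asarray(precisions, dtype=float)
    order = np.argsort(r)
    rs, ps = r[order], p[order]
    return float(np.sum((rs[1:] - rs[:-1]) * (ps[1:] + ps[:-1]) / 2.0))

def mean_average_precision(ap_per_class):
    """mAP: mean of per-class AP values (AP in [0, 1]), as a percentage."""
    return float(np.mean(ap_per_class)) * 100.0
```

For a single-class task such as pothole detection, mAP reduces to the AP of the one class, so mAP@0.5 is simply AP at an IoU threshold of 0.5.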
We performed ablation studies to assess the contribution of each proposed module. The baseline YOLOv11n model achieved a precision of 86.4%, recall of 69.1%, mAP@0.5 of 81.9%, and mAP@0.5:0.95 of 52.2%, with 2.58M parameters and 6.3 GFLOPs. Adding the LEDM_module improved recall to 73.2% and mAP@0.5 to 83.9%, while reducing parameters to 2.45M and GFLOPs to 6.1. Incorporating the EMSA_module alone boosted recall to 74.5% and mAP@0.5:0.95 to 57.4%, albeit with a slight increase in GFLOPs to 7.5. When both modules were combined, the model achieved a recall of 82.7%, mAP@0.5 of 86.6%, and mAP@0.5:0.95 of 58.3%, with 2.47M parameters and 7.2 GFLOPs. These results demonstrate the synergistic effect of the modules in enhancing detection performance while maintaining efficiency.
| Configuration | LEDM | EMSA | P (%) | R (%) | mAP@0.5 (%) | mAP@0.5:0.95 (%) | Params (M) | GFLOPs |
|---|---|---|---|---|---|---|---|---|
| Baseline | × | × | 86.4 | 69.1 | 81.9 | 52.2 | 2.58 | 6.3 |
| +LEDM | √ | × | 84.1 | 73.2 | 83.9 | 53.2 | 2.45 | 6.1 |
| +EMSA | × | √ | 87.2 | 74.5 | 85.9 | 57.4 | 2.59 | 7.5 |
| Full Model | √ | √ | 83.5 | 82.7 | 86.6 | 58.3 | 2.47 | 7.2 |
Comparative experiments with state-of-the-art lightweight detectors, YOLOv5n and YOLOv8n, further validated our approach. Our improved YOLOv11n achieved a precision of 83.5%, recall of 82.7%, mAP@0.5 of 86.6%, and mAP@0.5:0.95 of 58.3%, with 2.47M parameters and 7.2 GFLOPs. In contrast, YOLOv5n attained 82.0% precision, 60.6% recall, 74.2% mAP@0.5, and 42.0% mAP@0.5:0.95, with 2.50M parameters and 7.1 GFLOPs. YOLOv8n achieved 80.8% precision, 60.2% recall, 72.3% mAP@0.5, and 41.6% mAP@0.5:0.95, with 3.00M parameters and 8.1 GFLOPs. Our model outperformed both in all accuracy metrics while maintaining competitive computational efficiency, highlighting its suitability for UAV deployment.
| Model | P (%) | R (%) | mAP@0.5 (%) | mAP@0.5:0.95 (%) | Params (M) | GFLOPs |
|---|---|---|---|---|---|---|
| YOLOv5n | 82.0 | 60.6 | 74.2 | 42.0 | 2.50 | 7.1 |
| YOLOv8n | 80.8 | 60.2 | 72.3 | 41.6 | 3.00 | 8.1 |
| Improved YOLOv11n | 83.5 | 82.7 | 86.6 | 58.3 | 2.47 | 7.2 |
Visualization of detection results on various scenarios confirmed the practical benefits of our algorithm. In scenes with isolated small potholes, the baseline YOLOv11n often failed to detect targets or exhibited low confidence, whereas our model consistently identified potholes with higher confidence scores. For example, in one test image, the baseline missed a pothole entirely, while our model detected it with a confidence of 0.65. In complex backgrounds with multiple potholes of varying sizes, the baseline suffered from significant missed detections, whereas our model successfully identified most potholes with an average confidence of 0.56–0.57, compared to 0.31 for the baseline. These visual assessments underscore the improved model’s robustness in handling multi-scale and weak-feature potholes under challenging conditions.
In conclusion, we have developed an enhanced YOLOv11-based algorithm for detecting highway potholes in UAV aerial imagery. By incorporating LEDM_modules in the Backbone and EMSA_modules in the Neck, our method effectively addresses the challenges of multi-scale variation and background interference. Experimental results demonstrate substantial improvements in recall and mAP metrics over the baseline and other lightweight models, while maintaining computational efficiency. The proposed algorithm is well-suited for integration with UAV systems such as the JUYE platform, offering a practical solution for automated highway inspection. Future work will focus on expanding the dataset to include diverse environmental conditions, such as varying lighting and road wetness, to further enhance model generalization and real-world applicability.
