An Enhanced YOLOv11 Algorithm for Highway Pothole Detection from Aerial Imagery Using UAVs in China

The national highway network in China, which exceeded 5.4 million kilometers by the end of 2024, deteriorates continuously under traffic loads and environmental factors. Potholes, a prevalent type of pavement distress, compromise road smoothness and driving safety and raise vehicle operating costs. Traditional inspection methods, such as manual surveys or specialized inspection vehicles, are often inefficient and costly, and they expose inspectors to safety risks. Unmanned aerial vehicles (UAVs) offer an alternative: they provide a flexible, cost-effective platform for capturing high-resolution aerial imagery without disrupting traffic flow, enabling frequent and large-scale inspections.

Combining UAV imagery with deep learning-based object detection allows potholes to be identified and localized rapidly and automatically. However, the task presents significant challenges. Potholes in aerial images exhibit large intra-class scale variation: small potholes (e.g., 10-20 cm in diameter) make up the majority of instances but occupy very few pixels, resulting in weak feature signals. Conversely, large, irregularly shaped potholes can be visually confused with road cracks, patches, or stains. Furthermore, complex backgrounds, varying lighting, and shadows introduce substantial interference. Conventional detection models struggle with these issues, leading to insufficient accuracy, high rates of missed detections, and low operational efficiency. To address these challenges, this paper proposes an improved YOLOv11-based algorithm designed for robust pothole detection in aerial imagery captured by UAV platforms.

Building upon the efficient YOLOv11 architecture, our proposed model introduces two novel modules to enhance feature extraction and fusion capabilities tailored for the pothole detection task. The original C3K2 feature extraction modules in the backbone network are replaced with a Lightweight Enhanced Detection Module (LEDM). Concurrently, the feature fusion path in the neck is augmented with an Enhanced Multi-scale Attention Fusion Module (EMSA). These modifications work synergistically to improve the model’s sensitivity to multi-scale weak-feature targets and its discriminative power in complex scenes.

1. Methodology

1.1 Network Architecture Overview

The overall architecture of our proposed model is built upon the YOLOv11 framework, which is known for its balance between speed and accuracy. The key improvements are concentrated in the Backbone and Neck components to tackle the specific difficulties of UAV pothole imagery. The modified network accepts an input image $I \in \mathbb{R}^{H \times W \times 3}$. The Backbone, enhanced with LEDM modules, progressively extracts and refines multi-scale pothole features. The Neck, incorporating the EMSA module, fuses these features across different semantic levels and resolutions. Finally, the detection head predicts bounding boxes and class confidence for potholes. The core innovations lie in the design of the LEDM and EMSA modules, which are described in detail below.

1.2 Lightweight Enhanced Detection Module (LEDM)

The LEDM module is designed to replace the standard C3K2 blocks in the backbone. Its purpose is to perform efficient, targeted feature extraction for potholes of varying scales while suppressing irrelevant background information, a common source of interference in UAV road surveys. The module employs a strategy of grouped parallel processing followed by adaptive feature enhancement.

The processing flow within an LEDM module can be formulated as follows. First, the input feature map $F_{in}$ undergoes a pointwise convolution for channel adjustment:
$$F_{adj} = \text{BN}(\text{Conv}_{1\times1}(F_{in}; K_{adj}, b_{adj})) \in \mathbb{R}^{H \times W \times C_{g}}$$
where $C_g = g \cdot c$, with $g$ being the number of groups and $c$ the channels per group.

This adjusted feature map is then split into $g$ groups along the channel dimension:
$$F_{adj} \rightarrow \{F_{split,1}, F_{split,2}, \dots, F_{split,g}\}$$
Each group $F_{split,i}$ is processed independently by an Adaptive Module. This module consists of two parallel Adaptive_Efficient_Conv (AEC) sub-modules. The AEC sub-module is the core of adaptive enhancement. For an input group feature $X$, it first computes a channel-wise attention vector:
$$z = \text{AdaptiveAvgPool}(X) \in \mathbb{R}^{1 \times 1 \times c}$$
$$a = \text{Softmax}(\text{Conv}_{1\times1}(z)) \in \mathbb{R}^{1 \times 1 \times c'}$$
This attention vector $a$ is then used to guide three parallel convolutional branches with kernels of size $1 \times M$, $M \times 1$, and $K \times K$, capturing features from different spatial perspectives. The outputs are summed:
$$X_{aec} = \sum_{j \in \{1\times M, M\times 1, K\times K\}} (X \odot \text{Expand}(\text{Conv}_j(a)))$$
where $\odot$ denotes element-wise multiplication and $\text{Expand}()$ broadcasts the vector to the spatial dimensions of $X$.
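
The pooling, softmax, and broadcast steps above can be illustrated with a minimal pure-Python sketch. It covers only the computation of $z$ and $a$ and the $\text{Expand}(\cdot) \odot X$ scaling; the three convolutional branches are omitted. The function names, the nested-list layout `[c][h][w]`, and the explicit weight matrix standing in for the $1\times1$ convolution are assumptions made for illustration, not the authors' implementation.

```python
import math

def channel_attention(x, weight):
    """x: feature map as nested lists [c][h][w].
    weight: [c_out][c] matrix standing in for the 1x1 convolution (assumed)."""
    # Global average pooling: one scalar per channel (z in the text).
    z = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in x]
    # On a 1x1 spatial map, a 1x1 convolution is a matrix-vector product.
    logits = [sum(w * v for w, v in zip(w_row, z)) for w_row in weight]
    # Numerically stable softmax across channels (a in the text).
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def scale_channels(x, a):
    """Expand(a) ⊙ X: multiply every pixel of channel i by a[i]."""
    return [[[a_i * v for v in row] for row in ch] for ch, a_i in zip(x, a)]

# Two 2x2 channels with constant values 1.0 and 3.0, identity 1x1 weights.
x = [[[1.0, 1.0], [1.0, 1.0]], [[3.0, 3.0], [3.0, 3.0]]]
a = channel_attention(x, [[1.0, 0.0], [0.0, 1.0]])
y = scale_channels(x, a)
```

Because the softmax normalizes across channels, the weights in `a` sum to one, so channels with a stronger pooled response are emphasized relative to the others.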

The two AEC outputs for group $i$ are concatenated, fused via a $1\times1$ convolution, and then calibrated by a dynamic attention weight generated through a lightweight path consisting of depthwise convolution (DWConv) and linear layers:
$$F_{calib,i} = F_{fused,i} \odot \text{Linear}(\text{Activation}(\text{DWConv}(\text{Linear}(F_{fused,i}))))$$
A residual connection is added to preserve the original information:
$$F_{adaptive,i} = F_{calib,i} + F_{split,i}$$
Finally, all enhanced group features $\{F_{adaptive,i}\}_{i=1}^g$ are concatenated with the original split features $\{F_{split,i}\}_{i=1}^g$ and fused through a final $1\times1$ convolution to produce the output of the LEDM module, $F_{ledm}$. This design focuses the extraction on pothole features while keeping the computation light, which matters for the analysis of UAV imagery.
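
To make the channel bookkeeping of the LEDM flow easier to follow, the sketch below traces the channel count at each stage. It assumes that each AEC sub-module preserves the $c$ channels of its group; this is not stated explicitly, but the residual sum $F_{calib,i} + F_{split,i}$ requires $F_{fused,i}$ to have $c$ channels. The function name, stage labels, and the `c_out` parameter are hypothetical.

```python
def ledm_channel_trace(c_in, g, c, c_out):
    """Return (stage, channels) pairs for one pass through an LEDM block."""
    trace = [("input F_in", c_in)]
    trace.append(("1x1 conv + BN (F_adj)", g * c))
    trace.append(("each split group (F_split,i)", c))
    # Two AEC outputs (assumed c channels each) are concatenated, then a
    # 1x1 conv must return to c channels so the residual sum is defined.
    trace.append(("concat of two AEC outputs", 2 * c))
    trace.append(("1x1 fusion (F_fused,i)", c))
    trace.append(("calibration + residual (F_adaptive,i)", c))
    # All g enhanced groups and all g original groups are concatenated.
    trace.append(("concat of adaptive and split groups", 2 * g * c))
    trace.append(("final 1x1 conv (F_ledm)", c_out))
    return trace

# Example values chosen only for illustration.
for stage, channels in ledm_channel_trace(c_in=64, g=4, c=16, c_out=64):
    print(f"{stage:40s} {channels}")
```

Under these assumptions the final $1\times1$ convolution sees $2gc$ input channels, twice the width of $F_{adj}$.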

1.3 Enhanced Multi-scale Attention Fusion Module (EMSA)

While the Backbone extracts discriminative features, fusing them effectively across scales in the Neck is equally important. The original feature pyramid network can lose fine details from shallow layers, leading to missed small potholes. Our proposed EMSA module enhances multi-scale fusion through dynamic attention calibration and multi-branch spatial refinement.

The EMSA module takes multi-level features from the backbone (e.g., shallow $F_s$, middle $F_m$, deep $F_d$). These features are first aligned in spatial resolution through down-sampling or up-sampling operations and concatenated:
$$F_{cat} = \text{Concat}(\text{Down}(F_s), \text{Conv}_{1\times1}(F_m), \text{Upsample}(F_d))$$

The core fusion employs four parallel depthwise convolution (DWConv) branches with different kernel sizes (e.g., 5, 7, 9, 11). This allows the module to capture and refine pothole features at multiple receptive fields simultaneously, which is essential for handling the scale variance in UAV aerial data.
$$F_{fusion} = \sum_{k \in \{5,7,9,11\}} \text{DWConv}_{k\times k}(F_{cat})$$
The fused feature is then further refined through a convolutional layer to produce the final output $F_{emsa}$. This structure ensures that both the subtle edges of small potholes and the extended contours of large ones are effectively integrated and enhanced before being passed to the detection head.
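
As a rough illustration of the four-branch design, the sketch below computes the padding each branch needs and its weight count. Since the branch outputs are summed, they must share the spatial size of $F_{cat}$, which for odd kernels at stride 1 means a padding of $k // 2$. The helper names are hypothetical, and the absence of bias terms, dilation, and stride is an assumption rather than a detail given in the text.

```python
def same_padding(k):
    """Padding that preserves spatial size for an odd kernel at stride 1."""
    return k // 2

def depthwise_params(channels, k):
    """Weights of a k x k depthwise convolution: one k x k filter per channel
    (bias omitted, an assumption)."""
    return channels * k * k

def emsa_branch_summary(channels, kernels=(5, 7, 9, 11)):
    """Per-branch padding and weight count for the parallel DWConv branches."""
    return [
        {"kernel": k, "padding": same_padding(k),
         "params": depthwise_params(channels, k)}
        for k in kernels
    ]

# Example with 64 channels, a value chosen only for illustration.
branches = emsa_branch_summary(64)
total_params = sum(b["params"] for b in branches)
```

Per channel, the four branches hold $25 + 49 + 81 + 121 = 276$ weights, so the cost of the multi-branch stage grows linearly with the channel count of $F_{cat}$.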

1.4 Loss Function

The model is trained using a composite loss function standard for YOLO models, which includes a classification loss ($L_{cls}$) and a bounding box regression loss ($L_{box}$). We use the Complete Intersection over Union (CIoU) loss for $L_{box}$ because it accounts for overlap area, center-point distance, and aspect ratio, providing more stable gradients for bounding box regression:
$$L_{total} = \lambda_{box} L_{box} + \lambda_{cls} L_{cls}$$
where $L_{box} = 1 - \text{CIoU}$, and $\lambda_{box}$ and $\lambda_{cls}$ are balancing weights.
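
For reference, a minimal sketch of the box term follows. It uses the commonly published CIoU definition, $\text{CIoU} = \text{IoU} - \rho^2 / c^2 - \alpha v$, with $v$ measuring aspect-ratio mismatch. The `(x1, y1, x2, y2)` box format, the function name, and the `eps` stabilizer are assumptions; the paper's training code may differ in such details.

```python
import math

def ciou_loss(box_p, box_g, eps=1e-9):
    """Return 1 - CIoU for a predicted and a ground-truth box (x1, y1, x2, y2)."""
    px1, py1, px2, py2 = box_p
    gx1, gy1, gx2, gy2 = box_g
    pw, ph = px2 - px1, py2 - py1
    gw, gh = gx2 - gx1, gy2 - gy1
    # Overlap area and IoU.
    iw = max(0.0, min(px2, gx2) - max(px1, gx1))
    ih = max(0.0, min(py2, gy2) - max(py1, gy1))
    inter = iw * ih
    iou = inter / (pw * ph + gw * gh - inter + eps)
    # Squared distance between box centers.
    rho2 = ((px1 + px2 - gx1 - gx2) ** 2 + (py1 + py2 - gy1 - gy2) ** 2) / 4.0
    # Squared diagonal of the smallest box enclosing both.
    cw = max(px2, gx2) - min(px1, gx1)
    ch = max(py2, gy2) - min(py1, gy1)
    c2 = cw ** 2 + ch ** 2 + eps
    # Aspect-ratio consistency term and its trade-off weight.
    v = (4.0 / math.pi ** 2) * (
        math.atan(gw / (gh + eps)) - math.atan(pw / (ph + eps))
    ) ** 2
    alpha = v / (1.0 - iou + v + eps)
    return 1.0 - (iou - rho2 / c2 - alpha * v)

same = ciou_loss((0.0, 0.0, 10.0, 10.0), (0.0, 0.0, 10.0, 10.0))
apart = ciou_loss((0.0, 0.0, 1.0, 1.0), (2.0, 2.0, 3.0, 3.0))
```

Unlike a plain IoU loss, the value keeps changing for non-overlapping boxes because the center-distance term stays informative, which is the property the text refers to as more stable gradients.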

2. Experiments and Analysis

2.1 Dataset and Implementation Details

The experimental dataset was constructed in collaboration with a transportation bureau in Jilin Province, China. Aerial video of multiple highway segments was captured using a DJI Mavic 3 UAV flying at altitudes between 40 m and 80 m. After careful screening, 975 high-quality images were selected and annotated with “pothole” bounding boxes, resulting in a total of 2,323 instances. The dataset was split into training (80%), validation (10%), and test (10%) sets. The distribution of bounding box sizes and center points was analyzed to understand the data characteristics, confirming the presence of significant scale variation.
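
An 80/10/10 split of this kind can be sketched as follows. The function name, the fixed seed, and the rounding rule (floor for training and validation, remainder to test) are assumptions; the paper does not report the exact per-set counts or how the split was randomized.

```python
import random

def split_dataset(items, ratios=(0.8, 0.1, 0.1), seed=0):
    """Shuffle reproducibly and split into train, validation, and test lists."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = int(ratios[0] * len(items))
    n_val = int(ratios[1] * len(items))
    # The test set takes the remainder, so no image is dropped by rounding.
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test

# Hypothetical file names standing in for the 975 annotated images.
images = [f"img_{i:04d}.jpg" for i in range(975)]
train, val, test = split_dataset(images)
```

With this rounding rule, 975 images yield 780 training, 97 validation, and 98 test images; a different rounding convention would shift one image between the last two sets.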

All experiments were conducted on a Linux server with an NVIDIA RTX 4090 GPU. The model was trained for 200 epochs with an input size of $640 \times 640$ pixels, using the SGD optimizer with an initial learning rate of 0.01, momentum of 0.937, and weight decay of 0.0005.

2.2 Evaluation Metrics

The performance is evaluated using standard object detection metrics: Precision ($P$), Recall ($R$), mean Average Precision at IoU=0.5 ($mAP@0.5$), and mean Average Precision over IoU thresholds from 0.5 to 0.95 ($mAP@0.5:0.95$). Model complexity is assessed by the number of parameters (Params) and floating-point operations (GFLOPs). The formulas are as follows:
$$P = \frac{TP}{TP + FP} \times 100\%$$
$$R = \frac{TP}{TP + FN} \times 100\%$$
$$mAP = \frac{1}{N} \sum_{i=1}^{N} \left( \int_{0}^{1} P_i(r) dr \right) \times 100\%$$
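where $TP$, $FP$, and $FN$ are the numbers of true positives, false positives, and false negatives, $N$ is the number of classes, and $P_i(r)$ is the precision of class $i$ as a function of recall $r$. Because the dataset contains a single class (pothole), $N = 1$ and mAP equals the average precision of that class.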
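
The formulas above can be sketched in a few lines. The average-precision function uses all-point interpolation (the precision curve is first made monotonically non-increasing, then integrated over recall), which is one common way to evaluate the integral; the exact interpolation used in the experiments is not stated, so this choice and the function names are assumptions.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall in percent from detection counts."""
    p = 100.0 * tp / (tp + fp) if tp + fp else 0.0
    r = 100.0 * tp / (tp + fn) if tp + fn else 0.0
    return p, r

def average_precision(recalls, precisions):
    """Area under the precision-recall curve, in percent.
    recalls must be ascending; both inputs are fractions in [0, 1]."""
    r = [0.0] + list(recalls) + [1.0]
    p = [0.0] + list(precisions) + [0.0]
    # Precision envelope: each point takes the best precision to its right.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Sum rectangle areas between consecutive recall values.
    return 100.0 * sum((r[i + 1] - r[i]) * p[i + 1] for i in range(len(r) - 1))

p, r = precision_recall(tp=8, fp=2, fn=2)
ap = average_precision([0.5, 1.0], [1.0, 0.5])
```

For $mAP@0.5:0.95$, the same computation would be repeated at each IoU threshold from 0.5 to 0.95 and the results averaged.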

2.3 Ablation Study

To validate the contribution of each proposed module, an ablation study was conducted starting from the baseline YOLOv11n model. The results are summarized in the table below.

Configuration	LEDM	EMSA	P (%)	R (%)	mAP@0.5 (%)	mAP@0.5:0.95 (%)	Params (M)	GFLOPs
Baseline (YOLOv11n)	×	×	86.4	69.1	81.9	52.2	2.58	6.3
Baseline + LEDM	√	×	84.1	73.2	83.9	53.2	2.45	6.1
Baseline + EMSA	×	√	87.2	74.5	85.9	57.4	2.59	7.5
Proposed Model	√	√	83.5	82.7	86.6	58.3	2.47	7.2

The ablation study shows the contribution of each module. Adding only LEDM raises Recall from 69.1% to 73.2% and $mAP@0.5$ from 81.9% to 83.9% while reducing parameters (2.58 M to 2.45 M) and GFLOPs (6.3 to 6.1), at the cost of a 2.3-point drop in Precision. Adding only EMSA raises Recall to 74.5% and $mAP@0.5:0.95$ to 57.4%, indicating its strength in multi-scale feature fusion, with an increase in computational cost from 6.3 to 7.5 GFLOPs. The full model, combining both modules, achieves the highest Recall and mAP: a Recall of 82.7% (13.6 percentage points above the baseline, a 19.68% relative increase) and an $mAP@0.5:0.95$ of 58.3% (6.1 points above the baseline, an 11.69% relative increase). Its Precision (83.5%) is 2.9 points below the baseline, so the gain comes mainly from fewer missed detections. These results suggest that LEDM and EMSA address complementary aspects of the pothole detection problem, namely efficient feature extraction and robust multi-scale fusion, while keeping the model lightweight (2.47 M parameters, 7.2 GFLOPs) and suitable for practical UAV applications.
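
The percentage increases quoted for the full model are relative to the baseline value, not differences in percentage points. The short check below reproduces both readings from the table entries; the function name is hypothetical.

```python
def gains(baseline, improved):
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = improved - baseline
    relative = 100.0 * absolute / baseline
    return round(absolute, 2), round(relative, 2)

recall_gain = gains(69.1, 82.7)   # Recall: baseline vs. proposed model
map_gain = gains(52.2, 58.3)      # mAP@0.5:0.95: baseline vs. proposed model
print(recall_gain, map_gain)      # (13.6, 19.68) (6.1, 11.69)
```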

2.4 Comparative Experiments

We compared our improved YOLOv11 model against other mainstream lightweight detectors, YOLOv5n and YOLOv8n, on the same pothole dataset. The results are presented in the following table.

Model	P (%)	R (%)	mAP@0.5 (%)	mAP@0.5:0.95 (%)	Params (M)	GFLOPs
YOLOv5n	82.0	60.6	74.2	42.0	2.50	7.1
YOLOv8n	80.8	60.2	72.3	41.6	3.00	8.1
Improved YOLOv11n (Ours)	83.5	82.7	86.6	58.3	2.47	7.2

The proposed model outperforms both detectors on all four accuracy metrics. Most notably, its Recall is more than 22 percentage points higher than that of YOLOv5n and YOLOv8n, and its $mAP@0.5:0.95$ is more than 16 percentage points higher. This indicates a greatly reduced rate of missed detections and stronger robustness across different IoU thresholds. These gains are achieved with fewer parameters than either detector (2.47 M versus 2.50 M and 3.00 M) and a computational cost comparable to YOLOv5n (7.2 versus 7.1 GFLOPs), supporting the effectiveness and efficiency of the architectural improvements for pothole detection in UAV imagery.

2.5 Visual Analysis of Detection Results

Qualitative results on test images further illustrate the model’s capabilities. In scenes with isolated small potholes, the baseline YOLOv11n often fails to detect them or assigns very low confidence, while the proposed model identifies them with higher confidence. In complex scenes with multiple potholes of varying sizes and distracting backgrounds (such as stains or patches), the baseline model misses many potholes. In contrast, the proposed model consistently detects more potholes, including smaller and more ambiguous ones, with higher confidence scores. This visual evidence is consistent with the quantitative metrics and indicates that the improved model mitigates the core challenges of scale variation and background interference in UAV aerial surveys.

3. Conclusion

This paper presents an improved YOLOv11-based algorithm designed to address the specific challenges of highway pothole detection in aerial imagery captured by UAVs. The key contributions are the design and integration of two novel modules: the Lightweight Enhanced Detection Module (LEDM) for efficient and adaptive multi-scale feature extraction in the backbone, and the Enhanced Multi-scale Attention Fusion Module (EMSA) for robust cross-scale information integration in the neck. Experiments on a dedicated dataset show that the proposed model outperforms the baseline YOLOv11n in Recall and mAP, and outperforms the mainstream lightweight detectors YOLOv5n and YOLOv8n on all reported accuracy metrics. The improvements in Recall and mAP indicate a substantial reduction in missed detections and better overall accuracy for multi-scale potholes. The model remains lightweight, making it well suited for practical deployment in intelligent highway inspection systems based on UAV technology. Future work will focus on expanding the dataset to include a wider variety of weather and pavement conditions to further improve the model’s generalization capability for nationwide application.