Multi-Object Detection for Daily Road Maintenance Inspection with UAV in China

China’s highway network has expanded rapidly and now forms the world’s largest comprehensive transportation system. Concurrently, the demand for road maintenance continues to accumulate and evolve, shifting the focus of highway development from primarily construction to a balanced emphasis on both construction and maintenance. According to recent statistics, the mileage of maintained highways has reached millions of kilometers, accounting for over 99% of the total national highway mileage. Traditional road inspection methods, which rely heavily on manual labor, suffer from high workload, low efficiency, delayed response, and interference with traffic. These shortcomings make it increasingly difficult to meet the requirements for fine-grained, efficient, and intelligent highway operation and maintenance. Unmanned Aerial Vehicles (UAVs), or drones, offer significant advantages such as rapid deployment, aerial overview, flexible mobility, and non-contact operation. These capabilities can markedly improve inspection coverage and efficiency while reducing labor costs. UAVs are already widely applied in power line and pipeline inspection, slope reconstruction, traffic flow statistics, road marking detection, rock mass assessment, and bridge inspection, and they enable rapid detection and feedback on the condition of highway assets. However, the accuracy of assessing pavement distress and facility damage remains a key technical challenge, limited by the diverse target types, large scale variations, and complex backgrounds present in UAV-captured road images.

Current deep learning-based object detection algorithms fall into two main groups: CNN-based and Transformer-based detectors. Within CNN-based methods, two-stage detectors such as the R-CNN family (e.g., R-CNN, Faster R-CNN) first generate region proposals and then classify and refine them. In contrast, single-stage detectors such as the YOLO series and SSD predict boxes and classes directly in one pass. For instance, adaptations of YOLOv3 and YOLOv5 have been applied to road crack detection with reported high precision. Transformer-based models, such as DETR and variants like DINO, have also been explored for civil infrastructure inspection, offering end-to-end detection pipelines. Despite this progress, challenges persist for UAV-based highway inspection in China: a lack of dedicated aerial image datasets for road inspection targets, class imbalance in existing datasets, and a predominant focus on pavement distress over other critical infrastructure. Furthermore, balancing detection accuracy against the real-time processing speed required for UAV deployment remains a significant hurdle.

To address the insufficient detection accuracy and limited real-time performance caused by small-scale targets, complex backgrounds, and high computational overhead in UAV-based highway inspection, this study proposes an improved lightweight detection algorithm named YOLOv11-UR, based on the recently released YOLOv11n framework. The core contributions include integrating a Mixed Local Channel Attention (MLCA) mechanism into the backbone network, employing GSConv in the neck for lightweight feature processing, and replacing the standard C3k2 module with a VoVGSCSP structure for efficient feature aggregation. We first construct a dedicated UAV road inspection dataset tailored to Chinese highway conditions. The proposed YOLOv11-UR model is then developed and evaluated. Finally, ablation studies, comparative experiments, and visual analysis are used to validate the improvements and to assess the suitability of YOLOv11-UR for practical UAV inspection scenarios in China.

Dataset Construction for UAV-Based Highway Inspection

To tackle the scarcity and imbalance of datasets for traffic infrastructure, this work constructs a specialized UAV road inspection dataset. Image acquisition was conducted on a typical bidirectional four-lane provincial highway in Nanjing, Jiangsu Province, China. This route carries frequent heavy-duty traffic, leading to significant pavement service pressure and a rich distribution of distress types, making it highly representative. Images were captured by a drone flying at low altitude and shooting vertically downwards, offering the benefits of non-contact operation, high coverage, and minimal traffic disruption. A total of 1,800 high-resolution road images were initially collected.

UAV Flight and Camera Parameters

Considering cruising capability, wind resistance, safety, and feasibility, a commercially available and widely used DJI Mavic 3 Pro drone was selected as the image acquisition platform. The camera parameters were set as follows: a resolution of 12 megapixels (4032 × 2268), a focal length of 19.35 mm, and a 1/1.3-inch sensor. The flight parameters were determined after investigating roadside obstacle heights and their practical impact on aerial image quality. The UAV was flown at an altitude of 50 meters and a speed of 3 m/s, with a vertical nadir viewing angle and a shooting interval of 2 seconds. The key parameters for data collection are summarized in Table 1.

Parameter	Value
Image Resolution (pixels)	4032 × 2268
Sensor Size (mm)	9.65 × 7.2
Focal Length (mm)	19.35
Flight Altitude (m)	50
Flight Speed (m/s)	3
Viewing Angle	Vertical Nadir
Shooting Interval (s)	2
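
The parameters in Table 1 determine the ground sampling distance (GSD) and the spacing between consecutive shots. The following is a minimal sketch of that arithmetic under a pinhole-camera model; the function name is hypothetical, and the overlap estimate assumes square pixels and that the flight direction runs along the short image axis, neither of which is stated in the text.

```python
def ground_sample_distance(sensor_width_mm, focal_length_mm, altitude_m, image_width_px):
    """Ground footprint of one pixel in metres for a nadir-pointing pinhole camera."""
    return (sensor_width_mm * altitude_m) / (focal_length_mm * image_width_px)


# Values taken from Table 1.
SENSOR_WIDTH_MM = 9.65
FOCAL_LENGTH_MM = 19.35
ALTITUDE_M = 50.0
IMAGE_WIDTH_PX, IMAGE_HEIGHT_PX = 4032, 2268
SPEED_M_S, INTERVAL_S = 3.0, 2.0

gsd_m = ground_sample_distance(SENSOR_WIDTH_MM, FOCAL_LENGTH_MM, ALTITUDE_M, IMAGE_WIDTH_PX)
footprint_width_m = gsd_m * IMAGE_WIDTH_PX

# Assumption: square pixels, so the same GSD applies along the image height.
footprint_height_m = gsd_m * IMAGE_HEIGHT_PX

# Distance travelled between two consecutive exposures.
shot_spacing_m = SPEED_M_S * INTERVAL_S

# Assumption: the drone flies along the short image axis.
forward_overlap = 1.0 - shot_spacing_m / footprint_height_m

print(f"GSD: {gsd_m * 100:.3f} cm/px")
print(f"Footprint: {footprint_width_m:.2f} m x {footprint_height_m:.2f} m")
print(f"Shot spacing: {shot_spacing_m:.1f} m, forward overlap: {forward_overlap:.1%}")
```

Under these assumptions each pixel covers roughly 0.62 cm of pavement, and consecutive images are taken 6 m apart.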


Dataset Curation and Annotation

Guided by Chinese standards such as “Technical Standard of Highway Maintenance” (JTG 5110) and “Technical Specifications for Maintenance of Highway Asphalt Pavements” (JTG 5142), and combined with practical engineering surveys, the key focuses and technical requirements for daily road inspection were defined. Considering the operational characteristics and feasibility of UAV inspection, the inspection targets in this study are categorized into two main groups: asphalt pavement distress and traffic infrastructure defects. Specifically, the 14 target classes are: longitudinal crack, transverse crack, diagonal crack, alligator crack, pothole, patched area (block), patched area (strip), missing marking, damaged drainage facility, damaged curb, damaged guardrail, vegetation encroachment, water accumulation, and foreign object on pavement.

The large original images captured by the UAV were cropped into 640×640 pixel patches to match the input size of object detection models and to meet computational constraints. Data augmentation was applied, especially to classes with fewer samples, to enhance the model’s robustness. The techniques included random combinations of flipping, translation, rotation, brightness adjustment, noise addition, and random occlusion. Annotation was performed using LabelImg, resulting in a final dataset of 7,020 images with 160,008 labeled instances. The dataset was split into training, validation, and test sets in a 7:1:2 ratio. The distribution of labels across the 14 categories is listed in Table 2; it is imbalanced in a way that reflects real-world conditions.

Target Category	Number of Instances	Target Category	Number of Instances
Longitudinal Crack	15,420	Missing Marking	8,905
Transverse Crack	14,832	Damaged Drainage	9,112
Diagonal Crack	13,754	Damaged Curb	10,234
Alligator Crack	11,208	Damaged Guardrail	9,876
Pothole	10,567	Vegetation Encroachment	11,045
Patched Area (Block)	12,409	Water Accumulation	9,543
Patched Area (Strip)	11,897	Foreign Object	11,206
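
The cropping and splitting steps described above can be illustrated with a short sketch. The text does not state how image borders were handled or how patches were selected, so the tiling rule here (non-overlapping tiles, with the last tile shifted back to stay inside the image) is an assumption, and both function names are hypothetical.

```python
def tile_origins(length_px, tile_px):
    """Start offsets of tiles along one image axis; the last tile is shifted back to stay in bounds."""
    if length_px <= tile_px:
        return [0]
    origins = list(range(0, length_px - tile_px + 1, tile_px))
    if origins[-1] + tile_px < length_px:
        # Assumed border rule: add one extra tile flush with the image edge.
        origins.append(length_px - tile_px)
    return origins


def split_counts(total, ratios=(7, 1, 2)):
    """Integer subset sizes for a ratio split; any remainder goes to the last subset."""
    unit = sum(ratios)
    counts = [total * r // unit for r in ratios[:-1]]
    counts.append(total - sum(counts))
    return counts


xs = tile_origins(4032, 640)   # offsets along the image width
ys = tile_origins(2268, 640)   # offsets along the image height
patches_per_image = len(xs) * len(ys)

train, val, test = split_counts(7020)

print(f"{len(xs)} x {len(ys)} = {patches_per_image} candidate patches per image")
print(f"train/val/test = {train}/{val}/{test}")
```

With this rule a 4032 × 2268 image yields a 7 × 4 grid of candidate patches, and the 7:1:2 split of 7,020 images gives 4,914, 702, and 1,404 images. The grid only shows the geometry; it does not reproduce how the final 7,020 patches were chosen.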

The Proposed YOLOv11-UR Method

YOLOv11 is a state-of-the-art one-stage object detector introduced in 2024, building upon the classic backbone-neck-head architecture. Its backbone incorporates modules such as C3k2 with dynamic kernel selection and C2PSA with position-sensitive attention. The neck uses an FPN+PAN structure, and the head employs depthwise separable convolutions for efficiency. However, in UAV-based highway inspection, small targets, complex backgrounds, and the need for real-time processing can cause missed detections, false alarms, and suboptimal performance with the base model. To address these issues, we propose the YOLOv11-UR model with three key improvements.

Backbone Enhancement with Mixed Local Channel Attention (MLCA)

The original YOLOv11n backbone relies on standard convolutions, which are limited in modeling global context and channel-wise dependencies; this affects performance on the small or occluded targets common in UAV imagery. We therefore integrate the Mixed Local Channel Attention (MLCA) module into the backbone. The MLCA mechanism simultaneously performs channel recalibration and spatial local-global information modeling. Its workflow can be described as follows:

1. The input feature map is processed by Local Average Pooling (LAP) to preserve local spatial details.
2. The features are then split into two parallel branches. One branch applies Global Average Pooling (GAP) to capture image-level global context, compressing spatial information into channel descriptors. The other branch retains the locally pooled features.
3. Both branches undergo feature encoding and dimensionality reduction via 1D convolutions (Conv1d) to establish nonlinear channel dependencies.
4. The features from both branches are restored to the original resolution via unpooling and fused through element-wise addition.

This process allows local details and global semantics to complement and enhance each other. Integrating MLCA enables explicit modeling of channel and spatial attention within the forward convolution flow, strengthening the model’s feature representation for targets in the complex environments encountered during UAV inspection, with minimal parameter increase. The attention weight $A$ for a feature map $F$ can be conceptualized as:

$$ A(F) = \sigma(\text{Conv1d}(\text{GAP}(F)) + \text{UNAP}(\text{Conv1d}(\text{LAP}(F)))) $$

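where $\sigma$ denotes the sigmoid activation function and $\text{UNAP}$ denotes the unpooling operation that restores the pooled features to the input resolution.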
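
To make the expression for $A(F)$ concrete, the following is a simplified pure-Python sketch on nested lists. It is not the authors’ implementation: the function names, the pooled grid size `k`, and the fixed Conv1d weights are illustrative assumptions (real weights are learned), the Conv1d is applied across channels at each pooled location, and unpooling is done by nearest-neighbour replication.

```python
import math


def local_avg_pool(fmap, k):
    """Average-pool a [C][H][W] nested list to [C][k][k]; assumes H and W are divisible by k."""
    height, width = len(fmap[0]), len(fmap[0][0])
    bh, bw = height // k, width // k
    pooled = []
    for channel in fmap:
        grid = []
        for i in range(k):
            row = []
            for j in range(k):
                block = [channel[i * bh + di][j * bw + dj]
                         for di in range(bh) for dj in range(bw)]
                row.append(sum(block) / len(block))
            grid.append(row)
        pooled.append(grid)
    return pooled


def conv1d_same(vec, weights):
    """Zero-padded 1D convolution along a channel vector (output length equals input length)."""
    pad = len(weights) // 2
    out = []
    for c in range(len(vec)):
        acc = 0.0
        for t, w in enumerate(weights):
            idx = c + t - pad
            if 0 <= idx < len(vec):
                acc += w * vec[idx]
        out.append(acc)
    return out


def mlca_attention(fmap, k=2, weights=(0.25, 0.5, 0.25)):
    """Return attention weights A(F) with the same [C][H][W] shape as the input."""
    channels, height, width = len(fmap), len(fmap[0]), len(fmap[0][0])
    bh, bw = height // k, width // k
    local = local_avg_pool(fmap, k)                                   # LAP(F)
    # GAP: with equal-sized blocks, the mean of block means is the global mean.
    global_desc = [sum(map(sum, grid)) / (k * k) for grid in local]
    global_enc = conv1d_same(global_desc, weights)                    # Conv1d(GAP(F))
    local_enc = [[[0.0] * k for _ in range(k)] for _ in range(channels)]
    for i in range(k):
        for j in range(k):
            encoded = conv1d_same([local[c][i][j] for c in range(channels)], weights)
            for c in range(channels):
                local_enc[c][i][j] = encoded[c]                       # Conv1d(LAP(F))
    # Unpool by replication, add the broadcast global term, then apply the sigmoid.
    return [[[1.0 / (1.0 + math.exp(-(global_enc[c] + local_enc[c][y // bh][x // bw])))
              for x in range(width)] for y in range(height)] for c in range(channels)]


zeros = [[[0.0] * 4 for _ in range(4)] for _ in range(3)]
ones = [[[1.0] * 4 for _ in range(4)] for _ in range(3)]
attn_zeros = mlca_attention(zeros)
attn_ones = mlca_attention(ones)
```

An all-zero input gives a weight of 0.5 everywhere, since both branches contribute zero before the sigmoid. In an attention module of this kind the weights would typically be multiplied element-wise with $F$; that step is omitted here because the text only defines $A(F)$.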

Lightweight Convolution in the Neck with GSConv

In the neck section, we replace standard convolutions with GSConv to balance computational efficiency and feature expression. GSConv hybridizes standard convolution (Conv) and depthwise separable convolution (DWConv). The input features first pass through a standard convolution layer to extract cross-channel semantic information. This output then enters a depthwise separable convolution layer for intra-channel feature refinement. The features from both paths are concatenated along the channel dimension and then fused via a channel shuffle operation. This shuffle allows the high-level information from the standard convolution to permeate the lightweight features generated by the depthwise convolution.

Compared to the standard convolutions in YOLOv11n, GSConv reduces inter-channel redundancy. It lowers the Floating Point Operations (FLOPs) and parameter count while maintaining a feature distribution close to that of standard convolution. This improves the model’s real-time performance for UAV applications without sacrificing accuracy, providing a foundation for further lightweight design. The GSConv operation can be summarized as:

$$ \text{GSConv}(F) = \text{Shuffle}(\text{Concat}(\text{Conv}(F), \text{DWConv}(\text{Conv}(F)))) $$
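
A rough sense of the savings can be obtained by counting weights. The sketch below is illustrative only: it assumes the dense convolution produces half of the output channels, the second path is a bias-free 5×5 depthwise convolution over that half, and the shuffle interleaves the two halves. These choices follow the commonly used GSConv design and are not specified in the text; all function names are hypothetical.

```python
def conv_params(c_in, c_out, k, groups=1):
    """Weight count of a bias-free 2D convolution with square kernel k."""
    return (c_in // groups) * c_out * k * k


def gsconv_params(c_in, c_out, k, dw_k=5):
    """Weight count of the assumed GSConv layout: dense conv to c_out/2, then depthwise conv."""
    half = c_out // 2
    dense = conv_params(c_in, half, k)
    depthwise = conv_params(half, half, dw_k, groups=half)
    return dense + depthwise


def channel_shuffle(channels, groups=2):
    """Interleave channel groups so each neighbourhood mixes dense and depthwise features."""
    per_group = len(channels) // groups
    return [channels[g * per_group + i] for i in range(per_group) for g in range(groups)]


standard = conv_params(128, 128, 3)
gs = gsconv_params(128, 128, 3)
print(f"standard conv: {standard} weights, GSConv: {gs} weights ({gs / standard:.1%})")

# Concat(Conv(F), DWConv(Conv(F))) followed by Shuffle, shown on channel labels.
concat = ["c0", "c1", "c2", "d0", "d1", "d2"]
print(channel_shuffle(concat))
```

For a 128-to-128 channel layer with a 3×3 kernel, this layout needs about half the weights of a standard convolution. The reduction for the whole network is much smaller (see the parameter counts in the ablation table), because only the neck convolutions are replaced.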

Efficient Feature Extraction with VoVGSCSP

The C3k2 module in YOLOv11n’s neck uses stacked standard convolutions, which can be computationally heavy. We introduce the VoVGSCSP module, inspired by VoVNet’s One-Shot Aggregation (OSA) mechanism, to replace C3k2. OSA aggregates the outputs of multiple feature extraction units at once in the channel dimension, followed by a convolution for integration. This avoids repetitive computation and gradient blocking associated with traditional sequential stacking, allowing shallow features to directly participate in deep feature fusion.

In the VoVGSCSP structure, input features are processed by two parallel branches. One branch passes through a lightweight GSBottleneck unit composed of two GSConv layers with a residual connection. The other branch uses a standard convolution for linear transformation. The outputs of both branches are concatenated and then integrated by a final convolution layer. VoVGSCSP combines the efficient feature aggregation of OSA with the lightweight refining capability of GSConv. It reduces parameters and FLOPs while improving feature extraction and fusion through effective information flow, providing more discriminative semantic information for subsequent detection in UAV images.

Experimental Setup and Evaluation Metrics

Implementation Details

Experiments were conducted on a workstation with an Intel Core i9-10900X CPU and an NVIDIA GeForce RTX 2080 Ti GPU. The software environment included Windows 11, Python 3.9, PyTorch 2.0.0, and CUDA 11.8. The model was trained for 350 epochs with a batch size of 32. The Stochastic Gradient Descent (SGD) optimizer was used with an initial learning rate of 0.01, momentum of 0.9, and weight decay of 0.0005. Input images were resized to 640×640 pixels. All models were trained from scratch without pre-trained weights to ensure a fair comparison under identical hyperparameter settings.
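
For reproducibility, the reported hyperparameters can be collected in a single configuration. The sketch below assumes the Ultralytics training interface, which the text does not name; the key names follow that library’s conventions, and the model and dataset file names in the commented usage are hypothetical.

```python
# Hyperparameters as reported in the text; key names follow Ultralytics conventions (assumption).
train_config = {
    "epochs": 350,
    "batch": 32,
    "imgsz": 640,
    "optimizer": "SGD",
    "lr0": 0.01,             # initial learning rate
    "momentum": 0.9,
    "weight_decay": 0.0005,
    "pretrained": False,     # all models were trained from scratch
}

# Hypothetical usage, assuming the ultralytics package and a dataset file "uav_road.yaml":
# from ultralytics import YOLO
# YOLO("yolo11n.yaml").train(data="uav_road.yaml", **train_config)
```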

Evaluation Metrics

To comprehensively evaluate model performance, standard object detection metrics were employed. These include Precision ($P$), Recall ($R$), and F1-Score, which measure the model’s discriminative ability. Average Precision (AP) for each class and the mean Average Precision (mAP) over all classes reflect the overall detection accuracy. For computational efficiency, we report parameters (Params), FLOPs, model size, and inference speed in Frames Per Second (FPS). The metrics are defined as follows:

Intersection over Union (IoU) measures localization accuracy:
$$ \text{IoU} = \frac{|A \cap B|}{|A \cup B|} $$
where $A$ is the predicted bounding box and $B$ is the ground truth box. A prediction is considered a True Positive (TP) if $\text{IoU} \geq 0.5$ and the classification is correct.

$$ P = \frac{TP}{TP + FP} $$
$$ R = \frac{TP}{TP + FN} $$
$$ \text{F1-Score} = \frac{2 \times P \times R}{P + R} = \frac{2 \times TP}{2 \times TP + FP + FN} $$
where $FP$ is the number of false positives and $FN$ is the number of false negatives.
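
The definitions above translate directly into code. The following is a minimal sketch with hypothetical function names and made-up example counts; boxes are assumed to be axis-aligned and given as `(x1, y1, x2, y2)`.

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def precision_recall_f1(tp, fp, fn):
    """Precision, Recall, and F1-Score from detection counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return precision, recall, f1


# A prediction shifted by 2 px against a 10 x 10 ground-truth box still passes IoU >= 0.5.
close_match = iou((0, 0, 10, 10), (2, 0, 12, 10))
# A 5 px shift does not.
poor_match = iou((0, 0, 10, 10), (5, 0, 15, 10))

# Illustrative counts, not taken from the experiments.
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=4)
print(f"IoU: {close_match:.3f} vs {poor_match:.3f}; P={p:.3f}, R={r:.3f}, F1={f1:.3f}")
```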

The Average Precision (AP) is the area under the Precision-Recall curve:
$$ AP = \int_{0}^{1} P(R) \, dR $$
The mean Average Precision (mAP) is the average of AP across all $N$ classes:
$$ mAP = \frac{1}{N} \sum_{i=1}^{N} AP_i $$
In this study, we report mAP at an IoU threshold of 0.5 (mAP@0.5).
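
In practice the integral is approximated from a ranked list of detections. The sketch below uses all-point interpolation of the Precision-Recall curve, which is one common convention; the text does not say which interpolation was used, so this choice, the function names, and the example values are assumptions.

```python
def average_precision(is_tp_sorted, num_gt):
    """AP from TP/FP flags ordered by descending confidence (all-point interpolation)."""
    if num_gt == 0:
        return 0.0
    tp = fp = 0
    recalls, precisions = [0.0], [1.0]
    for flag in is_tp_sorted:
        if flag:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / num_gt)
        precisions.append(tp / (tp + fp))
    # Replace each precision by the maximum precision at any higher recall.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Sum the rectangular areas between consecutive recall values.
    return sum((recalls[i] - recalls[i - 1]) * precisions[i] for i in range(1, len(recalls)))


def mean_average_precision(ap_per_class):
    """Mean of per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)


# Three detections for one class (TP, FP, TP) against two ground-truth boxes.
ap_example = average_precision([True, False, True], num_gt=2)
map_example = mean_average_precision([ap_example, 0.5])
print(f"AP = {ap_example:.4f}, mAP over two classes = {map_example:.4f}")
```

The TP flags would come from matching predictions to ground truth at the IoU threshold of 0.5 used for mAP@0.5.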

Results and Analysis

Ablation Study

Ablation experiments were conducted to validate the contribution of each proposed component. Starting from the baseline YOLOv11n, modules were added incrementally. The results are summarized in Table 3.

Model	MLCA	GSConv	VoVGSCSP	Params (M)	FLOPs (G)	FPS	P (%)	R (%)	F1 (%)	mAP (%)
YOLOv11n (Baseline)				2.58	6.3	247.4	77.76	63.43	69.14	70.25
A	✓			2.58	6.3	246.0	76.01	65.15	69.43	70.75
B		✓		2.50	6.2	272.1	79.71	63.01	69.66	70.98
C			✓	2.57	5.9	252.7	78.66	65.14	70.46	71.13
D	✓	✓		2.50	6.2	125.8	73.94	66.61	69.30	70.94
E	✓		✓	2.66	6.1	121.0	75.56	67.21	70.65	72.85
F		✓	✓	2.57	5.9	248.8	78.92	65.04	70.51	71.12
YOLOv11-UR (Ours)	✓	✓	✓	2.57	5.9	237.9	78.26	66.60	71.40	73.34

The results show that each module contributes on its own. Adding MLCA (Model A) improved Recall and mAP by 1.72 and 0.50 percentage points respectively, enhancing feature representation at negligible cost. Integrating GSConv (Model B) achieved the highest Precision (79.71%) and FPS (272.1) while reducing parameters, showing its advantage in lightweight efficiency. Employing VoVGSCSP (Model C) yielded the lowest FLOPs (5.9G) and an mAP gain of 0.88 percentage points, indicating its strength in efficient feature extraction.

The pairwise combinations (D, E, F) gave mixed results. Model E (MLCA+VoVGSCSP) reached an mAP of 72.85%, whereas Models D and F (70.94% and 71.12%) stayed close to the single-module results, and Models D and E ran at roughly half the baseline frame rate (125.8 and 121.0 FPS). Our full model, YOLOv11-UR, which integrates all three improvements, achieves the best overall performance. It attains the highest F1-Score (71.40%) and mAP (73.34%) in Table 3, a gain of 3.09 percentage points in mAP over the baseline, and its Recall (66.60%) is within about 0.6 points of the best value (67.21%, Model E). It accomplishes this while reducing parameters by 0.48% and FLOPs by 6.35% and maintaining an inference speed of 237.9 FPS. This confirms that the proposed modifications balance accuracy and efficiency for UAV-based inspection tasks in China.
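
The headline figures for the full model can be re-derived from the rounded entries in Table 3. The short sketch below does this with hypothetical helper names. Note that the parameter reduction computed from the rounded table values (about 0.39%) differs slightly from the 0.48% quoted in the text, which was presumably computed from unrounded parameter counts.

```python
# Rounded values copied from Table 3.
baseline = {"params_m": 2.58, "flops_g": 6.3, "fps": 247.4, "map": 70.25}
full_model = {"params_m": 2.57, "flops_g": 5.9, "fps": 237.9, "map": 73.34}


def point_gain(new, old):
    """Absolute difference in percentage points."""
    return new - old


def relative_reduction(new, old):
    """Relative decrease of `new` with respect to `old`, in percent."""
    return 100.0 * (old - new) / old


map_gain = point_gain(full_model["map"], baseline["map"])
flops_cut = relative_reduction(full_model["flops_g"], baseline["flops_g"])
params_cut = relative_reduction(full_model["params_m"], baseline["params_m"])
fps_cut = relative_reduction(full_model["fps"], baseline["fps"])

print(f"mAP gain: {map_gain:.2f} points")
print(f"FLOPs reduction: {flops_cut:.2f}%")
print(f"Params reduction (from rounded values): {params_cut:.2f}%")
print(f"FPS reduction: {fps_cut:.2f}%")
```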

Comparative Study with State-of-the-Art Models

To evaluate YOLOv11-UR more broadly, we compared it against several prominent lightweight CNN-based detectors (YOLOv5n, YOLOv8n, YOLOv12n) and a representative efficient Transformer-based detector (RT-DETR-R18). All models were trained and tested on our UAV highway dataset under identical conditions. The results are presented in Table 4.

Model	Params (M)	FLOPs (G)	Size (MB)	FPS	P (%)	R (%)	F1 (%)	mAP (%)
YOLOv5n	2.51	7.1	5.0	264.2	71.29	56.04	61.82	63.46
YOLOv8n	3.01	8.1	6.0	270.3	72.42	61.68	65.54	67.63
YOLOv12n	2.51	5.8	5.2	178.3	66.64	62.34	64.05	65.65
YOLOv11n (Baseline)	2.58	6.3	5.3	247.4	77.76	63.43	69.14	70.25
RT-DETR-R18	19.89	57.0	38.6	107.7	79.26	69.25	73.71	72.67
YOLOv11-UR (Ours)	2.57	5.9	5.3	237.9	78.26	66.60	71.40	73.34

Among models with comparable parameter counts (roughly 2.5M to 3.0M), YOLOv11-UR outperforms all others in Precision, Recall, F1-Score, and mAP. It surpasses the strong baseline YOLOv11n by 3.09 percentage points in mAP. Compared to the larger RT-DETR-R18 model (19.89M parameters), YOLOv11-UR achieves a slightly higher mAP (73.34% vs. 72.67%) while being far more efficient, although RT-DETR-R18 retains higher Precision, Recall, and F1-Score. Our model has only 12.9% of the parameters, 10.4% of the FLOPs, and 13.7% of the file size, and delivers 2.2 times the FPS of RT-DETR-R18. YOLOv11-UR therefore offers competitive accuracy with the lightweight footprint needed for real-time processing on UAV platforms or edge devices.
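
The efficiency ratios against RT-DETR-R18 follow directly from Table 4. The following sketch, with hypothetical variable names, reproduces them from the rounded table entries.

```python
# Rounded values copied from Table 4.
ours = {"params_m": 2.57, "flops_g": 5.9, "size_mb": 5.3, "fps": 237.9}
rt_detr = {"params_m": 19.89, "flops_g": 57.0, "size_mb": 38.6, "fps": 107.7}

# Cost metrics expressed as a percentage of the RT-DETR-R18 value.
cost_share = {key: 100.0 * ours[key] / rt_detr[key]
              for key in ("params_m", "flops_g", "size_mb")}

# Throughput expressed as a speed-up factor.
fps_ratio = ours["fps"] / rt_detr["fps"]

for key, share in cost_share.items():
    print(f"{key}: {share:.1f}% of RT-DETR-R18")
print(f"FPS: {fps_ratio:.1f}x RT-DETR-R18")
```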

Visualization and Practical Application

Visual comparisons further illustrate the strengths of YOLOv11-UR. In complex scenes with dense small targets (e.g., clustered potholes and patches) or background clutter (e.g., tree shadows on pavement), baseline models like YOLOv11n often produce redundant bounding boxes, false positives, or miss detections altogether. For instance, shadows might be misclassified as patches, or multiple overlapping boxes might be predicted for a single crack. In contrast, YOLOv11-UR demonstrates more precise and robust detection. It accurately locates targets like missing road markings and small potholes with well-defined bounding boxes and higher confidence scores, showing fewer instances of false alarms or missed targets. The integration of MLCA likely enhances focus on relevant features amidst clutter, while the efficient neck design preserves detailed information for small objects.

A field validation was conducted on a highway in Nanjing, China, using a UAV for live capture and processing. The YOLOv11-UR model, deployed on a companion computing device, identified various pavement distresses and infrastructure issues in real time, demonstrating its practical applicability for daily UAV highway inspection routines. The model’s balance of speed and accuracy enables timely detection and reporting, which is crucial for effective maintenance decision-making.

Conclusion

This study addresses the challenges of multi-object detection for daily highway maintenance inspection using UAVs in China. The primary contributions are threefold. First, a dedicated aerial image dataset was constructed, encompassing 14 categories of pavement distress and infrastructure defects relevant to Chinese highway standards. Second, an improved lightweight detection algorithm, YOLOv11-UR, was proposed. It incorporates a Mixed Local Channel Attention (MLCA) mechanism to enhance feature representation, uses GSConv for lightweight processing in the neck, and employs a VoVGSCSP module for efficient feature aggregation. Third, comprehensive experiments validated the model’s efficacy.

The ablation study confirmed the contribution of each component, with the full model achieving a 3.09 percentage-point mAP improvement over the YOLOv11n baseline while reducing parameters and FLOPs. Comparative experiments showed that YOLOv11-UR outperforms other lightweight models (YOLOv5n, YOLOv8n, YOLOv12n) and achieves a higher mAP (73.34%) than the much larger RT-DETR-R18 model, all while maintaining high inference speed (237.9 FPS). This makes it well suited to real-time analysis on UAV platforms. Visualization results further demonstrated its robustness in handling small targets and complex backgrounds, reducing false positives and missed detections.

In conclusion, the proposed YOLOv11-UR algorithm balances detection accuracy and computational efficiency, meeting the core requirements for automated, real-time UAV highway inspection in China. Future work will focus on expanding the dataset to cover more diverse scenarios and weather conditions, exploring knowledge distillation or quantization for further model compression, and integrating the detection system into a full workflow for automated condition assessment and reporting, thereby enhancing the intelligence level of highway maintenance operations in China.