
The long-term service of railway steel bridges inevitably subjects their protective coatings to degradation such as blistering, peeling, and corrosion. Traditional inspection relies on manual visual checks, often conducted under difficult and hazardous conditions, which makes it both inefficient and risky for inspectors. Drone-based inspection offers a safer, more efficient, and more scalable way to capture high-resolution aerial imagery. However, the automated analysis of these drone images is computationally challenging. Coating defects often appear as small, discrete, irregularly shaped targets within very large, high-resolution scenes. Their edges can be indistinct, blending into the intact coating, and the imagery is frequently cluttered with background elements unrelated to the coating. Vision-based systems for this task therefore need to support precise detection, segmentation, and quantitative assessment.
In this work, we propose a comprehensive framework for the detection and evaluation of steel bridge coating defects from unmanned drone imagery. Our system is designed to overcome the specific challenges of this domain through a multi-stage pipeline involving intelligent preprocessing, a novel deep learning architecture with an integrated attention mechanism, a dedicated post-processing algorithm for aggregating discrete findings, and a final quantitative assessment module.
1. System Framework and Preprocessing
The overarching goal of our system is to transform raw, complex unmanned drone imagery into actionable, quantified assessments of coating health. The process is summarized by the following workflow equation, where $I_{raw}$ is the raw image and $A$ is the final assessment:
$$
A = \mathcal{E}(\mathcal{P}(\mathcal{M}(\mathcal{P}_r(I_{raw}))))
$$
Here, $\mathcal{P}_r$ denotes the initial preprocessing stage, $\mathcal{M}$ represents the core deep learning model for segmentation, $\mathcal{P}$ is the discrete defect aggregation post-processing, and $\mathcal{E}$ is the final evaluation module.
1.1 Coating Foreground Extraction and Image Tiling
The first critical step, $\mathcal{P}_r$, addresses the issue of complex background interference. Directly processing full unmanned drone images containing large areas of sky, tracks, rivers, and other non-coating structures drastically reduces the effective signal (defect pixels) for the model. We employ a semi-automatic method to extract the coating foreground region. A small subset of images is manually annotated to define the coating areas. This data is then used to train a preliminary segmentation model, whose predictions are refined and expanded to automatically generate foreground masks for the entire dataset. The result is a set of images cropped to the relevant coating regions, significantly suppressing irrelevant background noise.
Even within the extracted coating region, defects often occupy an extremely small fraction of the total pixels, presenting a classic small-object detection challenge. To amplify the relative proportion of defect pixels and enhance feature visibility, we implement a fixed-size, non-overlapping tiling strategy. The large coating image $I_{coat}$ of dimensions $H \times W$ is divided into $M \times N$ sub-images $S_{i,j}$, each of size $h \times w$:
$$
S_{i,j} = I_{coat}[i \cdot h : (i+1) \cdot h, \quad j \cdot w : (j+1) \cdot w], \quad i \in [0, M), j \in [0, N)
$$
This preprocessing step $\mathcal{P}_r$ ensures that the subsequent deep learning model receives inputs where potential defects are more prominent, thereby improving learning efficacy and detection sensitivity for targets captured by the unmanned drone.
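The tiling rule above can be illustrated with a short index computation. This is a minimal sketch, not the implementation used in this work: the function name `tile_slices` and the 640-pixel tile size are hypothetical, and it assumes that only complete tiles are kept ($M = \lfloor H/h \rfloor$, $N = \lfloor W/w \rfloor$), since the handling of partial edge tiles is not specified here.

```python
from typing import Dict, Tuple


def tile_slices(H: int, W: int, h: int, w: int) -> Dict[Tuple[int, int], Tuple[slice, slice]]:
    """Map each tile index (i, j) to the row and column slices of S_{i,j}.

    Assumption: only complete tiles are kept, so M = H // h and N = W // w.
    """
    if h <= 0 or w <= 0:
        raise ValueError("tile size must be positive")
    M, N = H // h, W // w
    return {
        (i, j): (slice(i * h, (i + 1) * h), slice(j * w, (j + 1) * w))
        for i in range(M)
        for j in range(N)
    }


# Hypothetical example: a 1280 x 1920 coating image cut into 640 x 640 tiles.
tiles = tile_slices(H=1280, W=1920, h=640, w=640)
rows, cols = tiles[(1, 2)]  # slices selecting the tile in row 1, column 2
```

With an array-like image that supports two-dimensional slicing (for example a NumPy array), `image[rows, cols]` would then select the tile $S_{1,2}$.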
2. ARSNet: The Core Segmentation Model
At the heart of our system lies $\mathcal{M}$, the Accurate Refinement Segmentation Network (ARSNet). We design ARSNet to balance high precision in segmenting irregular defect boundaries with computational efficiency suitable for processing streams of unmanned drone imagery. The model architecture integrates a powerful Feature-edge and Scale-aware Attention Mechanism (FESAM) to tackle the problems of insignificant edges and multi-scale defects.
2.1 Network Architecture Overview
ARSNet follows an encoder-decoder style structure common to instance segmentation networks, but with optimized components. The Backbone is responsible for feature extraction and consists of stacked modules including C3K2 (a lightweight, residual-influenced convolutional block), SPPF (Spatial Pyramid Pooling Fast) for multi-scale context aggregation, and C2PSA (Cross-Stage Partial Self-Attention) to enhance feature representation through channel attention. The Neck, or feature pyramid network, incorporates our proposed FESAM modules at multiple scales to refine the feature maps before they are passed to the Head. The Head has parallel branches for bounding box prediction (object detection) and pixel-wise mask prediction (instance segmentation).
2.2 Feature-edge and Scale-aware Attention Mechanism (FESAM)
The FESAM module is a dual-branch attention mechanism that jointly models channel-wise and spatial relationships, with a specific design to capture edge details and adapt to varying target scales—crucial for analyzing unmanned drone images where defect size can vary dramatically.
Channel Attention Branch: This branch focuses on “what” is meaningful. Given an input feature map $F_1 \in \mathbb{R}^{C \times H \times W}$, it uses a parameter-efficient approach combining global average pooling and a 1D convolution to generate a channel attention vector. The process is defined as:
$$
F_2 = \sigma \left( \text{Conv1D} \left( \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} F_1(i, j) \right) \right) \otimes F_1
$$
Here, $\sigma$ is the sigmoid function, Conv1D denotes a one-dimensional convolution, and $\otimes$ signifies channel-wise multiplication. This produces an intermediate feature map $F_2$ enhanced with channel-specific importance.
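The channel branch can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions rather than the trained module: the function name `channel_attention` is hypothetical, the 1D kernel is passed in explicitly (in the network it would be learned), and zero padding is assumed at the ends of the channel axis.

```python
import numpy as np


def channel_attention(F1: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Compute F2 = sigmoid(Conv1D(GAP(F1))) * F1 for F1 of shape (C, H, W).

    Assumptions: `kernel` is an odd-length 1D filter applied along the
    channel axis with zero padding, so the output keeps C channels.
    """
    if len(kernel) % 2 == 0:
        raise ValueError("kernel length must be odd")
    C = F1.shape[0]
    pooled = F1.mean(axis=(1, 2))                 # global average pooling -> (C,)
    padded = np.pad(pooled, len(kernel) // 2)     # zero padding on both ends
    # Cross-correlation over neighbouring channels, as in deep learning frameworks.
    mixed = np.array([padded[c:c + len(kernel)] @ kernel for c in range(C)])
    weights = 1.0 / (1.0 + np.exp(-mixed))        # sigmoid -> per-channel weight
    return weights[:, None, None] * F1            # channel-wise rescaling


# Hypothetical example: 4 channels, 2 x 2 spatial size, fixed smoothing kernel.
F1 = np.ones((4, 2, 2))
F2 = channel_attention(F1, kernel=np.array([0.25, 0.5, 0.25]))
```

The 1D convolution lets each channel weight depend only on a few neighbouring channels, which is what keeps the parameter count small compared with a fully connected layer.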
Spatial Attention Branch: This branch focuses on “where” the informative regions are, with emphasis on multi-scale edge perception. It takes $F_2$ as input. First, the average and the maximum of $F_2$ over the channel dimension are concatenated to form a spatial descriptor $\tilde{F}_2 \in \mathbb{R}^{2 \times H \times W}$, where $F_2^{(k)}$ denotes the $k$-th channel of $F_2$:
$$
\tilde{F}_2 = \text{Concat}\left( \frac{1}{C}\sum_{k=1}^{C} F_2^{(k)}, \quad \max_{1 \leq k \leq C} F_2^{(k)} \right)
$$
To capture features at multiple receptive fields (simulating different defect scales in the drone view), $\tilde{F}_2$ is processed by $N$ parallel convolutional layers with different dilation rates $r_n$, yielding multi-scale features $\hat{F}_n$:
$$
\hat{F}_n = \text{Conv}_{dilation=r_n}(\tilde{F}_2), \quad \text{for } n = 1, \dots, N
$$
A channel shuffle operation is applied to features from different branches to encourage cross-scale information exchange. These features are then concatenated and fused via a $1\times1$ convolution, followed by a sigmoid activation to generate the final spatial attention map. The output $F_3$ of FESAM is computed with a residual connection:
$$
F_3 = F_2 + \sigma\left(\text{Conv}_{1\times1}(\text{Concat}(\hat{F}_1, \hat{F}_2, \dots, \hat{F}_N))\right) \odot F_2
$$
where $\odot$ denotes element-wise multiplication. By embedding FESAM into the feature pyramid, ARSNet becomes highly sensitive to the fine edge details of coating defects and robust to their size variations.
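The data flow of the spatial branch can be sketched in NumPy as follows. This is a hedged illustration, not the ARSNet implementation: the function names, the $3\times3$ kernel size, the dilation rates (1, 2, 3), the two output channels per branch, and the choice to apply the channel shuffle to the concatenated branch outputs are all assumptions, and the random weights stand in for learned parameters.

```python
import numpy as np


def dilated_conv3x3(x: np.ndarray, weight: np.ndarray, rate: int) -> np.ndarray:
    """3x3 dilated convolution with zero 'same' padding.

    x: (Cin, H, W); weight: (Cout, Cin, 3, 3); returns (Cout, H, W).
    """
    _, H, W = x.shape
    xp = np.pad(x, ((0, 0), (rate, rate), (rate, rate)))
    out = np.zeros((weight.shape[0], H, W))
    for u in range(3):
        for v in range(3):
            patch = xp[:, u * rate:u * rate + H, v * rate:v * rate + W]
            out += np.einsum("oc,chw->ohw", weight[:, :, u, v], patch)
    return out


def channel_shuffle(x: np.ndarray, groups: int) -> np.ndarray:
    """Interleave channels so that neighbouring channels come from different groups."""
    C, H, W = x.shape
    return x.reshape(groups, C // groups, H, W).transpose(1, 0, 2, 3).reshape(C, H, W)


def fesam_spatial(F2, branch_weights, rates, fuse_weight):
    """Compute F3 = F2 + sigmoid(Conv1x1(Concat(F_hat_1..F_hat_N))) * F2."""
    # Spatial descriptor: channel-wise mean and max, shape (2, H, W).
    desc = np.stack([F2.mean(axis=0), F2.max(axis=0)])
    # One dilated convolution per rate r_n.
    feats = [dilated_conv3x3(desc, wt, r) for wt, r in zip(branch_weights, rates)]
    mixed = channel_shuffle(np.concatenate(feats, axis=0), groups=len(rates))
    logits = np.einsum("c,chw->hw", fuse_weight, mixed)   # 1x1 conv to one channel
    attn = 1.0 / (1.0 + np.exp(-logits))                  # spatial attention map
    return F2 + attn[None] * F2                           # residual connection


# Hypothetical example with random stand-in weights.
rng = np.random.default_rng(0)
F2 = rng.random((8, 5, 5))
rates = (1, 2, 3)
branch_weights = [rng.normal(scale=0.1, size=(2, 2, 3, 3)) for _ in rates]
fuse_weight = rng.normal(scale=0.1, size=2 * len(rates))
F3 = fesam_spatial(F2, branch_weights, rates, fuse_weight)
```

Because of the residual term, the output stays between $F_2$ and $2F_2$ element-wise for non-negative inputs, so the attention map can emphasize regions without suppressing any feature entirely.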
2.3 Loss Function
Training ARSNet involves optimizing a composite loss function $\mathcal{L}_{total}$ that guides both detection and segmentation:
$$
\mathcal{L}_{total} = \lambda_{cls}\mathcal{L}_{cls} + \lambda_{box}\mathcal{L}_{CIoU} + \lambda_{dfl}\mathcal{L}_{DFL} + \lambda_{mask}\mathcal{L}_{mask}
$$
The classification loss $\mathcal{L}_{cls}$ is standard binary cross-entropy. The box regression loss uses Complete IoU ($\mathcal{L}_{CIoU}$) for better overlap and aspect ratio alignment, and Distribution Focal Loss ($\mathcal{L}_{DFL}$) for precise boundary localization. The mask loss $\mathcal{L}_{mask}$ is the binary cross-entropy between the predicted mask $M$ and the ground truth mask $M_{gt}$:
$$
\mathcal{L}_{mask} = \text{BCE}(M, M_{gt})
$$
Typical weight values used are $\lambda_{cls}=0.5$, $\lambda_{box}=7.5$, $\lambda_{dfl}=1.5$, and $\lambda_{mask}=0.5$.
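To make the weighting concrete, the composite loss can be written as a small function. This is a sketch for illustration only: the names `LossWeights`, `bce`, and `total_loss` are hypothetical, the individual loss values in the example are made up, and in practice the CIoU and DFL terms would be computed by the detection framework.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class LossWeights:
    """Weights quoted in the text above."""
    cls: float = 0.5
    box: float = 7.5
    dfl: float = 1.5
    mask: float = 0.5


def bce(pred: Sequence[float], target: Sequence[float], eps: float = 1e-7) -> float:
    """Mean binary cross-entropy between mask probabilities and a 0/1 ground truth."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total / len(pred)


def total_loss(l_cls: float, l_ciou: float, l_dfl: float, l_mask: float,
               w: LossWeights = LossWeights()) -> float:
    """Weighted sum of the four loss terms."""
    return w.cls * l_cls + w.box * l_ciou + w.dfl * l_dfl + w.mask * l_mask


# Made-up loss values, purely to show the arithmetic:
# 0.5 * 0.8 + 7.5 * 0.1 + 1.5 * 0.2 + 0.5 * 0.4 = 1.65 (up to floating-point rounding)
example = total_loss(l_cls=0.8, l_ciou=0.1, l_dfl=0.2, l_mask=0.4)
```

With these weights, the box regression term dominates the total for loss values of similar magnitude, which reflects the emphasis on localization.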
3. Post-Processing and Quantitative Evaluation
The outputs from ARSNet per tile, $\mathcal{M}(S_{i,j})$, require consolidation and interpretation to be useful for inspectors reviewing data from an unmanned drone survey.
3.1 Discrete Defect Aggregation Post-Processing ($\mathcal{P}$)
Two aggregation issues must be resolved: 1) multiple small, discrete but proximate defect instances of the same class within a single tile should be treated as a single larger defect area, and 2) a single large defect instance may be split across two or more adjacent tiles by the tiling process.
Our algorithm $\mathcal{P}$ works in two stages. First, within each tile, we merge bounding boxes of the same class with an Intersection over Union (IoU) greater than a very low threshold (e.g., 0.01). This agglomerates closely packed fragments. Second, after mapping all tile-local coordinates back to the global image coordinate system using the known tile indices $(i, j)$ and dimensions $(w, h)$:
$$
x_{global} = x_{local} + j \cdot w, \quad y_{global} = y_{local} + i \cdot h
$$
we analyze defects located near tile borders. For two candidate defect boxes of the same class with global coordinates $(x_1, y_1, x_2, y_2)$ and $(x_3, y_3, x_4, y_4)$, we check for spatial adjacency. For horizontal adjacency (defect split across left-right tiles), the condition is:
$$
(|x_2 - x_3| \leq T_p) \land ((|y_1 - y_3| \leq T_p) \lor (|y_2 - y_4| \leq T_p))
$$
where $T_p$ is a small pixel tolerance (e.g., 5 pixels). A similar condition checks for vertical adjacency. Matched boxes are merged into their minimum enclosing rectangle, and their masks are combined. This step is crucial for providing a coherent and non-redundant defect map from the stitched unmanned drone imagery.
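The second stage (coordinate mapping, horizontal adjacency test, and box merging) can be sketched as follows. This is a minimal illustration under assumptions: the function names are hypothetical, boxes are taken to be `(x_min, y_min, x_max, y_max)` tuples with the left-tile box passed first, the 640-pixel tile size in the example is made up, and mask merging is omitted.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def to_global(box: Box, i: int, j: int, h: int, w: int) -> Box:
    """Shift a tile-local box from tile (i, j) into global image coordinates."""
    x1, y1, x2, y2 = box
    return (x1 + j * w, y1 + i * h, x2 + j * w, y2 + i * h)


def horizontally_adjacent(left: Box, right: Box, tol: float = 5.0) -> bool:
    """Adjacency condition from the text: the right edge of `left` meets the left
    edge of `right`, and the top edges or the bottom edges are aligned within `tol`."""
    x1, y1, x2, y2 = left
    x3, y3, x4, y4 = right
    return abs(x2 - x3) <= tol and (abs(y1 - y3) <= tol or abs(y2 - y4) <= tol)


def enclosing_box(a: Box, b: Box) -> Box:
    """Minimum rectangle enclosing both boxes."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


# Hypothetical example: one defect split across tiles (0, 0) and (0, 1).
h = w = 640
left = to_global((600, 100, 640, 180), i=0, j=0, h=h, w=w)
right = to_global((0, 102, 50, 200), i=0, j=1, h=h, w=w)
if horizontally_adjacent(left, right):
    merged = enclosing_box(left, right)
```

The vertical case would follow by swapping the roles of the x and y coordinates in `horizontally_adjacent`.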
3.2 Defect Deterioration Grade Evaluation ($\mathcal{E}$)
The final step is to assign a quantitative severity grade to each aggregated defect or region. We define the defect area proportion $P_i$ for defect type $i$ as:
$$
P_i = \frac{M_i}{N_{coating}} \times 100\%
$$
where $M_i$ is the total number of pixels classified as defect type $i$ in the global coating foreground, and $N_{coating}$ is the total number of pixels in the coating foreground mask. Based on established engineering standards and adapted for unmanned drone image analysis, we propose the following evaluation criteria:
| Deterioration Grade | Corrosion | Peeling/Blistering |
|---|---|---|
| Minor | $P_i < 0.3\%$ | $P_i < 0.3\%$ |
| Moderate | $0.3\% \leq P_i < 3.0\%$ | $0.3\% \leq P_i < 5.0\%$ |
| Severe | $3.0\% \leq P_i < 5.0\%$ | $5.0\% \leq P_i < 16.0\%$ |
| Critical | $P_i \geq 5.0\%$ | $P_i \geq 16.0\%$ |
This quantitative output provides maintenance crews with clear, actionable priorities directly derived from the autonomous analysis of unmanned drone data.
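The grading rule in the table can be expressed as a threshold lookup. This is an illustrative sketch: the names `THRESHOLDS`, `area_proportion`, and `grade` are hypothetical, and the pixel counts in the example are made up.

```python
from bisect import bisect_right

# Lower bounds (in percent) of the Moderate, Severe, and Critical grades,
# taken from the evaluation criteria table.
THRESHOLDS = {
    "corrosion": (0.3, 3.0, 5.0),
    "peeling": (0.3, 5.0, 16.0),
    "blistering": (0.3, 5.0, 16.0),
}
GRADES = ("Minor", "Moderate", "Severe", "Critical")


def area_proportion(defect_pixels: int, coating_pixels: int) -> float:
    """P_i = M_i / N_coating * 100, in percent."""
    if coating_pixels <= 0:
        raise ValueError("coating foreground must contain at least one pixel")
    return 100.0 * defect_pixels / coating_pixels


def grade(defect_type: str, p: float) -> str:
    """Return the deterioration grade for area proportion `p` (percent).

    bisect_right counts how many lower bounds are <= p, so a value exactly
    on a boundary falls into the higher grade, matching the table.
    """
    return GRADES[bisect_right(THRESHOLDS[defect_type], p)]


# Made-up pixel counts: 4,200 corrosion pixels in a 100,000-pixel coating mask.
p = area_proportion(4_200, 100_000)
result = grade("corrosion", p)
```

The same proportion maps to different grades depending on the defect type, because corrosion uses tighter thresholds than peeling and blistering.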
4. Experimental Analysis and Results
We validate our system on a dataset of railway steel bridge coating images captured by an unmanned drone. The dataset, after preprocessing and tiling, consists of 526 image tiles containing three defect classes: blistering, peeling, and corrosion. The data is split into 418 tiles for training and 108 for testing.
4.1 Ablation Study on FESAM
To demonstrate the efficacy of the FESAM module, we conduct an ablation study by integrating various attention mechanisms into a baseline segmentation model (YOLOv11-seg). Performance is measured by Precision (P), Recall (R), mean Average Precision at IoU=0.5 (mAP50), computational complexity in GFLOPs, and inference speed in Frames Per Second (FPS).
| Model Variant | Detection P (%) | Detection R (%) | Detection mAP50 (%) | Segmentation mAP50 (%) | GFLOPs | FPS |
|---|---|---|---|---|---|---|
| Baseline | 85.3 | 83.0 | 87.6 | 88.2 | 35.3 | 128 |
| + GAM | 89.9 | 85.9 | 90.1 | 89.9 | 39.5 | 92 |
| + CBAM | 88.1 | 83.9 | 89.6 | 89.1 | 35.5 | 125 |
| + PPA | 90.4 | 82.5 | 89.6 | 89.8 | 54.3 | 93 |
| + FESAM (Ours) | 92.2 | 86.1 | 90.7 | 90.3 | 35.3 | 121 |
FESAM achieves the best overall balance: it delivers the highest detection precision (92.2%) and recall (86.1%) and the highest mAP50 for both detection (90.7%) and segmentation (90.3%), at the same reported computational cost as the baseline (35.3 GFLOPs) and with only a modest drop in inference speed (121 vs. 128 FPS). This makes it well suited to processing drone imagery.
4.2 Comparative Study with State-of-the-Art Models
We compare our full ARSNet model (with FESAM) against other leading instance segmentation models on our unmanned drone coating defect dataset.
| Model | Detection mAP50 (%) | Segmentation mAP50 (%) | FPS |
|---|---|---|---|
| YOLOv8-seg | 87.9 | 88.7 | 122 |
| YOLOv9-seg | 87.7 | 88.1 | 105 |
| YOLOv10-seg | 88.2 | 88.4 | 127 |
| YOLOv11-seg | 87.6 | 88.2 | 128 |
| YOLOv12-seg | 85.9 | 86.0 | 101 |
| ARSNet (Ours) | 90.7 | 90.3 | 121 |
On this dataset, ARSNet achieves the highest accuracy among the compared models: its segmentation mAP50 of 90.3% exceeds the next best result (88.7%, YOLOv8-seg) by 1.6 percentage points, and its detection mAP50 of 90.7% exceeds the next best (88.2%, YOLOv10-seg) by 2.5 points. Its inference speed of 121 FPS remains close to that of the fastest model (128 FPS, YOLOv11-seg). These results support the effectiveness of the integrated approach, from preprocessing to the network architecture, for analyzing drone imagery.
4.3 Qualitative Results and System Output
Qualitative visualization confirms the system’s capabilities. The preprocessing stage successfully eliminates distracting backgrounds. ARSNet accurately segments defects with irregular boundaries and varying sizes that are characteristic of unmanned drone perspectives. The post-processing algorithm effectively merges scattered detections of rust spots into a contiguous region and correctly reassembles a large peeling defect that was split across two tiles. The final output provides clean, aggregated defect masks alongside their calculated area proportion and corresponding deterioration grade (e.g., “Corrosion: 4.2%, Grade: Severe”), delivering a concise and actionable report from the raw unmanned drone flight data.
5. Conclusion and Future Work
In this work, we have presented a complete, high-precision system for the detection and evaluation of railway steel bridge coating defects using drone imagery. The system addresses the core challenges of small target size, discrete and irregular defect morphology, subtle edge features, and complex backgrounds by combining targeted preprocessing, a novel deep learning model (ARSNet) with a dedicated attention mechanism (FESAM), and post-processing algorithms for aggregation and quantitative assessment. Experimental results demonstrate higher detection and segmentation accuracy than existing methods at a comparable inference speed.
The current study primarily validates the system on coating surfaces that are relatively accessible to the unmanned drone (top and side views). Future work will focus on enhancing the model’s robustness and generality. This includes expanding the dataset to encompass more challenging perspectives (e.g., oblique angles, undersides of girders), diverse lighting conditions (e.g., strong shadows, low light), and a wider variety of steel bridge types and designs. Furthermore, integrating temporal analysis from repeated unmanned drone surveys to track defect progression over time would provide even greater value for predictive maintenance strategies, solidifying the role of autonomous unmanned drone systems in modern infrastructure asset management.
