The low-altitude economy has become a national strategic priority, and its rapid development is driving the integration of Unmanned Aerial Vehicle (UAV) technology into critical fields such as smart infrastructure monitoring. In modern railway systems, the rail is the fundamental component supporting train operations, and its integrity is essential to transportation safety, efficiency, and reliability. As railways move toward high-speed and heavy-haul operation, rails are subjected to increasing stress and fatigue, which raises the probability of defects such as wear, cracks, and missing fasteners. Traditional manual inspection is slow, costly, and can be hazardous. UAV-based railway inspection offers extensive coverage, high flexibility, and reduced operational risk. A core challenge remains, however: running accurate, real-time defect identification on the computationally constrained platforms that UAVs typically carry. Existing deep learning models often have large parameter counts, which leads to high computational demands, and their precision degrades in complex real-world scenes. This study therefore proposes an improved VGG16 network model, termed VGG16-Rf, designed for efficient, high-precision rail defect identification in UAV inspection scenarios.

The proposed VGG16-Rf model modifies the classical VGG16 architecture to balance model size against feature extraction capability. It integrates depthwise separable convolutions, an attention mechanism, global average pooling, and a Random Forest classifier, with the aim of meeting the stringent requirements of edge computing on a UAV platform.
The core feature extraction stage is enhanced through two primary techniques. First, the standard convolutional layers Conv1_2 and Conv2_2 in the early blocks are replaced with Depthwise Separable Convolution (DSC). A standard convolution with an input of dimensions $H_{in} \times W_{in} \times C_{in}$, a kernel of size $K_h \times K_w$, and an output of dimensions $H_{out} \times W_{out} \times C_{out}$ has a computational cost proportional to: $$Cost_{std} = H_{out} \times W_{out} \times C_{in} \times C_{out} \times K_h \times K_w$$ In contrast, DSC decomposes this operation into a depthwise convolution followed by a pointwise convolution. The depthwise convolution applies a single filter per input channel: $$Cost_{depthwise} = H_{out} \times W_{out} \times C_{in} \times K_h \times K_w$$ The pointwise convolution (a $1 \times 1$ convolution) then combines these channels: $$Cost_{pointwise} = H_{out} \times W_{out} \times C_{in} \times C_{out}$$ The total cost is: $$Cost_{dsc} = H_{out} \times W_{out} \times C_{in} \times (K_h \times K_w + C_{out})$$ The ratio of the two costs is: $$R = \frac{Cost_{dsc}}{Cost_{std}} = \frac{1}{C_{out}} + \frac{1}{K_h \times K_w}$$ For the $3 \times 3$ kernels used in VGG16 and $C_{out} = 64$, $R = 1/64 + 1/9 \approx 0.127$, which is roughly an eightfold reduction in multiply-accumulate operations for the replaced layers and suits resource-limited UAV systems.
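The cost formulas above can be checked numerically. The following is a minimal sketch, not the authors' code; the layer shapes assume the classical VGG16 configuration with a 224×224 input, which the text does not state explicitly, and the function names are illustrative.

```python
def conv_costs(h_out, w_out, c_in, c_out, k_h=3, k_w=3):
    """Return (standard, depthwise separable) multiply-accumulate counts."""
    std = h_out * w_out * c_in * c_out * k_h * k_w
    depthwise = h_out * w_out * c_in * k_h * k_w
    pointwise = h_out * w_out * c_in * c_out
    return std, depthwise + pointwise


def cost_ratio(c_out, k_h=3, k_w=3):
    """Closed-form ratio R = 1/C_out + 1/(K_h * K_w)."""
    return 1.0 / c_out + 1.0 / (k_h * k_w)


# Assumed shapes (H_out, W_out, C_in, C_out) for the two replaced layers.
layers = {
    "Conv1_2": (224, 224, 64, 64),
    "Conv2_2": (112, 112, 128, 128),
}

for name, (h, w, c_in, c_out) in layers.items():
    std, dsc = conv_costs(h, w, c_in, c_out)
    print(f"{name}: std={std:,} dsc={dsc:,} "
          f"ratio={dsc / std:.4f} closed_form={cost_ratio(c_out):.4f}")
```

The explicit count and the closed-form ratio agree because $H_{out} \times W_{out} \times C_{in}$ cancels in the quotient.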
Second, to sharpen the model's focus on salient defect features against cluttered rail backgrounds, the Convolutional Block Attention Module (CBAM) is inserted after the convolutional blocks in the higher-level feature extraction stages (Conv3, Conv4, Conv5). CBAM infers attention maps sequentially along the channel and spatial dimensions. Given an intermediate feature map $\mathbf{F} \in \mathbb{R}^{C \times H \times W}$, the channel attention map $\mathbf{M}_c \in \mathbb{R}^{C \times 1 \times 1}$ is computed by feeding spatially average-pooled and max-pooled features through a shared multi-layer perceptron: $$\mathbf{M}_c(\mathbf{F}) = \sigma(MLP(AvgPool(\mathbf{F})) + MLP(MaxPool(\mathbf{F})))$$ where $\sigma$ is the sigmoid function. The refined feature is $\mathbf{F}' = \mathbf{M}_c(\mathbf{F}) \otimes \mathbf{F}$, where $\otimes$ denotes element-wise multiplication with the attention map broadcast across the remaining dimensions. Subsequently, the spatial attention map $\mathbf{M}_s \in \mathbb{R}^{1 \times H \times W}$ is generated by average-pooling and max-pooling $\mathbf{F}'$ along the channel axis, concatenating the two resulting maps, and applying a convolution $f^{7 \times 7}$ with a $7 \times 7$ kernel: $$\mathbf{M}_s(\mathbf{F}') = \sigma(f^{7 \times 7}([AvgPool(\mathbf{F}'); MaxPool(\mathbf{F}')]))$$ The final output is $\mathbf{F}'' = \mathbf{M}_s(\mathbf{F}') \otimes \mathbf{F}'$. This mechanism helps a UAV inspection system localize small defects such as missing fasteners.
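The two attention steps can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions rather than the model's actual implementation: the MLP is taken to have one ReLU hidden layer with reduction ratio `r` and no biases, the weights are random placeholders, and the 7×7 convolution is written as a plain zero-padded cross-correlation loop for readability.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def channel_attention(f, w1, w2):
    """M_c(F) for f of shape (C, H, W); returns shape (C, 1, 1).

    w1: (C // r, C) and w2: (C, C // r) are the shared MLP weights.
    """
    avg = f.mean(axis=(1, 2))  # spatial average pooling -> (C,)
    mx = f.max(axis=(1, 2))    # spatial max pooling -> (C,)

    def mlp(v):
        return w2 @ np.maximum(w1 @ v, 0.0)  # assumed ReLU hidden layer

    return sigmoid(mlp(avg) + mlp(mx))[:, None, None]


def spatial_attention(f, kernel):
    """M_s(F') for f of shape (C, H, W); kernel: (2, k, k); returns (1, H, W)."""
    pooled = np.stack([f.mean(axis=0), f.max(axis=0)])  # channel-wise pooling -> (2, H, W)
    k = kernel.shape[-1]
    pad = k // 2
    padded = np.pad(pooled, ((0, 0), (pad, pad), (pad, pad)))  # zero padding
    _, h, w = pooled.shape
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(padded[:, i:i + k, j:j + k] * kernel)
    return sigmoid(out)[None, :, :]


def cbam(f, w1, w2, kernel):
    f1 = channel_attention(f, w1, w2) * f      # F' = M_c(F) * F
    return spatial_attention(f1, kernel) * f1  # F'' = M_s(F') * F'


rng = np.random.default_rng(0)
C, H, W, r = 16, 8, 8, 4  # toy sizes; r is an assumed reduction ratio
feat = rng.standard_normal((C, H, W))
w1 = rng.standard_normal((C // r, C)) * 0.1
w2 = rng.standard_normal((C, C // r)) * 0.1
kernel = rng.standard_normal((2, 7, 7)) * 0.1

out = cbam(feat, w1, w2, kernel)
print(out.shape)  # (16, 8, 8)
```

Because both attention maps pass through a sigmoid, every output element is the corresponding input element scaled by a factor between 0 and 1, so the module reweights features without changing their shape.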
Following the convolutional and attention layers, the model employs Global Average Pooling (GAP) for dimensionality reduction. For a feature map $\mathbf{F}'' \in \mathbb{R}^{C \times H \times W}$, GAP computes the spatial average for each channel $c$: $$v_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} F''_{c}(i, j)$$ This results in a compact feature vector $\mathbf{v} \in \mathbb{R}^{C}$, which replaces the bulky fully connected layers of the original VGG16, drastically reducing parameters and mitigating overfitting.
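As a small illustration of the pooling step and of why the fully connected layers dominate the parameter count, consider the sketch below. The parameter figures assume the classical ImageNet VGG16 head (a 7×7×512 input followed by dense layers of 4096, 4096, and 1000 units); the baseline used in this study may use a different output size.

```python
import numpy as np


def global_average_pool(f):
    """Map a (C, H, W) feature map to a length-C vector of per-channel means."""
    return f.mean(axis=(1, 2))


def dense_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


# Toy feature map with C=2, H=2, W=3.
toy = np.arange(12, dtype=float).reshape(2, 2, 3)
v = global_average_pool(toy)
print(v)  # [2.5 8.5]

# Assumed classical VGG16 head: 25088 -> 4096 -> 4096 -> 1000.
fc_head = (dense_params(7 * 7 * 512, 4096)
           + dense_params(4096, 4096)
           + dense_params(4096, 1000))
print(f"{fc_head:,}")  # 123,642,856
```

GAP itself has no trainable parameters, so under these assumptions the classifier input shrinks from 25,088 values to the channel count $C$ at no parameter cost.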
The final classification stage replaces the traditional Softmax layer with a Random Forest (RF) classifier. An RF is an ensemble of $T$ decision trees $\{h(\mathbf{v}, \Theta_t)\}_{t=1}^{T}$, where $\Theta_t$ characterizes the $t$-th tree. Each tree is grown on a bootstrap sample $D_t$ drawn from the original training set $D$, and at each split node only a random subset of the features in $\mathbf{v}$ is considered. The final prediction for a defect class is made by majority voting: $$H(\mathbf{v}) = \arg\max_{y} \sum_{t=1}^{T} I(h(\mathbf{v}, \Theta_t) = y)$$ where $I(\cdot)$ is the indicator function. This ensemble method offers high accuracy and robustness to noise, and it has a far smaller parameter footprint than large fully connected layers, which makes it well suited to a model intended for deployment on a UAV.
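The bootstrap and voting steps can be illustrated with the standard library alone. This is a sketch of the two formulas above, not of tree training; the label strings and the hard-coded votes are hypothetical stand-ins for the outputs of $T = 5$ trained trees.

```python
import random
from collections import Counter


def bootstrap_sample(dataset, rng):
    """Draw len(dataset) items with replacement (the sample D_t for one tree)."""
    return [rng.choice(dataset) for _ in dataset]


def majority_vote(tree_predictions):
    """H(v): the label predicted by the most trees.

    Ties are resolved in favor of the label that appears first in the list.
    """
    return Counter(tree_predictions).most_common(1)[0][0]


rng = random.Random(0)
training_set = list(range(10))  # placeholder sample indices
d_t = bootstrap_sample(training_set, rng)
print(len(d_t))  # 10

# Hypothetical per-tree predictions for one feature vector v.
votes = ["fracture", "light_wear", "fracture", "depression", "fracture"]
print(majority_vote(votes))  # fracture
```

In practice a library implementation such as a random forest from a machine learning toolkit would handle sampling, feature subsetting, and voting together; the sketch only isolates the aggregation rule.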
The proposed model was evaluated on a dedicated rail defect dataset that combines publicly available images with images collected by a quadcopter UAV over railway sections. The dataset covers seven defect types. To simulate conditions encountered during UAV inspection, such as varying lighting and motion blur, data augmentation was applied, expanding the 2,170 original images fivefold to 10,850 images. The augmented dataset was split 80/20 into training and test sets.

| Defect Type | Original Images | Augmented Images | Train Set | Test Set |
|---|---:|---:|---:|---:|
| Light Wear | 351 | 1,755 | 1,404 | 351 |
| Moderate Wear | 310 | 1,550 | 1,240 | 310 |
| Severe Wear | 319 | 1,595 | 1,276 | 319 |
| Fracture | 297 | 1,485 | 1,188 | 297 |
| Depression | 325 | 1,625 | 1,300 | 325 |
| Missing Fastener | 271 | 1,355 | 1,084 | 271 |
| Missing Bolt | 297 | 1,485 | 1,188 | 297 |
| Total | 2,170 | 10,850 | 8,680 | 2,170 |

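
The counts in the table can be reproduced with a few lines of arithmetic. The fivefold augmentation factor and the per-class 80/20 split are inferred from the table rather than stated as a procedure, so the sketch below is a consistency check, not a description of the actual data pipeline.

```python
# Original image counts per defect type, taken from the table.
originals = {
    "Light Wear": 351,
    "Moderate Wear": 310,
    "Severe Wear": 319,
    "Fracture": 297,
    "Depression": 325,
    "Missing Fastener": 271,
    "Missing Bolt": 297,
}

AUG_FACTOR = 5               # inferred: augmented count / original count
TRAIN_NUM, TRAIN_DEN = 4, 5  # 80/20 split expressed as an integer ratio

rows = {}
for name, n in originals.items():
    augmented = n * AUG_FACTOR
    train = augmented * TRAIN_NUM // TRAIN_DEN
    rows[name] = (augmented, train, augmented - train)

totals = tuple(sum(col) for col in zip(*rows.values()))
print(totals)  # (10850, 8680, 2170)
```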
Experiments were conducted on a high-performance workstation and validated on an NVIDIA Jetson Nano to simulate edge deployment. Key evaluation metrics included Precision ($P$), Recall ($R$), F1-score, model size, and inference speed (FPS). $$P = \frac{TP}{TP + FP}, \quad R = \frac{TP}{TP + FN}, \quad F1 = \frac{2 \times P \times R}{P + R}$$ where $TP$, $FP$, and $FN$ are True Positives, False Positives, and False Negatives, respectively.
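The three metrics can be computed directly from the confusion counts of one class. The sketch below uses hypothetical counts chosen only for illustration; the guards against division by zero are an added convention, not part of the definitions above.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (P, R, F1) for one class; a metric with a zero denominator is 0.0."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


# Hypothetical counts for a single defect class.
p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
print(f"P={p:.3f} R={r:.3f} F1={f1:.3f}")  # P=0.900 R=0.750 F1=0.818
```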
Ablation studies were performed to isolate the contribution of each component of the VGG16-Rf model. The results show the incremental effect of each design choice.

| Scheme | Components | Avg. Precision (%) | Avg. Recall (%) | Model Size (MB) | FPS |
|---|---|---:|---:|---:|---:|
| 1 | VGG16 (Baseline) | 70.3 | 69.7 | 512.3 | 184 |
| 2 | + DSC | 71.4 | 72.3 | 512.3 | 185 |
| 3 | + CBAM | 79.6 | 78.5 | 514.6 | 174 |
| 5 | + DSC + CBAM | 80.6 | 81.2 | 514.6 | 176 |
| 8 | + DSC + CBAM + GAP + FC | 85.8 | 87.0 | 514.6 | 174 |
| 9 (Proposed) | + DSC + CBAM + GAP + Random Forest (VGG16-Rf) | 95.2 | 94.7 | 57.2 | 165 |

The final ablation step (Scheme 9) shows the impact of replacing the fully connected layers with a Random Forest classifier. Relative to the baseline, precision rises by 24.9 percentage points (from 70.3% to 95.2%) and recall by 25.0 percentage points (from 69.7% to 94.7%), while model size falls by 88.83% (from 512.3 MB to 57.2 MB). Inference speed drops moderately, from 184 to 165 FPS. This combination makes the VGG16-Rf model well suited to UAV deployment.
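The headline figures follow from the first and last rows of the ablation table. The short check below distinguishes absolute gains in percentage points from relative changes in percent; the dictionary layout is only an illustrative way to hold the table values.

```python
# Values copied from the ablation table (Scheme 1 and Scheme 9).
baseline = {"precision": 70.3, "recall": 69.7, "size_mb": 512.3, "fps": 184}
proposed = {"precision": 95.2, "recall": 94.7, "size_mb": 57.2, "fps": 165}

# Absolute gains, in percentage points.
precision_gain = proposed["precision"] - baseline["precision"]
recall_gain = proposed["recall"] - baseline["recall"]

# Relative changes, in percent.
size_reduction = (1 - proposed["size_mb"] / baseline["size_mb"]) * 100
fps_change = (proposed["fps"] / baseline["fps"] - 1) * 100

print(f"precision +{precision_gain:.1f} pts, recall +{recall_gain:.1f} pts")
print(f"size -{size_reduction:.2f}%, fps {fps_change:.1f}%")
# precision +24.9 pts, recall +25.0 pts
# size -88.83%, fps -10.3%
```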
Comparative experiments against other state-of-the-art and lightweight models further validate the superiority of VGG16-Rf. The models were evaluated on both a powerful GPU (RTX 4060) and the edge device (Jetson Nano).

| Model | Avg. Precision (%) | Avg. Recall (%) | Model Size (MB) | FPS (Jetson Nano) |
|---|---:|---:|---:|---:|
| InceptionV3 | 76.1 | 77.2 | 87.9 | 25 |
| ResNet34 | 77.9 | 75.2 | 80.1 | 34 |
| MobileNetV2 | 73.6 | 72.2 | 13.4 | 41 |
| ShuffleNetV2 | 74.1 | 75.3 | 4.79 | 63 |
| VGG16-Rf (Proposed) | 95.2 | 94.7 | 57.2 | 33 |

The proposed VGG16-Rf model achieves the highest precision (95.2%) and recall (94.7%) of all compared models, exceeding the next best values (77.9% precision for ResNet34 and 77.2% recall for InceptionV3) by more than 17 percentage points. Lightweight models run faster on the edge device, with MobileNetV2 at 41 FPS and ShuffleNetV2 at 63 FPS, but their precision and recall remain between 72% and 76%. At 33 FPS on the Jetson Nano, VGG16-Rf offers a favorable accuracy-efficiency trade-off for autonomous UAV inspection, where reliable detection is critical.
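The margins quoted above can be read off the comparison table programmatically. The tuple layout (precision, recall, Jetson Nano FPS) is only an illustrative encoding of the table rows.

```python
# (avg. precision %, avg. recall %, FPS on Jetson Nano), from the comparison table.
competitors = {
    "InceptionV3": (76.1, 77.2, 25),
    "ResNet34": (77.9, 75.2, 34),
    "MobileNetV2": (73.6, 72.2, 41),
    "ShuffleNetV2": (74.1, 75.3, 63),
}
proposed = (95.2, 94.7, 33)

best_precision = max(v[0] for v in competitors.values())
best_recall = max(v[1] for v in competitors.values())
precision_margin = round(proposed[0] - best_precision, 1)
recall_margin = round(proposed[1] - best_recall, 1)
print(precision_margin, recall_margin)  # 17.3 17.5

# Models that run faster than VGG16-Rf on the edge device.
faster = sorted(name for name, v in competitors.items() if v[2] > proposed[2])
print(faster)  # ['MobileNetV2', 'ResNet34', 'ShuffleNetV2']
```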
In conclusion, this study developed and validated an improved VGG16 model, VGG16-Rf, for automated rail defect identification in UAV-based inspection systems. By integrating depthwise separable convolutions, a convolutional block attention module, global average pooling, and a Random Forest classifier, the model balances high accuracy against operational efficiency. The experimental results show gains of 24.9 percentage points in precision and 25.0 percentage points in recall over the original VGG16, alongside an 88.83% reduction in model size. The model also outperforms InceptionV3 and ResNet34 by more than 17 percentage points in both precision and recall. Finally, it runs at 33 FPS on the NVIDIA Jetson Nano, which supports its practical deployability for on-board processing in UAV railway inspection platforms. This work provides an efficient algorithmic foundation for improving the safety and intelligence of railway infrastructure maintenance with UAV technology.
