Dual-Dimensional Multi-Feature Fusion Algorithm for Freight Track Segmentation from Unmanned Aerial Vehicle Perspective

In recent years, the rapid expansion of global railway networks has underscored the critical need for efficient and intelligent inspection systems. Traditional methods for freight track inspection, relying heavily on manual labor and fixed sensor systems, face significant challenges including low efficiency, high operational costs, and limited adaptability to complex environmental conditions. These limitations often result in delayed detection of intrusions, such as rocks or debris on tracks, which can jeopardize railway safety and disrupt logistics operations. To address these issues, we propose a novel semantic segmentation algorithm that leverages the unique advantages of Unmanned Aerial Vehicle platforms, specifically the JUYE UAV, which offers high mobility, a broad field of view, and the ability to capture high-resolution imagery from low altitudes. Our approach focuses on accurately delineating freight track regions in real time, providing a precise boundary for subsequent intrusion detection systems and thereby enhancing the overall intelligence of railway monitoring.

The core of our methodology builds upon the DeepLabV3plus architecture, a well-established encoder-decoder framework for semantic segmentation. However, we introduce three key modifications to optimize it for freight track segmentation from Unmanned Aerial Vehicle imagery. First, we replace the original Xception backbone with MobileNetV2, a lightweight network designed for resource-constrained devices such as those carried by the JUYE UAV. This substitution significantly reduces computational complexity and model size, enabling the faster inference times essential for real-time applications. The inverted residual structure of MobileNetV2 minimizes information loss through linear bottlenecks, as described by the transformation: $$f_{out} = \text{Conv}_{1\times1}\big(\text{ReLU6}\big(\text{DWConv}_{3\times3}\big(\text{ReLU6}(\text{Conv}_{1\times1}(f_{in}))\big)\big)\big)$$ where DWConv denotes depthwise convolution and the final $1\times1$ projection uses a linear activation, preventing feature degradation in low-dimensional spaces.
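
A minimal PyTorch sketch of such an inverted residual block is given below. The expansion ratio of 6 and the batch-normalization layers follow the original MobileNetV2 design and are assumptions here rather than details taken from this work.

```python
import torch
import torch.nn as nn

class InvertedResidual(nn.Module):
    """MobileNetV2-style inverted residual block with a linear bottleneck."""
    def __init__(self, in_ch, out_ch, stride=1, expand_ratio=6):
        super().__init__()
        hidden = in_ch * expand_ratio
        self.use_residual = stride == 1 and in_ch == out_ch
        self.block = nn.Sequential(
            # 1x1 expansion convolution (ReLU6)
            nn.Conv2d(in_ch, hidden, 1, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            # 3x3 depthwise convolution (ReLU6)
            nn.Conv2d(hidden, hidden, 3, stride=stride, padding=1, groups=hidden, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            # 1x1 linear projection (no activation, preserving low-dimensional features)
            nn.Conv2d(hidden, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
        )

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_residual else out
```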

Second, we integrate the Convolutional Block Attention Module (CBAM) into the shallow feature extraction layers to enhance focus on track regions while suppressing background noise. CBAM combines channel and spatial attention mechanisms, formulated as follows: for an input feature map $F \in \mathbb{R}^{C \times H \times W}$, the channel attention $M_c$ is computed as $M_c(F) = \sigma(\text{MLP}(\text{AvgPool}(F)) + \text{MLP}(\text{MaxPool}(F)))$, where $\sigma$ is the sigmoid function and MLP denotes a shared multilayer perceptron. The spatial attention $M_s$ is then applied to the channel-refined feature: $M_s(F') = \sigma(f^{7\times7}([\text{AvgPool}(F'); \text{MaxPool}(F')]))$, where $f^{7\times7}$ is a $7\times7$ convolution. The overall output is obtained as $F' = M_c(F) \otimes F$ followed by $F'' = M_s(F') \otimes F'$, with $\otimes$ denoting element-wise multiplication. This dual attention ensures that the network prioritizes relevant track features, which is crucial for handling occlusions and the varied lighting conditions captured by the Unmanned Aerial Vehicle.
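
A compact PyTorch sketch of this attention block follows. The channel-reduction ratio of 16 is an assumed hyperparameter, as it is not specified above.

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Channel attention followed by spatial attention, as described above."""
    def __init__(self, channels, reduction=16, spatial_kernel=7):
        super().__init__()
        # Shared MLP for channel attention
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )
        # 7x7 convolution over the concatenated pooled spatial maps
        self.spatial = nn.Conv2d(2, 1, spatial_kernel, padding=spatial_kernel // 2, bias=False)

    def forward(self, f):
        # Channel attention: M_c(F) = sigmoid(MLP(AvgPool(F)) + MLP(MaxPool(F)))
        avg = self.mlp(f.mean(dim=(2, 3), keepdim=True))
        mx = self.mlp(f.amax(dim=(2, 3), keepdim=True))
        f = torch.sigmoid(avg + mx) * f
        # Spatial attention: M_s(F') = sigmoid(Conv7x7([AvgPool(F'); MaxPool(F')]))
        avg_s = f.mean(dim=1, keepdim=True)
        max_s = f.amax(dim=1, keepdim=True)
        return torch.sigmoid(self.spatial(torch.cat([avg_s, max_s], dim=1))) * f
```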

Third, we redesign the Atrous Spatial Pyramid Pooling (ASPP) module to better capture the linear geometry of tracks. The standard ASPP uses square convolutional kernels, which are suboptimal for elongated structures such as rails. Our improved module, termed ASPP_SPECA, incorporates a Strip Pooling Module (SPM) and Efficient Channel Attention (ECA). The SPM extracts long-range contextual dependencies along the horizontal and vertical directions: for an input $F_{in}$, average pooling along rows and columns yields strip features $F_h \in \mathbb{R}^{C \times 1 \times W}$ and $F_v \in \mathbb{R}^{C \times H \times 1}$, which are refined with one-dimensional convolutions, broadcast back to $C \times H \times W$, and fused as $X = \text{Conv}_{1\times1}\big(\sigma(\text{Conv}_{1D}(F_h) + \text{Conv}_{1D}(F_v))\big)$. The output is $F_{out} = F_{in} + X$. Additionally, ECA is embedded in each ASPP branch to adaptively weight channels: for an input $f_{in}$ with $C$ channels, the kernel size $k$ is determined by $k = \left| \frac{\log_2 C + b}{\gamma} \right|_{\text{odd}}$, where $b$ and $\gamma$ are hyperparameters and $|\cdot|_{\text{odd}}$ denotes rounding to the nearest odd integer. The attention is applied as $f_{out} = \sigma\big(\text{Conv1D}_k(\text{GAP}(f_{in}))\big) \otimes f_{in}$, where GAP is global average pooling, enhancing feature representation without significant computational overhead. This fusion allows our model to maintain high accuracy while remaining deployable on the JUYE UAV for real-time processing.
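
The two building blocks could look roughly like the following PyTorch sketch. The strip-convolution kernel size of 3 and the ECA defaults $\gamma = 2$, $b = 1$ are assumptions, since their values are not given above.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class ECA(nn.Module):
    """Efficient Channel Attention: a 1D convolution over globally pooled channel statistics."""
    def __init__(self, channels, gamma=2, b=1):
        super().__init__()
        k = int(abs((math.log2(channels) + b) / gamma))   # adaptive kernel size
        k = k if k % 2 else k + 1                         # round up to an odd integer
        self.conv = nn.Conv1d(1, 1, k, padding=k // 2, bias=False)

    def forward(self, x):
        y = x.mean(dim=(2, 3))                            # GAP: (N, C, H, W) -> (N, C)
        y = self.conv(y.unsqueeze(1)).squeeze(1)          # local cross-channel interaction
        return torch.sigmoid(y)[..., None, None] * x      # f_out = sigma(Conv1D_k(GAP(f_in))) * f_in

class StripPooling(nn.Module):
    """Strip Pooling Module: long-range context along horizontal and vertical strips."""
    def __init__(self, channels):
        super().__init__()
        self.conv_h = nn.Conv2d(channels, channels, (1, 3), padding=(0, 1), bias=False)
        self.conv_v = nn.Conv2d(channels, channels, (3, 1), padding=(1, 0), bias=False)
        self.fuse = nn.Conv2d(channels, channels, 1, bias=False)

    def forward(self, x):
        n, c, h, w = x.shape
        f_h = self.conv_h(F.adaptive_avg_pool2d(x, (1, w)))   # C x 1 x W strip
        f_v = self.conv_v(F.adaptive_avg_pool2d(x, (h, 1)))   # C x H x 1 strip
        # Broadcast both strips back to H x W, fuse, and add residually: F_out = F_in + X
        strips = torch.sigmoid(f_h.expand(n, c, h, w) + f_v.expand(n, c, h, w))
        return x + self.fuse(strips)
```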

To validate our algorithm, we constructed a dedicated dataset using JUYE UAV, flown at altitudes of 15–50 meters over freight yards and railway training facilities. This ensured minimal disruption to operations while capturing diverse scenarios, including rain, snow, and fog. The dataset comprised 347 original images, augmented to 965 via techniques like random flipping, brightness adjustment, and Mix-up, which blends images with blurred backgrounds to simulate real-world noise. We split the data into 70% training, 20% validation, and 10% testing sets. The annotation was performed using ISTA-SAM software, categorizing pixels into track and background regions. This dataset’s variety ensures robustness for Unmanned Aerial Vehicle applications in adverse weather.
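
As an illustration, the following sketch implements one plausible reading of the flipping, brightness, and Mix-up steps described above for an image/mask pair; the flip probability, brightness range, blur kernel, and blending coefficient are all assumed values rather than settings reported in this work.

```python
import cv2
import numpy as np

def augment(image, mask, rng=None):
    """Augment an HWC uint8 image and its HW uint8 mask as described above (a sketch)."""
    rng = rng if rng is not None else np.random.default_rng()
    # Random horizontal flip, applied jointly so image and mask stay aligned
    if rng.random() < 0.5:
        image, mask = image[:, ::-1], mask[:, ::-1]
    # Random brightness adjustment of roughly +/- 30%
    image = np.clip(image.astype(np.float32) * rng.uniform(0.7, 1.3), 0, 255)
    # Mix-up style blend with a blurred copy of the scene to simulate background noise
    blurred = cv2.GaussianBlur(image, (31, 31), 0)
    lam = rng.uniform(0.7, 1.0)                  # blending coefficient (assumed range)
    image = lam * image + (1 - lam) * blurred
    return image.astype(np.uint8), np.ascontiguousarray(mask)
```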

Our experiments compared the proposed DeepLabV3plus_SPECA against state-of-the-art segmentation networks, including U-Net, FCN, ENet, ERFNet, and YOLOv11, all trained and tested under identical conditions. Evaluation metrics included mean Intersection over Union (mIoU), mean Pixel Accuracy (mPA), and Loss, defined as: $$\text{mIoU} = \frac{1}{n} \sum_{c=0}^{n-1} \frac{|A_c \cap B_c|}{|A_c \cup B_c|}, \quad \text{mPA} = \frac{1}{n} \sum_{c=0}^{n-1} \frac{\text{TP}_c}{\text{TP}_c + \text{FN}_c}, \quad \text{Loss} = 1 - \frac{2 \sum_{i=1}^{N} y_i \hat{y}_i}{\sum_{i=1}^{N} y_i + \sum_{i=1}^{N} \hat{y}_i}$$ where $A_c$ and $B_c$ are the predicted and ground-truth regions for class $c$, $y_i$ and $\hat{y}_i$ are the true and predicted labels for pixel $i$, and $n$ is the number of classes. The results, summarized in Table 1, demonstrate that our model achieves superior performance, with an mIoU of 97.448% and an mPA of 95.743%, while reducing inference time by 43% and model size by nearly 89% compared to the baseline DeepLabV3plus.

Table 1: Performance Comparison of Segmentation Networks on Freight Track Dataset
| Network | mIoU (%) | mPA (%) | Loss | Inference Time (ms) | Model Size (MB) |
|---|---|---|---|---|---|
| FCN | 87.314 | 88.753 | 0.18138 | 86.43 | 43.4 |
| U-Net | 87.531 | 88.797 | 0.02949 | 89.26 | 44.9 |
| ENet | 49.755 | 63.415 | 0.13147 | 23.68 | 4.39 |
| ERFNet | 90.266 | 89.109 | 0.02090 | 41.27 | 17.9 |
| YOLOv11 | 90.033 | 87.539 | 0.24028 | 56.73 | 43.0 |
| DeepLabV3plus | 88.529 | 88.887 | 0.02193 | 78.43 | 209 |
| Proposed Model | 97.448 | 95.743 | 0.00978 | 44.27 | 23.5 |
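
The evaluation metrics defined above map directly onto a short NumPy sketch such as the one below; the epsilon term in the Dice loss is an implementation detail added here for numerical stability and is not part of the definition in the text.

```python
import numpy as np

def segmentation_metrics(pred, gt, num_classes=2):
    """Per-class IoU and pixel accuracy from integer label maps, averaged into mIoU and mPA."""
    ious, accs = [], []
    for c in range(num_classes):
        p, g = pred == c, gt == c
        inter = np.logical_and(p, g).sum()
        union = np.logical_or(p, g).sum()
        ious.append(inter / union if union else 1.0)      # |A_c ∩ B_c| / |A_c ∪ B_c|
        accs.append(inter / g.sum() if g.sum() else 1.0)  # TP_c / (TP_c + FN_c)
    return float(np.mean(ious)), float(np.mean(accs))

def dice_loss(y_true, y_pred, eps=1e-6):
    """Dice loss: 1 - 2*sum(y*y_hat) / (sum(y) + sum(y_hat))."""
    y_true = y_true.ravel().astype(np.float64)
    y_pred = y_pred.ravel().astype(np.float64)
    return 1.0 - (2.0 * (y_true * y_pred).sum() + eps) / (y_true.sum() + y_pred.sum() + eps)
```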

Ablation studies were conducted to isolate the impact of each modification, as shown in Table 2. Starting from the baseline DeepLabV3plus, incremental additions of ECA, CBAM, SP, and the full ASPP_SPECA module were evaluated, both with and without the MobileNetV2 backbone. The results confirm that each component contributes to the performance gains, with the combined model achieving the best balance of accuracy and efficiency. For instance, replacing the backbone with MobileNetV2 alone reduced model size but incurred an accuracy drop; with CBAM and ASPP_SPECA added, however, the model recovered and exceeded baseline performance. This highlights the importance of our multi-feature fusion approach for Unmanned Aerial Vehicle-based segmentation.

Table 2: Ablation Study on Component Contributions
| Configuration | mIoU (%) | mPA (%) | Loss | Inference Time (ms) | Model Size (MB) |
|---|---|---|---|---|---|
| Baseline (DeepLabV3plus) | 88.529 | 88.887 | 0.02193 | 78.43 | 209 |
| + ECA | 88.152 | 89.718 | 0.02871 | 78.47 | 209 |
| + CBAM | 88.255 | 89.714 | 0.02931 | 79.28 | 211 |
| + SP | 90.587 | 91.136 | 0.01873 | 84.08 | 236 |
| + ASPP_SPECA | 93.288 | 93.472 | 0.01336 | 84.08 | 237 |
| + MobileNetV2 | 86.343 | 85.603 | 0.03346 | 43.07 | 22.4 |
| + MobileNetV2 + ECA | 86.683 | 87.131 | 0.02960 | 43.26 | 22.4 |
| + MobileNetV2 + CBAM | 89.601 | 88.924 | 0.02319 | 48.42 | 22.4 |
| + MobileNetV2 + SP | 92.415 | 92.367 | 0.02316 | 43.33 | 23.5 |
| + MobileNetV2 + ASPP_SPECA | 93.737 | 93.023 | 0.02197 | 43.67 | 23.5 |
| Full Proposed Model | 97.448 | 95.743 | 0.00978 | 44.27 | 23.5 |

Further analysis using Grad-CAM heatmaps visualized the feature extraction capabilities of different modules. Compared to the standard ASPP, which showed dispersed responses with background noise, our ASPP_SPECA produced focused activations along track regions, confirming its efficacy in capturing linear structures. This is critical for Unmanned Aerial Vehicle imagery, where tracks often span long distances and require consistent feature mapping. The integration of ECA and SPM ensured that the model adapts to various scales and orientations, as commonly encountered in JUYE UAV footage.
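
For readers wishing to reproduce such visualizations, a minimal Grad-CAM sketch for the track class is shown below; `model`, the chosen `layer`, and `target_class=1` are placeholders rather than details taken from this work, and existing Grad-CAM libraries could be used instead.

```python
import torch

def grad_cam(model, layer, image, target_class=1):
    """Minimal Grad-CAM: weight a layer's activations by the gradient of the class score."""
    feats, grads = [], []
    h1 = layer.register_forward_hook(lambda m, i, o: feats.append(o))
    h2 = layer.register_full_backward_hook(lambda m, gi, go: grads.append(go[0]))
    try:
        logits = model(image)                          # (1, num_classes, H, W) segmentation logits
        score = logits[:, target_class].sum()          # aggregate track-class score over all pixels
        model.zero_grad()
        score.backward()
    finally:
        h1.remove()
        h2.remove()
    weights = grads[0].mean(dim=(2, 3), keepdim=True)  # global-average-pooled gradients
    cam = torch.relu((weights * feats[0]).sum(dim=1))  # weighted sum of activation maps
    return cam / (cam.max() + 1e-8)                    # normalized heatmap in [0, 1]
```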

In conclusion, our dual-dimensional multi-feature fusion algorithm effectively addresses the challenges of freight track segmentation from an Unmanned Aerial Vehicle perspective. By combining MobileNetV2 for efficiency, CBAM for targeted feature enhancement, and ASPP_SPECA for multi-scale context, we achieve a balance of high accuracy and real-time performance. The proposed model, optimized for deployment on the JUYE UAV, demonstrates significant improvements over existing methods, with a roughly 10% relative increase in mIoU and a 55.4% reduction in loss compared to the baseline DeepLabV3plus. This advancement paves the way for intelligent railway inspection systems, where precise track boundaries enable reliable intrusion detection. Future work will focus on integrating this algorithm with object detection models for end-to-end monitoring, exploring pruning techniques for further optimization, and expanding the dataset to cover more track varieties and environmental conditions. Unmanned Aerial Vehicle technology, and the JUYE UAV in particular, will continue to drive innovation in railway safety and automation, contributing to the broader goals of smart transportation networks.
