UAV-Based Wind Turbine Hub Inspection: An Advanced Image Segmentation Methodology

The safe and stable operation of wind turbines, core equipment in the global transition to renewable energy, is of paramount importance. The hub, a critical component connecting the blades to the main shaft, is subjected to immense cyclic mechanical stresses during prolonged operation. This can lead to structural damage such as microscopic cracks, corrosion, or bolt loosening. Left undetected, these faults can escalate into catastrophic failures, resulting in significant safety hazards and economic losses. Traditional manual inspection methods are plagued by inefficiency, high risk, and numerous blind spots. The advent of unmanned aerial vehicle (UAV) technology, equipped with high-resolution visual sensors, offers a transformative solution for rapid and comprehensive data acquisition, positioning it as the mainstream approach for intelligent hub maintenance.

However, the task of segmenting targets, specifically the hub structure, from UAV drone inspection imagery presents formidable challenges. These images are often characterized by complex backgrounds (e.g., towers, vegetation, sky), low-contrast and minute defect features (sub-millimeter cracks), severe illumination variations, and occlusions from other mechanical parts. Conventional segmentation methods, including those based on superpixel merging or classical machine learning models, frequently struggle in this domain. They often fail to adequately capture and utilize the intricate spatial positional and hierarchical channel information present in the images, leading to suboptimal segmentation performance, especially for fine details and small targets. To address these critical limitations and enable automated defect diagnosis, there is a pressing need for high-precision, efficient, and robust target segmentation methodologies tailored for UAV drone imagery in wind farm operations.

Proposed Methodology: An Enhanced BiSeNet V2 Framework

To effectively tackle the segmentation challenges posed by UAV inspection images of wind turbine hubs, we propose a target segmentation method centered on an improved BiSeNet V2 architecture. The core innovation lies in augmenting this efficient two-branch network with mechanisms specifically designed to enhance feature representation, fusion, and training for the unique challenges of this application. The overall architecture balances segmentation accuracy with computational efficiency, a crucial consideration for processing the large volumes of data collected by UAV fleets.

The foundational BiSeNet V2 model employs a dual-branch structure to achieve a balance between detail preservation and semantic understanding:

  1. Detail Branch: This branch is a lightweight convolutional network (e.g., stacked 3×3 convolutions) responsible for capturing high-resolution spatial details and fine-grained edges of the hub. It maintains a large feature map size to preserve precise spatial information, which is vital for accurately delineating the hub’s contours and small structural components.
  2. Semantic Branch: This branch utilizes a fast-downsampling stem followed by a context path (e.g., with Global Average Pooling and feature pyramid modules) to rapidly enlarge the receptive field and extract rich, high-level semantic features. This enables the model to understand the “context” of the hub within the complex scene, distinguishing it from background clutter commonly encountered in UAV drone footage.
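The two-branch layout described above can be sketched in PyTorch (the framework named in the implementation section). Channel widths, stage counts, and module names here are illustrative assumptions, not the authors' exact configuration:

```python
# Sketch of the dual-branch BiSeNet V2 layout; widths are assumptions.
import torch
import torch.nn as nn

def conv_bn_relu(c_in, c_out, stride=1):
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, stride=stride, padding=1, bias=False),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True),
    )

class DetailBranch(nn.Module):
    """Shallow and wide: stops at 1/8 resolution to preserve edges."""
    def __init__(self):
        super().__init__()
        self.stage = nn.Sequential(
            conv_bn_relu(3, 64, stride=2),
            conv_bn_relu(64, 128, stride=2),
            conv_bn_relu(128, 256, stride=2),
        )
    def forward(self, x):
        return self.stage(x)  # (B, 256, H/8, W/8)

class SemanticBranch(nn.Module):
    """Deep and narrow: fast downsampling for a large receptive field."""
    def __init__(self):
        super().__init__()
        self.stem = nn.Sequential(
            conv_bn_relu(3, 32, stride=2),
            conv_bn_relu(32, 64, stride=2),
            conv_bn_relu(64, 128, stride=2),
            conv_bn_relu(128, 256, stride=2),
            conv_bn_relu(256, 256, stride=2),  # 1/32 resolution
        )
        self.context = nn.AdaptiveAvgPool2d(1)  # global context via GAP
    def forward(self, x):
        feat = self.stem(x)
        return feat + self.context(feat)  # broadcast global context back

x = torch.randn(1, 3, 256, 256)
fd, fs = DetailBranch()(x), SemanticBranch()(x)
print(fd.shape, fs.shape)  # (1, 256, 32, 32) and (1, 256, 8, 8)
```

The Detail Branch keeps a large feature map for edge fidelity, while the Semantic Branch reaches 1/32 resolution and injects a global-context term via pooling.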

The outputs from these two branches are fused using a Bilateral Guided Aggregation (BGA) layer. Our proposed enhancements are integrated at three key points within this framework: the feature extraction process, the feature fusion mechanism, and the optimization objective. A summary of the model’s key components and their comparative roles is provided in Table 1.

Table 1: Summary of Key Components in the Proposed Segmentation Model
| Component | Primary Function | Key Enhancement for UAV Imagery |
| --- | --- | --- |
| Detail Branch | Extracts high-resolution spatial features (edges, texture). | Preserves fine details of hub bolts, seams, and potential micro-cracks. |
| Semantic Branch | Extracts high-level contextual features. | Provides global scene understanding to isolate the hub from complex backgrounds (tower, sky, vegetation). |
| Hybrid Attention Module (HAM) | Enhances feature representation by focusing on important spatial locations and channels. | Amplifies responses from small, subtle hub defects and suppresses irrelevant background noise. |
| Enhanced BGA Layer | Intelligently fuses features from the Detail and Semantic branches. | Ensures deep semantic guidance refines shallow details, crucial for precise boundary localization in cluttered UAV views. |
| Dynamic Threshold Loss Function | Guides model training by penalizing prediction errors. | Dynamically adjusts focus to improve recognition of small, hard-to-segment hub components and defects. |

1. Hybrid Attention Module for Enhanced Feature Representation

The raw features extracted by the two branches, while informative, may not optimally highlight the most relevant information for segmenting the hub, especially when defects are minuscule or contrast is low. To address this, we integrate a Hybrid Attention Module (HAM) that sequentially applies channel and spatial attention mechanisms. This module allows the model to adaptively “pay more attention” to informative feature channels and crucial spatial regions, significantly boosting its sensitivity to the subtle features characteristic of hub inspection imagery from UAV drones.

Channel Attention (CA): This sub-module focuses on “what” is important in the feature map. It squeezes global spatial information from a high-dimensional input feature map \( X_H \in \mathbb{R}^{C \times H \times W} \) using both average and max pooling operations along the spatial dimensions, generating two distinct channel-wise descriptors:
$$ z_{avg} = \text{GAP}(X_H), \quad z_{max} = \text{GMP}(X_H) $$
where \( \text{GAP} \) and \( \text{GMP} \) denote Global Average Pooling and Global Max Pooling, respectively. These descriptors are then processed by a shared multi-layer perceptron (MLP) to produce a channel attention map \( A_c \in \mathbb{R}^{C \times 1 \times 1} \):
$$ A_c = \sigma \left( \text{MLP}(z_{avg}) + \text{MLP}(z_{max}) \right) $$
Here, \( \sigma \) is the sigmoid activation function. The input features are then recalibrated by element-wise multiplication: \( X'_H = A_c \otimes X_H \).

Spatial Attention (SA): This sub-module focuses on “where” the informative regions are. It takes the channel-refined features \( X'_H \) and a complementary low-level detail feature map \( X_L \). First, it aggregates channel information by applying average and max pooling along the channel axis:
$$ s_{avg} = \text{AvgPool}_{channel}(X'_H), \quad s_{max} = \text{MaxPool}_{channel}(X'_H) $$
The resulting maps are concatenated and convolved by a standard 7×7 convolution layer to generate a spatial attention map \( A_s \in \mathbb{R}^{1 \times H \times W} \):
$$ A_s = \sigma \left( f^{7 \times 7}([s_{avg}; s_{max}]) \right) $$
The large 7×7 kernel is chosen to provide a broad receptive field, helping to capture larger structural context of the hub from the UAV drone perspective. The final output of the HAM is the spatially and channel-wise refined feature map:
$$ X_{out} = A_s \otimes X'_H $$
The HAM is strategically placed within the network to process multi-scale features, ensuring that both semantic and detail pathways learn to focus on the most discriminative information for hub segmentation.
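As a concrete illustration, the channel- and spatial-attention equations above map onto a compact CBAM-style PyTorch module; the reduction ratio of 16 is an assumption, not the authors' stated value:

```python
# Sketch of the Hybrid Attention Module following the equations above.
import torch
import torch.nn as nn

class HybridAttention(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        # Shared MLP for channel attention: A_c = sigma(MLP(z_avg) + MLP(z_max))
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )
        # 7x7 conv for spatial attention: A_s = sigma(f_7x7([s_avg; s_max]))
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3, bias=False)

    def forward(self, x):
        # Channel attention: squeeze spatial dims with GAP and GMP
        z_avg = torch.mean(x, dim=(2, 3), keepdim=True)
        z_max = torch.amax(x, dim=(2, 3), keepdim=True)
        a_c = torch.sigmoid(self.mlp(z_avg) + self.mlp(z_max))
        x = x * a_c                                    # X'_H = A_c (x) X_H
        # Spatial attention: pool along the channel axis, then 7x7 conv
        s_avg = torch.mean(x, dim=1, keepdim=True)
        s_max = torch.amax(x, dim=1, keepdim=True)
        a_s = torch.sigmoid(self.spatial(torch.cat([s_avg, s_max], dim=1)))
        return x * a_s                                 # X_out = A_s (x) X'_H

ham = HybridAttention(256)
y = ham(torch.randn(2, 256, 32, 32))
print(y.shape)  # torch.Size([2, 256, 32, 32])
```

The module preserves the input shape, so it can be dropped in at any junction of the network without changing downstream layer sizes.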

2. Enhanced Bilateral Guided Aggregation Layer

Effective fusion of the shallow detail features (rich in spatial info, poor in semantics) and deep semantic features (rich in context, low in resolution) is critical. The original BGA layer performs this fusion. We enhance this process to ensure deeper semantics effectively guide the refinement of shallower details, which is paramount for accurate boundary delineation in complex UAV drone scenes with occlusions and shadows.

Let \( F_d \) denote features from the Detail Branch and \( F_s \) denote features from the Semantic Branch. The enhanced BGA layer performs a guided aggregation as follows:

  1. Guidance Generation: From the semantic feature \( F_s \), a guidance map \( G \) is generated via a lightweight convolution block: \( G = \psi(F_s) \), where \( \psi \) represents a small convolutional network. This map encodes semantic priors about “what should be emphasized.”
  2. Detail Transformation: The detail feature \( F_d \) is processed through a separate convolution block to align its channels and prepare it for fusion: \( F'_d = \phi(F_d) \).
  3. Guided Fusion: Instead of simple addition or concatenation, the fusion is conditioned on the guidance map. One effective mechanism is using the guidance map to weight the contribution of detail features:
    $$ F_{fused} = \alpha(G) \odot F'_d + \beta(G) \odot \text{UP}(F_s) $$
    where \( \odot \) is element-wise multiplication, \( \text{UP} \) denotes upsampling to match spatial dimensions, and \( \alpha \) and \( \beta \) are learnable functions (e.g., 1×1 convolutions) that generate modulation weights from the guidance map \( G \). This allows the network to dynamically decide, based on semantic context, how much to trust the detailed spatial information at each location, effectively suppressing spurious details from the background while enhancing relevant hub boundaries.
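A minimal sketch of this guided fusion, assuming \( \alpha \) and \( \beta \) are realized as 1×1 convolutions with sigmoid gating and \( \psi, \phi \) as 3×3 convolutions (all names and sizes illustrative):

```python
# Sketch of the guided aggregation step: F_fused = a(G)*F'_d + b(G)*UP(F_s).
import torch
import torch.nn as nn
import torch.nn.functional as F

class GuidedAggregation(nn.Module):
    def __init__(self, c_detail, c_sem, c_out):
        super().__init__()
        self.psi = nn.Conv2d(c_sem, c_out, 3, padding=1)     # G = psi(F_s)
        self.phi = nn.Conv2d(c_detail, c_out, 3, padding=1)  # F'_d = phi(F_d)
        self.alpha = nn.Conv2d(c_out, c_out, 1)              # modulation weights
        self.beta = nn.Conv2d(c_out, c_out, 1)
        self.proj = nn.Conv2d(c_sem, c_out, 1)               # align semantic channels

    def forward(self, f_d, f_s):
        # Guidance map from the semantic branch, upsampled to detail resolution
        g = F.interpolate(self.psi(f_s), size=f_d.shape[2:],
                          mode='bilinear', align_corners=False)
        f_d2 = self.phi(f_d)
        f_s_up = F.interpolate(self.proj(f_s), size=f_d.shape[2:],
                               mode='bilinear', align_corners=False)
        # Fusion conditioned on G: semantics decide how much detail to trust
        return (torch.sigmoid(self.alpha(g)) * f_d2
                + torch.sigmoid(self.beta(g)) * f_s_up)

bga = GuidedAggregation(c_detail=256, c_sem=256, c_out=128)
out = bga(torch.randn(1, 256, 32, 32), torch.randn(1, 256, 8, 8))
print(out.shape)  # torch.Size([1, 128, 32, 32])
```

The sigmoid gates keep both modulation weights in (0, 1), so the fusion interpolates per location between detail and semantic evidence rather than simply summing them.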

3. Dynamic Threshold Loss Function for Optimized Training

Training a segmentation model for UAV drone hub images involves a severe class imbalance: a vast majority of pixels belong to the background (sky, tower, ground), while the hub and, especially, potential defect pixels constitute a tiny minority. Standard loss functions like Cross-Entropy can become dominated by the background class, leading to poor learning of the target classes.

To combat this, we propose a Dynamic Threshold Loss (DTL) function that adaptively adjusts its focus during training. The total loss \( L_{total} \) is a combination of a segmentation loss \( L_{seg} \) and an auxiliary loss \( L_{aux} \) from the semantic branch’s side output: \( L_{total} = L_{seg} + \lambda L_{aux} \), where \( \lambda \) is a balancing weight.

The core innovation lies in \( L_{seg} \), which is a modified form of focal loss combined with a dynamically adjusted probability threshold. Let \( p_t \) be the model’s estimated probability for the true class. The standard focal loss is:
$$ L_{focal} = -\alpha_t (1 - p_t)^{\gamma} \log(p_t) $$
where \( \alpha_t \) is a balancing factor for class \( t \), and \( \gamma \) is a focusing parameter. We introduce a dynamic threshold \( T_k \) for the \( k \)-th class (e.g., hub, defect, background), which is updated based on the model’s performance on the training batch. The modified loss discourages the model from being over-confident on easy background pixels and focuses more on harder foreground (hub) pixels and those where prediction confidence is around the threshold:
$$ L_{seg} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} \mathbb{1}_{[y_i=k]} \cdot \alpha_k \cdot \left| p_{i,k} - T_k \right|^{\gamma} \cdot \log(p_{i,k}) $$
Here, \( N \) is the number of pixels, \( K \) is the number of classes, \( y_i \) is the ground truth label, \( p_{i,k} \) is the predicted probability for class \( k \) at pixel \( i \), and \( \mathbb{1} \) is the indicator function. The threshold \( T_k \) for each class can be set as, for instance, the mean predicted probability for that class in the previous batch or a running average. This dynamic mechanism continuously challenges the model, particularly on classes that are being segmented poorly (like small hub parts), leading to more robust feature learning for the challenges inherent in UAV drone imagery.
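The loss above can be sketched as follows; the per-class thresholds \( T_k \) live in a buffer and are updated externally (e.g., by a moving average of batch statistics). Hyperparameter values here are assumptions:

```python
# Sketch of the Dynamic Threshold Loss from the equation above.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicThresholdLoss(nn.Module):
    def __init__(self, num_classes, alpha=None, gamma=2.0, init_threshold=0.5):
        super().__init__()
        self.gamma = gamma
        alpha = torch.ones(num_classes) if alpha is None else torch.as_tensor(alpha)
        self.register_buffer('alpha', alpha)                 # alpha_k per class
        self.register_buffer('thresholds',
                             torch.full((num_classes,), init_threshold))  # T_k

    def forward(self, logits, target):
        # logits: (B, K, H, W); target: (B, H, W) of class indices
        probs = F.softmax(logits, dim=1).clamp_min(1e-7)
        # p_{i,k} for the true class k = y_i at each pixel
        p_t = probs.gather(1, target.unsqueeze(1)).squeeze(1)  # (B, H, W)
        t_k = self.thresholds[target]                          # T_{y_i}
        a_k = self.alpha[target]                               # alpha_{y_i}
        # |p - T_k|^gamma down-weights pixels whose confidence sits at T_k
        loss = -a_k * (p_t - t_k).abs().pow(self.gamma) * p_t.log()
        return loss.mean()

criterion = DynamicThresholdLoss(num_classes=2)
logits = torch.randn(2, 2, 16, 16)
target = torch.randint(0, 2, (2, 16, 16))
loss = criterion(logits, target)
print(loss.item() >= 0)  # True: every term is non-negative
```

Because every factor in the summand is non-negative, the loss is bounded below by zero, and the gradient signal concentrates on pixels whose confidence is far from the class threshold.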

Experimental Setup and Implementation

To validate the proposed methodology, a comprehensive experimental study was conducted. A specialized dataset was constructed using imagery collected from 3MW-class wind turbines at a major wind farm. A UAV drone (DJI Matrice 300 RTK) equipped with a high-resolution Zenmuse H20T visual sensor (20 MP visible light camera with 20x optical zoom) was deployed for inspection flights. Flights were conducted under various lighting conditions and angles to capture the diversity of real-world scenarios. The hub region in each image was meticulously annotated by experts at the pixel level, creating ground truth masks for segmentation evaluation. The dataset was split into training, validation, and test sets with a ratio of 70:15:15. Key dataset statistics are summarized in Table 2.

Table 2: Description of the UAV Drone Hub Inspection Dataset
| Parameter | Specification / Value |
| --- | --- |
| Source | On-site UAV inspection of 3 MW-class turbines |
| UAV & Sensor | DJI Matrice 300 RTK with Zenmuse H20T |
| Image Resolution | Approx. 2000 × 1500 pixels (variable) |
| Primary Content | Turbine hub under various views/lighting |
| Challenges | Complex background, shadows, occlusions, scale variation |
| Annotation Type | Pixel-level semantic segmentation (hub vs. background) |
| Total Annotated Images | 850 |
| Training/Validation/Test Split | 595 / 128 / 127 images |

The proposed model was implemented using the PyTorch framework. The Detail Branch used three 3×3 convolutional layers with channel dimensions [64, 128, 256]. The Semantic Branch utilized a modified, lightweight ResNet-18 backbone. The HAM was inserted at the junctions of mid-level features. The model was trained using the Adam optimizer with an initial learning rate of 1e-3, a batch size of 8, and for 100 epochs. The dynamic threshold parameters were initialized at \( T_{hub}=0.7, T_{bg}=0.3 \) and updated using an exponential moving average. Performance was evaluated using standard segmentation metrics: Mean Intersection over Union (mIoU) and F1-Score (Dice coefficient).
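The exponential-moving-average update of the dynamic thresholds mentioned above might look like the following sketch; the momentum value and the choice of the batch-mean true-class probability as the update target are assumptions:

```python
# Hypothetical EMA update for the per-class thresholds T_k.
import torch

@torch.no_grad()
def update_thresholds(thresholds, probs, target, num_classes, momentum=0.9):
    """Move each class threshold toward that class's mean predicted
    probability in the current batch; classes absent from the batch
    keep their previous threshold."""
    # probs: (B, K, H, W) softmax output; target: (B, H, W) class indices
    p_t = probs.gather(1, target.unsqueeze(1)).squeeze(1)  # true-class prob
    for k in range(num_classes):
        mask = (target == k)
        if mask.any():
            batch_mean = p_t[mask].mean()
            thresholds[k] = momentum * thresholds[k] + (1 - momentum) * batch_mean
    return thresholds

t = torch.tensor([0.3, 0.7])  # e.g. T_bg and T_hub from the init above
probs = torch.softmax(torch.randn(1, 2, 8, 8), dim=1)
target = torch.randint(0, 2, (1, 8, 8))
t = update_thresholds(t, probs, target, num_classes=2)
print(bool(((t >= 0) & (t <= 1)).all()))  # True: EMA stays in [0, 1]
```

Since the update is a convex combination of values in [0, 1], the thresholds remain valid probabilities throughout training.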

Results and Analysis

The proposed method demonstrated exceptional performance in segmenting the wind turbine hub from challenging UAV drone inspection imagery. Visual assessment of the results (see examples in the original paper) showed that the model could clearly separate the hub’s main structure, including intricate details like blade connection interfaces, bolt arrays, and surface textures. Boundaries were accurately localized even against complex backgrounds such as the tower, cables, and sky, with minimal instances of under-segmentation (missing parts) or over-segmentation (including background). The model showed strong adaptability to varying lighting conditions and partial occlusions, a testament to the effectiveness of the hybrid attention and guided fusion mechanisms.

Ablation studies were conducted to quantitatively evaluate the contribution of each proposed component. The results, presented in Table 3, clearly indicate that each component (the Hybrid Attention Module, the enhanced BGA layer, and the Dynamic Threshold Loss) provides a significant and cumulative boost to performance. The full model achieves an mIoU of 83.2% and an F1-Score of 88.6%, significantly outperforming the baseline BiSeNet V2 model (72.3% mIoU). This systematic improvement validates the design rationale behind each enhancement.

Table 3: Ablation Study Results (Performance on Test Set)
| Model Configuration | mIoU (%) | F1-Score (%) | Relative Improvement (mIoU) |
| --- | --- | --- | --- |
| Baseline BiSeNet V2 | 72.3 | 78.5 | |
| + Hybrid Attention Module (HAM) | 76.8 | 82.1 | +4.5 |
| + Enhanced BGA Layer | 78.1 | 83.4 | +5.8 |
| + Dynamic Threshold Loss (DTL) | 79.5 | 84.7 | +7.2 |
| Full Proposed Model (all components) | 83.2 | 88.6 | +10.9 |

Furthermore, comparative experiments were performed against other methods applicable to UAV image analysis, including a fuzzy clustering segmentation method and an offline Gaussian model-based method. As shown in Table 4, the proposed model substantially outperforms both, underscoring the advantage of a deep learning approach enhanced with mechanisms targeted at this specific domain. The gain of roughly 6 to 8 percentage points in mIoU is particularly significant in industrial inspection applications, where precision is critical.

Table 4: Comparative Analysis with Other Methods
| Method | mIoU (%) | F1-Score (%) | Key Characteristics |
| --- | --- | --- | --- |
| Fuzzy Clustering Segmentation | 75.1 | 80.2 | Relies on spectral features; struggles with complex textures and shadows common in UAV images. |
| Offline Gaussian Model Method | 77.3 | 81.9 | Requires background modeling; sensitive to dynamic scenes and viewpoint changes. |
| Proposed Method | 83.2 | 88.6 | End-to-end deep learning with attention and dynamic loss; robust to UAV imaging variations. |

Discussion and Implications

The success of the proposed methodology can be attributed to its holistic design, which addresses the specific pain points of UAV drone-based wind turbine inspection. The Hybrid Attention Module directly counteracts the problem of small, low-contrast target features by forcing the network to amplify relevant signals. The enhanced BGA layer ensures that this refined, spatially-precise information is not washed away during fusion but is instead guided by strong semantic understanding, leading to crisp and accurate boundaries. Finally, the Dynamic Threshold Loss function acts as an intelligent training supervisor, continually pushing the model to improve on the most challenging pixels, which often correspond to the edges of the hub or tiny structural elements.

The high mIoU and F1-Score achieved (83.2% and 88.6%, respectively) are not merely academic metrics. In practical terms, they translate to a highly reliable segmentation mask that can be directly fed into downstream automated defect detection systems. Accurate segmentation isolates the region of interest (the hub), drastically reducing the search space for crack detection algorithms, corrosion assessment tools, or bolt absence checkers. This pipeline automation, enabled by robust segmentation, is the key to scaling UAV drone inspections from a data collection tool to a fully-fledged, predictive maintenance solution. It reduces human workload, minimizes subjective error, and allows for the frequent, systematic inspection of entire wind farms.

Future work will focus on extending this framework to multi-class segmentation, distinguishing not just the hub from background, but also segmenting different defect types (cracks, corrosion, paint loss) directly. Furthermore, exploring real-time or near-real-time versions of this model would enable onboard processing for UAV drones, allowing for immediate anomaly detection during the flight itself. The integration of temporal information from video sequences captured by UAV drones could also provide insights into defect progression over time.

Conclusion

This paper presents a novel, high-performance image segmentation method specifically designed for the challenging task of wind turbine hub inspection using UAV drone imagery. By building upon the efficient BiSeNet V2 architecture and strategically incorporating a Hybrid Attention Module, an enhanced Bilateral Guided Aggregation layer, and a Dynamic Threshold Loss function, the proposed model effectively overcomes the limitations of traditional methods in handling complex backgrounds, minute details, and class imbalance. Comprehensive experimental results on a real-world UAV drone inspection dataset confirm the model’s efficacy, demonstrating superior segmentation accuracy and robustness compared to baseline and alternative approaches. This work provides a reliable and advanced technological foundation for automating the visual inspection process in wind energy maintenance, paving the way for safer, more efficient, and more intelligent management of wind power assets through the extensive use of UAV drone technology.
