The rapid expansion of modern power grids has led to a dramatic increase in the number of critical assets, particularly overhead transmission line components. Among these, insulators play a vital role in mechanical support and electrical insulation. However, long-term exposure to harsh outdoor environments makes them susceptible to various defects, such as string drop, surface burn, string tilt, and surface cracking, which pose significant risks to grid reliability and safety. Traditional manual inspection methods are increasingly inadequate: they are time-consuming, costly, and potentially hazardous. The advent of Unmanned Aerial Vehicle (UAV) technology has revolutionized infrastructure inspection, offering a flexible, efficient, and high-resolution data acquisition platform. UAV drones can capture detailed remote sensing imagery of grid equipment from optimal angles and altitudes.

Nevertheless, automating defect recognition from UAV drone imagery presents substantial challenges. These images often exhibit spectral heterogeneity within homogeneous regions, and critical defects (like damaged insulator parts) typically appear as small targets occupying very few pixels. Furthermore, complex backgrounds, varying lighting conditions, occlusions, and atmospheric interference (e.g., haze, clouds) can severely degrade recognition performance. Therefore, developing a robust, automated methodology to enhance UAV drone image quality, extract discriminative features, and accurately localize and classify defects is an urgent and critical research problem for intelligent grid maintenance. This article proposes and elaborates on a complete methodological framework designed to address these challenges.
1. Methodological Framework for Automated Defect Recognition
The proposed framework consists of four core, interconnected stages: UAV drone image enhancement, defect feature extraction using an attention mechanism, optimization via a composite loss function, and the final automated defect identification through a convolutional network architecture.
1.1 Enhancement of UAV Drone Remote Sensing Imagery
The first step is to preprocess the raw imagery captured by the UAV drone to improve its quality and highlight latent features crucial for detecting small defects. We employ a frequency-domain enhancement technique. Let \(F(x, y)\) represent the original UAV-acquired remote sensing image of the power grid equipment. A forward transformation function \(\beta\{\cdot\}\) is applied to convert the image from the spatial domain to the frequency domain.
$$F_Z(x, y) = \beta\{F(x, y)\}$$
Here, \(F_Z(x, y)\) denotes the frequency-domain representation of the UAV drone image. Within this domain, we perform corrective processing by introducing a modulation coefficient \(\gamma(x, y)\), in general a frequency-dependent gain, which suppresses noise-dominated components and emphasizes the frequencies carrying edge information relevant to defect structures.
$$F_X(x, y) = F_Z(x, y) \times \gamma(x, y)$$
In this equation, \(F_X(x, y)\) is the corrected frequency-domain feature map. Finally, an inverse transformation \(\beta^{-1}\{\cdot\}\) is applied to reconstruct the enhanced image in the spatial domain.
$$F_L(x, y) = \beta^{-1}\{F_X(x, y)\}$$
The result, \(F_L(x, y)\), is the enhanced UAV drone remote sensing image with improved contrast and clarity, providing a superior input for subsequent feature extraction stages compared to the original raw data from the UAV drone survey.
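As a concrete illustration, the following is a minimal NumPy sketch that assumes the 2D discrete Fourier transform as the forward transform \(\beta\{\cdot\}\) and a simple high-frequency emphasis gain as \(\gamma(x, y)\); the `boost` and `cutoff` parameters are illustrative choices, not values prescribed by the method.

```python
import numpy as np

def enhance_frequency_domain(image: np.ndarray, boost: float = 1.5, cutoff: float = 0.1) -> np.ndarray:
    """Enhance a grayscale image via frequency-domain modulation:
    forward transform, element-wise gain, inverse transform."""
    # Forward transform beta{.}: spatial domain -> frequency domain (DC term centered)
    F_Z = np.fft.fftshift(np.fft.fft2(image.astype(np.float64)))

    # Frequency-dependent gain gamma(x, y): unity at low frequencies,
    # `boost` beyond the normalized `cutoff` radius (high-frequency emphasis)
    H, W = image.shape
    yy, xx = np.mgrid[0:H, 0:W]
    radius = np.hypot((yy - H / 2) / H, (xx - W / 2) / W)
    gamma = np.where(radius > cutoff, boost, 1.0)

    # Corrective processing F_X = F_Z * gamma, then inverse transform beta^{-1}{.}
    F_X = F_Z * gamma
    F_L = np.real(np.fft.ifft2(np.fft.ifftshift(F_X)))
    return np.clip(F_L, 0.0, 255.0)

# Usage on a synthetic 512 x 512 frame
frame = np.random.randint(0, 256, (512, 512)).astype(np.float64)
enhanced = enhance_frequency_domain(frame)
```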
1.2 Defect Feature Extraction via Coordinate Attention Mechanism
To effectively extract features of small defects from the enhanced UAV drone imagery, we incorporate a Coordinate Attention (CA) module. This module excels at capturing long-range spatial dependencies with precise positional information, which is vital for locating small, anomalous regions within a large image context. The CA mechanism aggregates spatial coordinate information into generated attention maps, allowing the network to focus on areas likely containing defects.
The process begins with the input feature map \(x_k\). Using global average pooling, we generate two sets of 1D feature maps that encode global contextual information along the height and width dimensions, respectively.
$$
\begin{aligned}
Z^h_k(h) &= \frac{1}{W} \sum_{0 \leq i < W} x_k(h, i) \\
Z^w_k(w) &= \frac{1}{H} \sum_{0 \leq j < H} x_k(j, w)
\end{aligned}
$$
Here, \(Z^h_k(h)\) and \(Z^w_k(w)\) are direction-aware feature descriptors for channel \(k\), obtained by pooling along the width and height, respectively; \(H\) and \(W\) are the height and width of the feature map, and \(h\), \(w\), \(i\), \(j\) are spatial indices. These two descriptors are then concatenated and processed through a shared \(1 \times 1\) convolutional transformation \(F\), followed by a nonlinear activation function \(l(\cdot)\) (e.g., h-swish or ReLU).
$$f = l\left(F\left[Z^w_k(w), Z^h_k(h)\right]\right)$$
The tensor \(f\) is then split along its spatial dimension back into two separate tensors, \(f^h\) and \(f^w\). Another pair of \(1 \times 1\) convolutional transforms, \(F_h\) and \(F_w\), are applied, followed by a sigmoid activation \(\sigma(\cdot)\) to produce the final attention weights for height and width.
$$
\begin{aligned}
g^h &= \sigma\left(F_h(f^h)\right) \\
g^w &= \sigma\left(F_w(f^w)\right)
\end{aligned}
$$
The final output feature map \(Y(i, j)\) of the CA module is obtained by recalibrating the input feature map \(x_k\) with these attention weights:
$$Y(i, j) = x_k(i, j) \times g^h(i) \times g^w(j)$$
This process allows the network to weight important regions in the UAV drone image selectively, significantly boosting its capability to recognize subtle defect patterns that are easily lost in complex backgrounds.
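To make the recalibration concrete, the following is a minimal PyTorch sketch of a Coordinate Attention block following the equations above; the reduction ratio, batch normalization, and Hardswish activation are illustrative choices rather than the exact configuration of the proposed network.

```python
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    """Coordinate Attention: directional pooling, shared 1x1 transform,
    split, per-direction 1x1 transforms, sigmoid gating."""
    def __init__(self, channels: int, reduction: int = 32):
        super().__init__()
        mid = max(8, channels // reduction)
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))   # average over width  -> Z^h, (B, C, H, 1)
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))   # average over height -> Z^w, (B, C, 1, W)
        self.shared = nn.Sequential(                    # shared transform F with nonlinearity l(.)
            nn.Conv2d(channels, mid, kernel_size=1),
            nn.BatchNorm2d(mid),
            nn.Hardswish(),
        )
        self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)   # F_h
        self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)   # F_w

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        _, _, h, w = x.shape
        z_h = self.pool_h(x)                            # (B, C, H, 1)
        z_w = self.pool_w(x).permute(0, 1, 3, 2)        # (B, C, W, 1)
        f = self.shared(torch.cat([z_h, z_w], dim=2))   # concatenate along the pooled axis
        f_h, f_w = torch.split(f, [h, w], dim=2)        # split back into the two directions
        g_h = torch.sigmoid(self.conv_h(f_h))                       # attention over height
        g_w = torch.sigmoid(self.conv_w(f_w.permute(0, 1, 3, 2)))   # attention over width
        return x * g_h * g_w                            # Y(i, j) = x_k(i, j) * g^h(i) * g^w(j)

# Usage: recalibrate a backbone feature map
feat = torch.randn(2, 64, 80, 80)
out = CoordinateAttention(64)(feat)    # same shape as `feat`
```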
1.3 Optimization with a Composite Loss Function
Training an effective defect recognizer requires a carefully designed loss function that guides the model on multiple objectives: accurate bounding box regression, correct object classification, and reliable confidence scoring. We adopt a composite loss function \(L\) comprising three key components:
$$L = L_{cls} + L_{obj} + L_{CIoU}$$
1. Bounding Box Regression Loss (\(L_{CIoU}\)): We use the Complete Intersection over Union (CIoU) loss, which considers the overlap area, center point distance, and aspect ratio consistency between the predicted box \(A\) and the ground truth box \(B\). The CIoU loss provides more stable gradients for bounding box regression than standard IoU loss, especially for small targets common in UAV drone imagery.
$$
\begin{aligned}
L_{CIoU} &= 1 - IoU + \frac{\rho^2(b, b^{gt})}{c^2} + \alpha v \\
IoU &= \frac{|A \cap B|}{|A \cup B|} \\
\alpha &= \frac{v}{(1 - IoU) + v} \\
v &= \frac{4}{\pi^2} \left( \arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h} \right)^2
\end{aligned}
$$
Here, \(\rho^2(b, b^{gt})\) is the squared Euclidean distance between the centers of the predicted and ground truth boxes, \(c\) is the diagonal length of the smallest enclosing box covering both boxes, and \(w, h, w^{gt}, h^{gt}\) are the width and height of the predicted and ground truth boxes, respectively. A code sketch of the CIoU computation is provided after the three loss components below.
2. Objectness Confidence Loss (\(L_{obj}\)): This measures the error in the model’s confidence that a given grid cell contains an object (defect). It uses binary cross-entropy.
$$
\begin{aligned}
L_{obj} &= \sum_{i=0}^{S^2} \sum_{j=0}^{B} \left[ \mathbb{1}_{ij}^{obj} \left( -\log(g_0) \right) + \lambda_{noobj}\, \mathbb{1}_{ij}^{noobj} \left( -\log(1 - g_0) \right) \right] \\
g_0 &= \sigma\left(IoU\left(g^{pre}, g^{truth}\right)\right)
\end{aligned}
$$
Where \(S^2\) is the number of grid cells, \(B\) is the number of anchor boxes, \(\mathbb{1}_{ij}^{obj}\) and \(\mathbb{1}_{ij}^{noobj}\) are indicators for cells containing/not containing an object, \(g^{pre}\) is the predicted box, \(g^{truth}\) is the ground truth, and \(\lambda_{noobj}\) is a weighting factor for cells without objects.
3. Classification Loss (\(L_{cls}\)): This is the standard cross-entropy loss for multi-class classification (e.g., defect types: string drop, surface burn, string tilt, surface crack).
$$L_{cls} = \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \sum_{c \in classes} – \left[ \hat{p}_{ij}(c) \log(p_{ij}(c)) + (1-\hat{p}_{ij}(c)) \log(1-p_{ij}(c)) \right]$$
Where \(\hat{p}_{ij}(c)\) is the ground truth probability (0 or 1) for class \(c\) in the \(j\)-th anchor of the \(i\)-th cell, and \(p_{ij}(c)\) is the predicted probability.
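For reference, the following is a minimal PyTorch sketch of the CIoU component for boxes given in \((x_1, y_1, x_2, y_2)\) format; it follows the equations above but is not the authors' exact implementation. The objectness and classification terms described above would be added analogously using standard binary cross-entropy (e.g., `torch.nn.BCEWithLogitsLoss`).

```python
import math
import torch

def ciou_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    """CIoU loss for boxes given as (x1, y1, x2, y2), shape (N, 4)."""
    # Overlap (IoU) term
    inter_w = (torch.min(pred[:, 2], target[:, 2]) - torch.max(pred[:, 0], target[:, 0])).clamp(min=0)
    inter_h = (torch.min(pred[:, 3], target[:, 3]) - torch.max(pred[:, 1], target[:, 1])).clamp(min=0)
    inter = inter_w * inter_h
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)

    # Center-distance term: rho^2(b, b_gt) / c^2
    rho2 = ((pred[:, 0] + pred[:, 2] - target[:, 0] - target[:, 2]) ** 2
            + (pred[:, 1] + pred[:, 3] - target[:, 1] - target[:, 3]) ** 2) / 4
    cw = torch.max(pred[:, 2], target[:, 2]) - torch.min(pred[:, 0], target[:, 0])
    ch = torch.max(pred[:, 3], target[:, 3]) - torch.min(pred[:, 1], target[:, 1])
    c2 = cw ** 2 + ch ** 2 + eps

    # Aspect-ratio consistency term v and its trade-off weight alpha
    wp = (pred[:, 2] - pred[:, 0]).clamp(min=eps)
    hp = (pred[:, 3] - pred[:, 1]).clamp(min=eps)
    wt = (target[:, 2] - target[:, 0]).clamp(min=eps)
    ht = (target[:, 3] - target[:, 1]).clamp(min=eps)
    v = (4 / math.pi ** 2) * (torch.atan(wt / ht) - torch.atan(wp / hp)) ** 2
    alpha = v / (1 - iou + v + eps)

    return (1 - iou + rho2 / c2 + alpha * v).mean()

# Usage: identical boxes give a loss of approximately zero
box = torch.tensor([[10.0, 10.0, 50.0, 60.0]])
print(ciou_loss(box, box))   # ~0.0
```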
1.4 Automated Defect Identification Architecture
The final stage implements the automated recognition pipeline. The enhanced UAV drone image, after feature extraction via the CA-augmented backbone network (e.g., a modified CSPDarknet), produces a hierarchy of feature maps at different scales. These feature maps are fed into a detection neck (such as a Path Aggregation Network, PANet) for multi-scale feature fusion, crucial for detecting defects of varying sizes in UAV drone imagery.
At a fundamental convolutional layer in the network, the operation for generating an output feature map \(Y_i\) from input \(Y_{i-1}\) can be expressed as:
$$Y_i = Y_{i-1} \circledast \kappa_j^i + \lambda_j^i$$
where \(\circledast\) denotes the convolution operation, \(\kappa_j^i\) is the convolutional kernel, and \(\lambda_j^i\) is the bias term. To upsample feature maps and recover spatial resolution (essential for precise localization of small defects), transposed convolution (or a similar operation such as nearest-neighbor upsampling followed by convolution) is employed. Like the standard convolutional kernels, the upsampling kernels are learned by backpropagating the gradients of the composite loss function, so the recovered high-resolution features are tuned for accurate defect prediction in the final output layers of the network.
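As an illustrative sketch of these building blocks, the following PyTorch module couples a learned transposed-convolution upsampling step with a standard convolution to fuse a coarse, semantically rich feature map with a finer-resolution one, in the spirit of a PANet-style neck; the channel counts, batch normalization, and SiLU activation are assumptions for the example, not the exact configuration of the proposed network.

```python
import torch
import torch.nn as nn

class UpsampleFusion(nn.Module):
    """One neck-style fusion step: upsample a deeper, coarser feature map with a
    learned transposed convolution and fuse it with a shallower, finer map."""
    def __init__(self, deep_ch: int = 256, shallow_ch: int = 128):
        super().__init__()
        # Learned 2x upsampling (transposed convolution)
        self.up = nn.ConvTranspose2d(deep_ch, shallow_ch, kernel_size=2, stride=2)
        # Standard convolution Y_i = Y_{i-1} (*) kappa + lambda, with normalization and activation
        self.fuse = nn.Sequential(
            nn.Conv2d(shallow_ch * 2, shallow_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(shallow_ch),
            nn.SiLU(),
        )

    def forward(self, deep: torch.Tensor, shallow: torch.Tensor) -> torch.Tensor:
        up = self.up(deep)                                 # (B, shallow_ch, 2H, 2W)
        return self.fuse(torch.cat([up, shallow], dim=1))  # concatenate and fuse

# Usage: fuse a 20x20 deep map into a 40x40 shallow map
deep = torch.randn(1, 256, 20, 20)
shallow = torch.randn(1, 128, 40, 40)
print(UpsampleFusion()(deep, shallow).shape)   # torch.Size([1, 128, 40, 40])
```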
2. Experimental Validation and Performance Analysis
To validate the proposed methodology, a dedicated insulator defect dataset was constructed. It combines images sourced from public repositories with original imagery captured by a UAV drone configured for grid inspection. The UAV drone was equipped with a high-resolution aerial survey camera and a real-time kinematic (RTK) positioning system for geotagging. Flight parameters were set to ensure high overlap: a forward overlap of 70% and a side overlap of 40%, maximizing coverage and 3D reconstruction potential. The final curated dataset comprised 1,066 images containing insulators with various defects, faithfully representing the characteristics of real-world UAV drone surveys.
Training and Validation Convergence: The model was trained using the composite loss function. The training and validation loss curves demonstrated stable convergence without signs of overfitting, indicating the robustness of the learning process for the UAV drone image data. Table 1 summarizes the distribution of defect instances across the five training batches.
Table 1. Defect instance counts per training batch.

| Training Batch | String Drop | Surface Burn | String Tilt | Surface Crack | Total Defects |
|---|---|---|---|---|---|
| Batch 1 | 19 | 34 | 45 | 26 | 124 |
| Batch 2 | 21 | 19 | 17 | 6 | 63 |
| Batch 3 | 51 | 26 | 36 | 67 | 180 |
| Batch 4 | 101 | 84 | 96 | 50 | 331 |
| Batch 5 | 98 | 91 | 73 | 106 | 368 |
Evaluation Metrics: Performance was assessed using standard object detection metrics: Precision (\(P\)), Recall (\(R\)), and Average Precision (\(AP\)) calculated for an Intersection over Union (IoU) threshold of 0.5. Mean Average Precision (mAP) was also calculated at a stricter IoU threshold of 0.95 (denoted mAP@0.95) to evaluate localization accuracy rigorously.
$$
P = \frac{TP}{TP + FP}, \quad R = \frac{TP}{TP + FN}, \quad AP = \int_{0}^{1} P(R) dR
$$
Where \(TP\) are true positives (correctly identified defects), \(FP\) are false positives, and \(FN\) are false negatives (missed defects).
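For reference, the following NumPy sketch computes \(P\), \(R\), and \(AP\) (the area under the P-R curve) for a single class from confidence-ranked detections; it assumes that matching of detections to ground-truth defects at IoU \(\geq 0.5\) has already been performed.

```python
import numpy as np

def precision_recall_ap(scores, is_tp, num_gt):
    """P, R and AP for one class. `scores`: detection confidences; `is_tp`: 1 if the
    detection matched a ground-truth defect at IoU >= 0.5, else 0; `num_gt`: total
    number of ground-truth defects (TP + FN)."""
    order = np.argsort(-np.asarray(scores, dtype=float))   # rank detections by confidence
    tp = np.cumsum(np.asarray(is_tp, dtype=float)[order])
    fp = np.cumsum(1.0 - np.asarray(is_tp, dtype=float)[order])
    recall = tp / max(num_gt, 1)                           # R = TP / (TP + FN)
    precision = tp / np.maximum(tp + fp, 1e-9)             # P = TP / (TP + FP)
    # AP = integral of P(R) dR, using the monotonically decreasing precision envelope
    prec_env = np.maximum.accumulate(precision[::-1])[::-1]
    ap = np.sum(np.diff(np.concatenate(([0.0], recall))) * prec_env)
    return precision, recall, ap

# Toy usage: 4 detections against 3 ground-truth defects
_, _, ap = precision_recall_ap([0.9, 0.8, 0.7, 0.4], [1, 1, 0, 1], num_gt=3)
print(round(float(ap), 3))
```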
Performance on Defect Categories: The proposed method achieved high AP scores across all defect types on the UAV drone image dataset, demonstrating its effectiveness. The Precision-Recall (P-R) curves for each defect class showed large area-under-curve values.
Table 2. Per-class AP at an IoU threshold of 0.5 and the primary recognition challenge in UAV imagery.

| Defect Type | AP@0.5 | Primary Challenge in UAV Imagery |
|---|---|---|
| Surface Burn | 0.988 | Small area, low contrast with background |
| String Drop | 0.955 | Irregular shape, often partial occlusion |
| String Tilt | 0.965 | Geometric feature requiring context |
| Surface Crack | 0.975 | Thin, elongated, low-pixel targets |
Comparative Analysis: The proposed method was benchmarked against several state-of-the-art and relevant defect recognition algorithms using the mAP@0.95 metric on the self-built UAV drone insulator dataset. This stringent metric highlights the method’s superiority in precise localization.
Table 3. Comparison of recognition algorithms by mAP@0.95 on the self-built UAV drone insulator dataset.

| Recognition Algorithm | mAP@0.95 (%) | Key Characteristics |
|---|---|---|
| Proposed Method (CA-Enhanced) | 99.2 | Uses frequency-domain enhancement and Coordinate Attention for small defects. |
| ER-YOLO Based Algorithm | 91.3 | Focuses on efficient receptive fields. |
| SSIM-Sobel & Multi-Feature Fusion | 89.9 | Relies on handcrafted texture and edge features. |
| Lightweight Target Detection Algorithm | 92.5 | Optimized for speed, may sacrifice some accuracy on small targets. |
The results in Table 3 clearly show that the proposed methodology outperforms the others by a significant margin, achieving an mAP@0.95 of 99.2%. This indicates its exceptional ability not just to detect defects but to localize them with very high precision in UAV drone imagery, a critical requirement for automated inspection systems. Qualitative analysis further confirms that while all methods perform reasonably well on obvious defects like complete string drops, the proposed CA-enhanced method is markedly more effective at identifying subtle defects such as small surface burns and fine cracks against complex, cluttered backgrounds, a common scenario in real UAV drone flights over grid infrastructure.
3. Discussion and Future Perspectives
The proposed framework successfully integrates several advanced computer vision techniques to address the specific challenges of defect recognition in UAV drone remote sensing images. The frequency-domain enhancement acts as a powerful preprocessing step to compensate for environmental degradation. The integration of the Coordinate Attention mechanism is pivotal; it allows the network to focus computational resources on spatially relevant regions, effectively amplifying the signal from small, pixel-scarce defects that are otherwise dominated by the background in a typical UAV drone image. The composite loss function ensures balanced learning across localization, classification, and confidence estimation tasks.
The experimental validation on a purpose-built dataset proves the method’s practical efficacy. The high mAP scores, especially under the strict IoU=0.95 criterion, demonstrate its potential for deployment in real-world automated inspection systems where accurate defect delineation is as important as detection. The method shows robustness to complex backgrounds, a direct benefit of the attention mechanism learning to suppress irrelevant image regions captured by the UAV drone during its flight path.
Future work will focus on several enhancements. First, the framework will be extended to a real-time implementation suitable for onboard processing on the UAV drone itself, enabling immediate defect alerts during flight. Second, temporal analysis will be incorporated by processing video sequences from UAV drones to track defect progression over time. Third, the defect taxonomy will be expanded to cover other critical grid assets such as conductors, connectors, and towers, with testing under a wider variety of environmental conditions (e.g., heavy fog, rain, snow) to further stress-test the robustness of the UAV drone-based recognition system. Finally, self-supervised or semi-supervised learning techniques will be explored to reduce the dependency on large, meticulously labeled datasets, which are often a bottleneck in applying deep learning to specialized domains like UAV drone inspection.
In conclusion, this methodological framework presents a comprehensive, high-performance solution for the automated identification of defects in power grid equipment using UAV drone remote sensing imagery. By synergistically combining image enhancement, attention-based feature extraction, and a multi-task loss, it effectively overcomes the challenges of small target size and complex backgrounds, paving the way for safer, more efficient, and intelligent grid maintenance operations.
