In modern power systems, ensuring the safety and efficiency of maintenance operations is critical. Traditional manual inspections often expose workers to high-risk environments, leading to potential accidents and inefficiencies. To address this, we leverage Unmanned Aerial Vehicles (UAVs), specifically JUYE UAV models, equipped with high-resolution cameras for intelligent inspection. These UAVs enable large-scale, high-efficiency checks of power maintenance areas in minimal time. However, existing methods face challenges such as low accuracy in target detection due to feature confusion and limited multi-scale information extraction. For instance, some approaches incorporate attention mechanisms but struggle with overlapping targets, while others rely on single-scale features, increasing detection errors. In this work, we propose a novel target detection method based on YOLOv5 for UAV-based intelligent inspection in power maintenance safety operations. Our approach integrates attention mechanisms, dynamic anchor box adjustments, and optimized loss functions to enhance detection precision and adaptability in complex environments.

The core of our method lies in effectively processing images captured by Unmanned Aerial Vehicles during power maintenance operations. We begin by applying an attention mechanism to extract weighted features from the input images. This involves performing global average pooling on the feature maps while maintaining the channel dimensions unchanged. A one-dimensional convolution is then used to build feature extraction channels, with the convolution kernel size determined based on the channel dimension. Let $C$ represent the number of channels; the convolution kernel size $k$ is calculated as:
$$ k = \Psi(C) = \left| \frac{\log_2 C}{a} + b \right|_{\text{odd}} $$
where $a$ and $b$ are linear coefficients, and $\Psi(C)$ defines the mapping from the channel dimension to the kernel size. The $|\cdot|_{\text{odd}}$ operation rounds the result to the nearest odd integer so that the kernel is symmetric. To dynamically adjust weights across different feature scales, our attention mechanism incorporates multiple feature layers that capture both global and local information. For an input image $I$, the weighted feature $O_i$ for the $i$-th image is obtained through attention weight normalization:
$$ O_i = \sum_{j=1}^{m} \frac{W_j \left( k * I_i \right)}{\varepsilon + \sum_{l=1}^{m} W_l} $$
Here, $W_j$ denotes the learned weight of the $j$-th feature layer, normalized to [0, 1] by a Sigmoid activation, $\varepsilon$ is a small constant that keeps the denominator non-zero, and $*$ represents a separable convolution operation. This step fuses features to enhance model expressiveness and stability. The fused feature $p_{\text{out}}$ for UAV-based power maintenance targets is derived as:
$$ p_{\text{out}} = \text{Conv}\left( W \oplus p_{\text{in}} + W_j \times \frac{\text{Resize}(O_i)}{\varepsilon + (1 - W_j)} \right) $$
where $p_{\text{in}}$ is the input vector (i.e., the weighted feature $O_i$), $W$ is the fusion weight, and $\text{Resize}$ adjusts the feature scale to a common resolution. The symbol $\oplus$ indicates element-wise multiplication, ensuring seamless integration of multi-scale features.
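To make the attention and fusion steps concrete, the following PyTorch sketch illustrates one plausible realization: an ECA-style channel attention whose 1D convolution kernel size is derived from the channel count, together with a normalized weighted fusion of resized feature maps. The module names, the coefficients `a = 2.0` and `b = 1.0`, and the `eps` value are illustrative assumptions rather than the exact configuration described above.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class ChannelAttention(nn.Module):
    """Channel attention with an adaptive 1D-conv kernel (ECA-style sketch)."""

    def __init__(self, channels: int, a: float = 2.0, b: float = 1.0):
        super().__init__()
        # k = |log2(C)/a + b|_odd : round to the nearest odd integer.
        k = int(abs(math.log2(channels) / a + b))
        k = k if k % 2 == 1 else k + 1
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, C, H, W); global average pooling keeps the channel dimension.
        w = F.adaptive_avg_pool2d(x, 1)                 # (N, C, 1, 1)
        w = self.conv(w.squeeze(-1).transpose(1, 2))    # 1D conv across channels
        w = torch.sigmoid(w.transpose(1, 2).unsqueeze(-1))
        return x * w                                    # weighted feature O_i


def weighted_fusion(features, weights, eps: float = 1e-4):
    """Normalized weighted fusion of feature maps resized to a common scale."""
    target = features[0].shape[-2:]
    weights = torch.relu(weights)          # keep the per-layer weights non-negative
    norm = weights.sum() + eps             # eps avoids division by zero
    fused = sum(
        w / norm * F.interpolate(f, size=target, mode="nearest")
        for w, f in zip(weights, features)
    )
    return fused
```

In practice the per-layer weights $W_j$ would be registered as learnable parameters (e.g., an `nn.Parameter`) so that the network can adjust them during training.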
Next, we construct the YOLOv5 network architecture to precisely locate targets based on the fused features. Anchor points are used to generate candidate regions on the feature maps. During training, the distance $L$ between each anchor point $e$ and the real target center is computed to determine matching relationships:
$$ L = O_i \, E(e) \, W_j \sum_{j=1}^{m} \left( p_{\text{in}}^{j} - p_{\text{out}}^{j} \right)^2 $$
Here, $E$ represents the candidate region generation process. The anchor point $e_1$ closest to the target center $K$ is selected as the candidate sample, and the overlap $\text{GIoU}$ with the real target is calculated to define the detection position:
$$ e_1 = M \left\{ L - K \right\}, \quad e \in \{ e_1, e_2, e_3, \ldots, e_x \} $$
$$ \text{GIoU} = \text{IoU} - \frac{\left| C - \left( p_{\text{in}}^{j} \cup p_{\text{out}}^{j} \right) \right|}{|C|} $$
In these equations, $M$ denotes the YOLOv5 grid unit, and $C$ is the minimum enclosing region of the detection box and the real box. A GIoU value closer to 1 indicates higher alignment between the detection box and the real box. This process enables accurate target localization in power maintenance scenarios using Unmanned Aerial Vehicles.
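For reference, the GIoU overlap between a detection box and a real box can be computed as in the following sketch, which uses the standard formulation for axis-aligned boxes; the `(x1, y1, x2, y2)` box format and the small `eps` constant are assumptions made for illustration.

```python
def giou(box_a, box_b, eps: float = 1e-9):
    """GIoU for two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Intersection area of the two boxes.
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    iou = inter / (union + eps)

    # Smallest enclosing region C of both boxes.
    cx1, cy1 = min(box_a[0], box_b[0]), min(box_a[1], box_b[1])
    cx2, cy2 = max(box_a[2], box_b[2]), max(box_a[3], box_b[3])
    c_area = (cx2 - cx1) * (cy2 - cy1)

    return iou - (c_area - union) / (c_area + eps)
```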
To further optimize detection performance, we introduce a composite loss function that addresses limitations of the default IoU loss, which may yield zero gradients when prediction and real boxes do not overlap. The preliminary loss value $F_{\text{GIoU}}$ is given by:
$$ F_{\text{GIoU}} = 1 - \left[ \text{FIoU} - \frac{\rho^2(b, b_t)}{c^2} - \alpha v \right] $$
$$ v = \frac{4}{\pi^2} \left( \arctan \frac{w_{\text{gt}}}{h_{\text{gt}}} - \arctan \frac{w}{h} \right)^2 $$
$$ \alpha = \frac{v}{(1 - \text{FIoU}) + v} $$
where $\text{FIoU}$ is the intersection over union between the detection box and the real box, $\rho^2(b, b_t)$ is the squared distance between the centers of the detection box $b$ and the real box $b_t$, $c$ is the diagonal length of the minimum enclosing rectangle of the prediction and real boxes, $w_{\text{gt}}$ and $h_{\text{gt}}$ are the width and height of the real box, and $w$ and $h$ are those of the detection box. The term $v$ acts as an aspect-ratio consistency metric. We then incorporate localization loss $\text{Loss}_{\text{GIoU}}$, confidence loss $\text{Loss}_{\text{conf}}$, and classification loss $\text{Loss}_{\text{class}}$ into YOLOv5, with the total loss defined as $\text{Loss} = \text{Loss}_{\text{GIoU}} + \text{Loss}_{\text{conf}} + \text{Loss}_{\text{class}}$.
The localization loss $\text{Loss}_{\text{GIoU}}$ constrains the detection box’s position and size to align closely with the real box:
$$ \text{Loss}_{\text{GIoU}} = 1 - \text{IoU} + \frac{\rho^2}{c^2} + \alpha v $$
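A minimal PyTorch sketch of this localization term is shown below; it follows the standard center-distance and aspect-ratio formulation given above, with the `(x1, y1, x2, y2)` tensor layout assumed for illustration.

```python
import math
import torch


def localization_loss(pred, target, eps: float = 1e-9):
    """Localization term: 1 - IoU + rho^2/c^2 + alpha * v.

    pred, target: tensors of shape (N, 4) in (x1, y1, x2, y2) format.
    """
    # IoU between the detection boxes and the real boxes.
    ix1 = torch.max(pred[:, 0], target[:, 0])
    iy1 = torch.max(pred[:, 1], target[:, 1])
    ix2 = torch.min(pred[:, 2], target[:, 2])
    iy2 = torch.min(pred[:, 3], target[:, 3])
    inter = (ix2 - ix1).clamp(0) * (iy2 - iy1).clamp(0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)

    # rho^2: squared distance between the box centers.
    pcx, pcy = (pred[:, 0] + pred[:, 2]) / 2, (pred[:, 1] + pred[:, 3]) / 2
    tcx, tcy = (target[:, 0] + target[:, 2]) / 2, (target[:, 1] + target[:, 3]) / 2
    rho2 = (pcx - tcx) ** 2 + (pcy - tcy) ** 2

    # c^2: squared diagonal of the minimum enclosing rectangle.
    cw = torch.max(pred[:, 2], target[:, 2]) - torch.min(pred[:, 0], target[:, 0])
    ch = torch.max(pred[:, 3], target[:, 3]) - torch.min(pred[:, 1], target[:, 1])
    c2 = cw ** 2 + ch ** 2 + eps

    # v: aspect-ratio consistency between the real and detection boxes.
    w, h = pred[:, 2] - pred[:, 0], pred[:, 3] - pred[:, 1]
    w_gt, h_gt = target[:, 2] - target[:, 0], target[:, 3] - target[:, 1]
    v = (4 / math.pi ** 2) * (
        torch.atan(w_gt / (h_gt + eps)) - torch.atan(w / (h + eps))
    ) ** 2
    alpha = v / ((1 - iou) + v + eps)

    return (1 - iou + rho2 / c2 + alpha * v).mean()
```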
The confidence loss $\text{Loss}_{\text{conf}}$ evaluates the model’s certainty about the presence of a target in the detection box:
$$ \text{Loss}_{\text{conf}} = \sum_{i=0}^{s} \sum_{j=0}^{B} K \left[ -\log_2 p + \text{BCE}(n', n) \right] $$
$$ \text{BCE}(n', n) = -n' \log_2 n - (1 - n') \log_2 (1 - n) $$
where $s$ is the grid index on the feature map, $B$ is the number of detection boxes, $p$ is the probability of the target falling within the detection box, and $\text{BCE}(n', n)$ is the binary cross-entropy between the detected confidence $n'$ and the predicted confidence $n$. The classification loss $\text{Loss}_{\text{class}}$ assesses the accuracy of target category predictions:
$$ \text{Loss}_{\text{class}} = -\sum_{i=0}^{s} \sum_{j=0}^{B} I_{\text{obj}}^{i,j} \left[ n' \log_2 n + (1 - n') \log_2 (1 - n) \right] $$
When $\text{GIoU} \geq 0.9$ and $\text{Loss}_{\text{class}} < 0.001$, the detection box is considered highly accurate, and the result is output. This comprehensive loss optimization ensures robust target detection in various power maintenance environments using JUYE UAV systems.
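Assembling the three terms, a hedged sketch of the composite loss is given below; `localization_loss` refers to the sketch above, and using `BCEWithLogitsLoss` for the confidence and classification terms (with raw logits as inputs) is an implementation assumption rather than the exact formulation used here.

```python
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()


def total_loss(pred_boxes, true_boxes, pred_conf, true_conf, pred_cls, true_cls):
    """Loss = Loss_GIoU + Loss_conf + Loss_class (composite sketch)."""
    loss_giou = localization_loss(pred_boxes, true_boxes)  # position/size term
    loss_conf = bce(pred_conf, true_conf)                  # objectness term
    loss_cls = bce(pred_cls, true_cls)                     # category term
    return loss_giou + loss_conf + loss_cls
```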
To validate our method, we conducted experiments in real-world power facility scenarios, including high-voltage transmission lines, substation equipment areas, and distribution rooms. We deployed Unmanned Aerial Vehicles equipped with the YOLOv5-based detection model for on-site inspections, simulating actual maintenance tasks. The testing environment involved diverse conditions such as low light, high exposure, multiple targets, and occlusions to evaluate the adaptability and precision of our approach. For instance, in scenarios with obscured power transmission towers, our method successfully identified key targets like workers’ hand positions and maintenance zones, demonstrating high detection accuracy. The results are summarized in the following table, which compares the performance of our method with two existing techniques: a UAV image-based power line detection method and a micro-LiDAR-based UAV power line target detection technology.
| Method | Average mIoU | Maximum FLOPs (×10^9) | Key Features |
|---|---|---|---|
| Proposed YOLOv5-based Method | ~0.8 | ~19 | Attention mechanism, dynamic anchor adjustment, multi-loss optimization |
| UAV Image-based Power Line Detection | ~0.6-0.7 | ~25 | Relies on image features, prone to errors in overlapping targets |
| Micro-LiDAR-based UAV Detection | ~0.5-0.6 | ~30 | Uses LiDAR points, limited multi-scale feature extraction |
The average intersection over union (mIoU) is a critical metric for assessing detection similarity to ground truth. Our method maintained a stable mIoU around 0.8 across all tested environments, indicating superior detection consistency. In contrast, the comparative methods showed lower and more dispersed mIoU values, highlighting their limitations in handling complex scenes. This is further analyzed using the formula for mIoU stability $S$ defined as:
$$ S = \frac{1}{N} \sum_{i=1}^{N} \left( \text{mIoU}_i - \mu \right)^2 $$
where $N$ is the number of test cases, $\text{mIoU}_i$ is the mIoU for the $i$-th case, and $\mu$ is the mean mIoU. Our method achieved a lower $S$ value, confirming its robustness. Additionally, we evaluated computational complexity in terms of floating-point operations (FLOPs). Our approach required a maximum of approximately $19 \times 10^9$ FLOPs, significantly fewer than the other methods, which exceeded $25 \times 10^9$ FLOPs. This efficiency stems from the optimized loss functions and feature fusion, which reduce redundant computations. The FLOPs for a given data size $D$ can be modeled as:
$$ \text{FLOPs} = D \times \left( \sum \text{operations per layer} \right) $$
Our method minimizes this through layered optimization, ensuring scalable performance for large-scale inspections with Unmanned Aerial Vehicles like JUYE UAV.
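For completeness, the stability metric $S$ is simply the variance of the per-case mIoU values; the short sketch below uses illustrative numbers, not measured results.

```python
import numpy as np

# Illustrative per-case mIoU values, not measured results.
miou = np.array([0.81, 0.79, 0.80, 0.82, 0.78])

mu = miou.mean()                    # mean mIoU
S = np.mean((miou - mu) ** 2)       # stability: variance of per-case mIoU
print(f"mean mIoU = {mu:.3f}, S = {S:.5f}")
```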
In summary, we have developed an intelligent inspection system using YOLOv5 for power maintenance safety operations with Unmanned Aerial Vehicles. By integrating attention mechanisms, dynamic anchor box adjustments, and a composite loss function, our method achieves high precision and efficiency in target detection. Experimental results demonstrate its superiority in maintaining stable mIoU values and low computational overhead, making it suitable for real-world applications. The use of JUYE UAV platforms enhances the practicality of this approach, providing a reliable solution for reducing risks in power maintenance. Future work will focus on extending this method to other critical infrastructure inspections and incorporating real-time processing capabilities for broader adoption.
