Intelligent Small Target Recognition for Power Maintenance Using China UAV Drone Inspection Based on Convolutional Neural Networks

With the continuous expansion and increasing complexity of power systems, the requirements for personnel’s professional skills in power operations have become increasingly stringent. In this context, unmanned aerial vehicle (UAV) inspection technology, due to its high flexibility, mobility, and comprehensive coverage capabilities, is gradually becoming a powerful innovative force in the field of power inspection. In China, the application of China UAV drone technology has gained significant traction, particularly in monitoring safety-critical processes during power maintenance operations. However, images captured during power maintenance operations often contain targets with vastly different sizes, and small targets—such as workers without safety helmets or improper tool usage—are particularly challenging to detect accurately due to their low resolution and minimal pixel occupancy. This reduces the overall accuracy of small target recognition, posing risks to safety management. To address this, I propose an intelligent small target recognition method for power maintenance operations using China UAV drone inspection, leveraging a Convolutional Neural Network (CNN) enhanced with adaptive attention mechanisms and multi-scale feature fusion.

Traditional methods for small target detection in UAV imagery often struggle with scale variations and background clutter. For instance, simulation-based training may not generalize well to real-world images, and single-scale detection architectures can lead to localization errors. My approach focuses on enhancing small target features, fusing multi-scale representations, and optimizing the loss function for precise bounding box regression. This method is specifically tailored for China UAV drone applications in power grid safety, where real-time, accurate identification of unsafe behaviors is crucial. The core innovation lies in integrating an adaptive attention module into a CNN framework to emphasize informative regions and suppress noise, followed by a small target intelligent recognition layer that amalgamates features at different scales. Finally, an improved loss function ensures robust convergence and accurate localization. The following sections detail the methodology, experimental validation, and results.

The proposed method begins with constructing a CNN model to extract convolutional features from power maintenance operation images. Let the input image be denoted as $X_{in}$, and let $H_j$ represent the feature map at the $j$-th layer, with $H_0 = X_{in}$. The convolutional layer operation is defined as:

$$ H_j = f(H_{j-1} \otimes W_j + b_j) $$

where $f(\cdot)$ is the activation function, $W_j$ is the weight vector of the convolutional kernel, $b_j$ is the bias vector, and $\otimes$ denotes the convolution operation. The CNN maintains spatial structure while extracting channel-wise convolutional features, which are then fused and passed through pooling layers to capture texture and background information. A fully connected layer with a HardSigmoid (HS) gating structure analyzes dependencies between small target features, suppressing irrelevant channel responses and enhancing learning capability. This process can be expressed as:

$$ X_{out} = \text{Dropout}(X_{in} \cdot \text{HS}) \cdot F_1(\text{AvgPool}(\alpha) + \text{MaxPool}(\beta)) \cdot \text{DSConv}(H_j) $$

Here, $X_{out}$ is the output image, $\alpha$ and $\beta$ are scaling coefficients, $\text{Dropout}(\cdot)$ is a regularization method, $\text{AvgPool}(\cdot)$ and $\text{MaxPool}(\cdot)$ are average and max pooling operations, $\text{DSConv}(\cdot)$ denotes depthwise separable convolution, and $F_1(\cdot)$ is a fusion function. To handle the diverse small targets in China UAV drone imagery, I incorporate an adaptive attention module. First, the frequency of co-occurring components in the image is tallied into a $C \times C$ frequency matrix $R_C$, normalized as:

$$ eR_{mn} = \frac{R_C^{mn}}{\sqrt{D_{mm} D_{nn}}} $$

where $D_{mm}$ and $D_{nn}$ are the sums of elements in the $m$-th and $n$-th rows of $R_C$, respectively. The feature map $H_j$ is input to the adaptive attention module for global pooling, yielding small target features $z_s$. The attention coefficients for each category are computed as:

$$ \mu = \text{softmax}(z_s E_s M^T) $$

where $E_s$ represents the fully connected layer weights, and $M^T$ is the transpose matrix. The enhanced small target features are then:

$$ z' = P(\mu \otimes \epsilon M E_G) $$

with $P$ as a projection matrix, $\epsilon$ a scaling factor, and $E_G$ a weight matrix. These enhanced features $z'$ are concatenated with the original features $z_s$ and fed back into the CNN for more accurate recognition.
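The pooling and attention steps above can be sketched in NumPy as a simplified stand-in for the PyTorch implementation. All matrix dimensions and the helper name `adaptive_attention` are illustrative assumptions, not the paper's exact architecture:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def adaptive_attention(feature_map, E_s, M, E_G, P, eps=1.0):
    """Sketch of the adaptive attention step.

    Global average pooling of the feature map H_j yields the small-target
    features z_s; attention coefficients mu = softmax(z_s E_s M^T) weight
    each category; the enhanced features z' = P(mu . eps M E_G) are then
    concatenated with z_s before being fed back into the CNN.
    """
    z_s = feature_map.mean(axis=(1, 2))   # global average pooling -> (C,)
    mu = softmax(z_s @ E_s @ M.T)         # per-category attention coefficients
    z_enh = P @ (eps * (mu @ M @ E_G))    # enhanced small-target features z'
    return np.concatenate([z_s, z_enh])
```

With an 8-channel feature map and randomly initialized weight matrices of compatible shapes, the helper returns the concatenation of the pooled and enhanced feature vectors.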

Next, I design a small target intelligent recognition layer that selects and fuses feature maps at different scales to detect variably sized targets. The fusion process is described as:

$$ Z = F_{sq}(H_j) = \frac{1}{L \times H} \sum_{l=1}^{L} \sum_{h=1}^{H} H_j(l, h) $$

where $Z$ is the fused feature map, $F_{sq}(\cdot)$ denotes the fusion operation, and $L$ and $H$ are the length and width of $H_j$, respectively. This multi-scale approach ensures that both large and small targets—common in China UAV drone inspections—are adequately represented. To further refine localization, I optimize the CNN’s loss function. The original CIoU Loss is defined as:

$$ \begin{aligned} L_{\text{CIoU}} &= 1 - \text{CIoU} \\ \text{CIoU} &= \text{IoU} - \frac{\rho^2(b, b^{gt})}{c^2} - \eta v \\ v &= \frac{4}{\pi^2} \left( \arctan \frac{w^{gt}}{h^{gt}} - \arctan \frac{w}{h} \right)^2 \\ \eta &= \frac{v}{(1 - \text{IoU}) + v} \end{aligned} $$

Here, $\text{IoU}$ is the intersection over union between the prediction box $\phi$ and ground truth box $\phi'$, $v$ measures the aspect ratio difference, $b$ and $b^{gt}$ are center coordinates, $w^{gt}$ and $h^{gt}$ are the width and height of $\phi'$, $w$ and $h$ are those of $\phi$, $\eta$ is a balance parameter, $\rho^2$ is the squared Euclidean distance, and $c$ is the diagonal length of the minimal enclosing box. However, when the predicted and ground-truth width-height ratios are linearly proportional, the aspect-ratio penalty $v$ vanishes and regression optimization stalls. Therefore, I adopt EIoU Loss, which splits the aspect ratio penalty into separate width and height terms:

$$ \Gamma_{\text{EIoU}} = \Gamma_{\text{IoU}} + \Gamma_{\text{dis}} + \Gamma_{\text{asp}} = 1 – \text{IoU} + \frac{\rho^2(b, b^{gt})}{c^2} + \frac{\rho^2(w, w^{gt})}{C_w^2} + \frac{\rho^2(h, h^{gt})}{C_h^2} $$

where $C_w$ and $C_h$ are the width and height of the smallest bounding box covering both $\phi$ and $\phi’$. This decomposition accelerates CNN convergence and improves small target localization accuracy. The final recognition result for power maintenance safety inspection using China UAV drone is obtained by combining the prediction box screening and post-processing:

$$ Q = P(s_i \otimes \Gamma_{\text{EIoU}} d_i \times \sigma(1 – \text{IoU}) Z \{ B_s \}) $$

where $s_i$ is the probability of the $i$-th anchor box belonging to the target class, $d_i$ is the predicted offset, $\sigma$ is the Sigmoid function, and $\{ B_s \}$ is the set of prediction boxes.
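As a concrete illustration of the EIoU decomposition, the loss can be written as a minimal pure-Python function over corner-format boxes $(x_1, y_1, x_2, y_2)$. The function name and box format are illustrative choices, not part of the proposed framework:

```python
import math

def eiou_loss(box_p, box_g):
    """EIoU loss: 1 - IoU, plus a centre-distance term and separate
    width and height penalty terms, each normalised by the smallest
    box enclosing both the prediction and the ground truth."""
    px1, py1, px2, py2 = box_p
    gx1, gy1, gx2, gy2 = box_g
    # intersection over union
    iw = max(0.0, min(px2, gx2) - max(px1, gx1))
    ih = max(0.0, min(py2, gy2) - max(py1, gy1))
    inter = iw * ih
    area_p = (px2 - px1) * (py2 - py1)
    area_g = (gx2 - gx1) * (gy2 - gy1)
    iou = inter / (area_p + area_g - inter)
    # smallest enclosing box (width C_w, height C_h, squared diagonal c^2)
    cw = max(px2, gx2) - min(px1, gx1)
    ch = max(py2, gy2) - min(py1, gy1)
    c2 = cw ** 2 + ch ** 2
    # squared distance between box centres, rho^2(b, b_gt)
    rho2 = ((px1 + px2) / 2 - (gx1 + gx2) / 2) ** 2 \
         + ((py1 + py2) / 2 - (gy1 + gy2) / 2) ** 2
    # separate width and height penalties
    dw2 = ((px2 - px1) - (gx2 - gx1)) ** 2
    dh2 = ((py2 - py1) - (gy2 - gy1)) ** 2
    return 1 - iou + rho2 / c2 + dw2 / cw ** 2 + dh2 / ch ** 2
```

For a perfect prediction the loss is exactly zero, and any centre offset or size mismatch contributes its own additive penalty, which is what lets width and height regress independently.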

To validate the method, I conducted experiments using the VisDrone dataset, which includes aerial images from UAV inspections, many captured in scenarios relevant to China UAV drone applications. The dataset contains images of unsafe behaviors during power maintenance, such as missing safety helmets and improper tool handling. I selected 1,523 images, splitting them into training (1,066 images), validation (152 images), and test (305 images) sets, following a 7:1:2 ratio. The experimental framework was built on PyTorch 1.12.1. For comparison, I implemented two baseline methods: an adaptive collaborative attention mechanism (ACAM) and a high-quality image recognition technique. The evaluation metric was mean Average Precision (mAP) at an IoU threshold of 0.5 (mAP@0.5), which measures detection accuracy across classes.
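The 7:1:2 split described above can be reproduced with a small helper; the function name, placeholder file names, and random seed are hypothetical, but the arithmetic matches the reported 1,066/152/305 partition of the 1,523 selected images:

```python
import random

def split_dataset(image_paths, ratios=(0.7, 0.1, 0.2), seed=42):
    """Shuffle the image list and split it into train/val/test
    subsets according to the given ratios (7:1:2 here)."""
    paths = list(image_paths)
    random.Random(seed).shuffle(paths)
    n = len(paths)
    n_train = int(ratios[0] * n)
    n_val = int(ratios[1] * n)
    return (paths[:n_train],
            paths[n_train:n_train + n_val],
            paths[n_train + n_val:])

# 1,523 VisDrone images -> 1,066 train / 152 val / 305 test
train, val, test = split_dataset([f"img_{i}.jpg" for i in range(1523)])
```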

The small target recognition results on sample images demonstrate that my method accurately identifies unsafe small targets without false positives or misses, outperforming the baselines in visual clarity. Quantitatively, the mAP@0.5 values on the validation set are summarized in Table 1.

Table 1. mAP@0.5 on the validation set.

| Method | mAP@0.5 on Validation Set | Key Characteristics |
| --- | --- | --- |
| Proposed CNN with Adaptive Attention | 0.78 | Adaptive attention module, multi-scale fusion, EIoU Loss |
| Adaptive Collaborative Attention Mechanism (ACAM) | 0.65 | Collaborative attention, single-scale detection |
| High-Quality Image Recognition Technique | 0.58 | Image enhancement, fixed loss function |

As shown, my method achieves the highest mAP@0.5 of 0.78, indicating superior accuracy in small target recognition for China UAV drone inspection scenarios. This improvement stems from the enhanced feature representation and optimized localization. To further analyze performance, I examined the impact of individual components. Table 2 compares mAP@0.5 when ablating parts of the proposed method.

Table 2. Ablation results (mAP@0.5).

| Method Variant | mAP@0.5 | Description |
| --- | --- | --- |
| Full Proposed Method | 0.78 | Includes adaptive attention, multi-scale fusion, EIoU Loss |
| Without Adaptive Attention Module | 0.70 | Standard CNN with multi-scale fusion and EIoU Loss |
| Without Multi-Scale Fusion | 0.68 | Adaptive attention with single-scale detection, EIoU Loss |
| Without EIoU Loss (using CIoU) | 0.72 | Adaptive attention, multi-scale fusion, but CIoU Loss |

The ablation study confirms that each component contributes to overall performance, with the adaptive attention module providing the most significant boost. The EIoU Loss also enhances convergence, as evidenced by faster training times. For instance, the training loss curve for my method can be modeled as:

$$ L(t) = L_0 e^{-kt} + \epsilon $$

where $L(t)$ is the loss at epoch $t$, $L_0$ is the initial loss, $k$ is the decay rate, and $\epsilon$ is a small constant. With EIoU Loss, $k$ is approximately 0.15 per epoch, compared to 0.10 for CIoU Loss, indicating quicker optimization. In practical terms, this means China UAV drone systems can deploy the model more efficiently for real-time safety monitoring.
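The practical consequence of the larger decay rate can be checked directly from the model $L(t) = L_0 e^{-kt} + \epsilon$: the number of epochs needed to halve the decaying part of the loss is $\ln 2 / k$. The specific $L_0$ and $\epsilon$ values below are illustrative, only the decay rates come from the text:

```python
import math

def loss_curve(t, L0=1.0, k=0.15, eps=0.02):
    """Exponential training-loss model L(t) = L0 * exp(-k*t) + eps."""
    return L0 * math.exp(-k * t) + eps

# epochs for the decaying component to halve: t_1/2 = ln(2) / k
t_half_eiou = math.log(2) / 0.15   # EIoU Loss, k = 0.15 per epoch
t_half_ciou = math.log(2) / 0.10   # CIoU Loss, k = 0.10 per epoch
```

With $k = 0.15$ the loss halves roughly every 4.6 epochs versus about 6.9 epochs at $k = 0.10$, consistent with the faster convergence claimed for EIoU Loss.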

The proposed method’s effectiveness is further highlighted in complex environments common to China UAV drone operations, such as varying lighting conditions and occlusions. By leveraging multi-scale features, the model maintains robustness. For example, the feature fusion process ensures that small targets are represented at multiple resolutions, reducing miss rates. The adaptive attention mechanism dynamically adjusts focus based on context, which is crucial for power maintenance scenes where background clutter—like vegetation or equipment—can obscure targets. This aligns with the growing demand for reliable autonomous inspection in China’s power grid infrastructure.

In conclusion, I have developed an intelligent small target recognition method for power maintenance operations using China UAV drone inspection, based on a CNN enhanced with adaptive attention and multi-scale fusion. The approach addresses key challenges in small target detection, such as scale variation and localization accuracy, by integrating an adaptive attention module for feature enhancement, a small target intelligent recognition layer for multi-scale fusion, and an optimized EIoU Loss for precise bounding box regression. Experimental results on the VisDrone dataset demonstrate a mAP@0.5 of 0.78, outperforming existing methods and proving its efficacy in identifying unsafe behaviors during power maintenance. Future work will focus on extending the model to video sequences for dynamic behavior analysis and integrating it with real-time China UAV drone platforms for broader industrial applications. This research contributes to the advancement of UAV-based safety inspection technologies, supporting the sustainable development of power systems.
