In the daily operation and maintenance of power systems, aerial work on overhead transmission lines carries significant risk because it is performed at height, and violations during such work directly threaten worker safety. Traditional monitoring in grid enterprises relies heavily on manual supervision, which is inefficient and cannot cover all work sites in real time, so violations are often not detected promptly. To promote the intelligent transformation of power operation safety management, there is a pressing need for intelligent methods that replace manual supervision while ensuring both management efficiency and accurate violation recognition. Researchers have proposed various methods for violation behavior recognition, but these often struggle with dynamic behaviors and multi-scale targets, leading to insufficient accuracy in complex scenarios. To address these issues, we propose a novel method based on spatio-temporal graph convolutional networks (ST-GCN) for recognizing violation behaviors in aerial work on overhead transmission lines using tethered UAV drones.
The core innovation of our approach lies in using tethered UAV drones for stable, high-quality image acquisition of aerial work scenes, combined with ST-GCN to capture the spatio-temporal features of human postures. This allows violation behaviors to be recognized from motion over time, overcoming the limitations of single-frame image analysis. In this article, we detail the methodology, experimental setup, and results. Tethered UAV drones are integral to our work, as they enable continuous monitoring and data collection in challenging environments.
Existing methods for violation behavior recognition often rely on computer vision techniques, such as YOLO-based models or machine vision systems. For instance, some approaches use lightweight networks like M3CFC-YOLOv7-tiny to reduce computational costs but sacrifice feature extraction capability in complex scenes. Others employ machine vision with laser devices to construct 3D dynamic images but struggle with multi-scale target adaptation. These limitations highlight the need for a method that can effectively capture dynamic behavioral characteristics. Our proposed method addresses this by integrating spatio-temporal graph convolutions, which analyze sequential images to model human motion patterns.
The proposed method consists of two main stages: spatio-temporal feature extraction using ST-GCN and violation behavior recognition through a classifier module. We first describe the data acquisition process using tethered UAV drones, followed by the mathematical formulation of the ST-GCN. Then, we present the classifier design and similarity computation for behavior categorization. Experimental analysis demonstrates the superiority of our method compared to existing approaches, with detailed results shown using tables and performance metrics.
Data Acquisition with Tethered UAV Drones
The use of UAV drones is critical for capturing high-altitude work images on overhead transmission lines. We employ tethered UAV drones, which are connected to ground communication systems via tethered cables, ensuring stable power supply and real-time data transmission. This setup allows for prolonged operation without battery limitations, making it ideal for continuous monitoring. The UAV drone is equipped with high-resolution cameras and LED lights for night-time surveillance, enabling image acquisition in various conditions. The following table summarizes the technical parameters of the tethered UAV drone used in our experiments.
| Parameter | Value |
|---|---|
| Flight Speed (m/s) | 1–12 |
| Empty Weight (kg) | 12 |
| Number of Rotors | 8 |
| Flight Distance (km) | ≤5 |
| Maximum Flight Altitude (m) | 5000 |
| Maximum Payload Weight (kg) | 20 |
| Hovering Accuracy (m) | Horizontal: ±0.2, Vertical: ±0.5 |
| Diagonal Axis Distance (mm) | 1550 |
| Maximum Wind Resistance Level | 7 |
| Maximum Rain Resistance (mm) | 8 |
The camera mounted on the UAV drone has specifications as shown in the table below, ensuring high-quality image capture for behavior analysis.
| Parameter | Value |
|---|---|
| Resolution | 4608(H) × 3072(V) |
| Sensor Type | CMOS |
| Sensor Size | 25.34 mm(H) × 16.90 mm(V) |
| Pixel Size | 5.5 µm × 5.5 µm |
| Frame Rate (fps) | 342.5 |
| ADC Precision | 12 bit |
| Signal-to-Noise Ratio (dB) | 41.6 |
| Exposure Time | 1 µs |
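
As a quick consistency check on the camera table, the sensor size should equal the pixel count multiplied by the pixel size. The short sketch below redoes that arithmetic with the values from the table; the function name `sensor_size_mm` is introduced here for illustration only.

```python
# Camera parameters copied from the table above.
H_PIXELS, V_PIXELS = 4608, 3072
PIXEL_PITCH_UM = 5.5  # micrometres per pixel side


def sensor_size_mm(n_pixels: int, pitch_um: float) -> float:
    """Active sensor length along one axis, in millimetres."""
    return n_pixels * pitch_um / 1000.0


width_mm = sensor_size_mm(H_PIXELS, PIXEL_PITCH_UM)
height_mm = sensor_size_mm(V_PIXELS, PIXEL_PITCH_UM)
print(f"{width_mm:.3f} mm x {height_mm:.3f} mm")
```

The computed values, 25.344 mm and 16.896 mm, agree with the listed sensor size of 25.34 mm × 16.90 mm after rounding.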

[Figure: a typical UAV drone used in such applications.]

Using this tethered UAV drone, we collect sequential images of aerial work scenes, which are then processed for feature extraction.
Spatio-Temporal Feature Extraction with Graph Convolutional Networks
To analyze the dynamic behavior of workers, we model human poses as spatio-temporal graphs. Each frame of the captured images is processed to detect human skeletal joints, which are represented as nodes in a graph. The connections between joints form edges, capturing the structural information of the human body. Formally, we define an undirected graph $G$ as:
$$G = (V, E)$$
where $V$ is the set of joint nodes (e.g., shoulders, elbows, knees) and $E$ is the set of edges connecting these nodes. This graph representation allows us to model both spatial relationships within a single frame and temporal relationships across consecutive frames.
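
To make the graph definition concrete, the following minimal sketch builds the adjacency matrix of a single-frame skeleton graph. The five-joint skeleton, the joint names, and the helper `build_adjacency` are hypothetical and chosen only for illustration; the source does not specify which joint set or pose estimator is used.

```python
import numpy as np

# Hypothetical 5-joint skeleton used only for illustration.
JOINTS = ["head", "neck", "left_shoulder", "right_shoulder", "hip"]
EDGES = [
    ("head", "neck"),
    ("neck", "left_shoulder"),
    ("neck", "right_shoulder"),
    ("neck", "hip"),
]


def build_adjacency(joints, edges):
    """Return the symmetric adjacency matrix of an undirected graph G = (V, E)."""
    index = {name: i for i, name in enumerate(joints)}
    adj = np.zeros((len(joints), len(joints)), dtype=np.float64)
    for a, b in edges:
        adj[index[a], index[b]] = 1.0
        adj[index[b], index[a]] = 1.0  # undirected: mirror every edge
    return adj


A = build_adjacency(JOINTS, EDGES)
```

Temporal edges, which link the same joint in consecutive frames, are not represented in this matrix; in a sketch like this they would be handled by a separate convolution along the time axis.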
For spatio-temporal feature extraction, we use a spatio-temporal graph convolutional network (ST-GCN). The network takes a sequence of graphs as input and applies graph convolutions to capture features. The convolution operation on a spatio-temporal graph can be expressed as:
$$f = \sum_{k=1}^{K} D_k^{-\frac{1}{2}} A_k D_k^{-\frac{1}{2}} W_k X_k$$
where $f$ is the output feature map, $K$ is the number of subsets of the graph (e.g., partitioning based on joint types), $A_k$ is the adjacency matrix for subset $k$, $D_k$ is the degree matrix of $A_k$ (used for normalization), $W_k$ is the learnable weight matrix for convolution, and $X_k$ is the input feature matrix for subset $k$. This formulation enables the network to aggregate information from neighboring joints and frames, effectively capturing motion patterns.
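
The spatial part of this convolution can be sketched in NumPy as follows. This is an illustrative reading under stated assumptions, not the authors' implementation: it stores features as a (joints × channels) matrix and applies the weights on the right ($\hat{A}_k X W_k$) so that the matrix dimensions agree, it shares one input matrix $X$ across all subsets, and it omits the temporal convolution. The function names and the small `eps` term are likewise assumptions.

```python
import numpy as np


def normalize_adjacency(adj, eps=1e-6):
    """Compute D^{-1/2} A D^{-1/2}; eps avoids division by zero for empty rows."""
    degree = adj.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(degree + eps)
    return adj * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def spatial_graph_conv(x, adj_subsets, weights):
    """Sum over K subsets of (normalized A_k) @ X @ W_k.

    x:           (V, C_in) joint features for one frame
    adj_subsets: list of K arrays, each (V, V)
    weights:     list of K arrays, each (C_in, C_out)
    """
    out = np.zeros((x.shape[0], weights[0].shape[1]))
    for adj_k, w_k in zip(adj_subsets, weights):
        out += normalize_adjacency(adj_k) @ x @ w_k
    return out


# Toy example: a 3-joint chain with K = 2 subsets (self-links and neighbours).
rng = np.random.default_rng(0)
A_self = np.eye(3)
A_neigh = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
X = rng.normal(size=(3, 2))
W = [rng.normal(size=(2, 4)) for _ in range(2)]
f = spatial_graph_conv(X, [A_self, A_neigh], W)
```

Here each joint's output row mixes its own features with those of its neighbours, weighted by the normalized adjacency, which is the aggregation behaviour described above.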
The ST-GCN consists of multiple layers of graph convolutions, each followed by non-linear activation functions and pooling operations. The network architecture is designed to handle variable-length sequences, making it suitable for real-time applications. The output of the ST-GCN is a set of spatio-temporal features that encode the dynamic posture of workers over time. These features are then used for violation behavior recognition.
To train the ST-GCN, we use a loss function that minimizes the difference between predicted and ground-truth features. The loss function $\mathcal{L}$ is defined as:
$$\mathcal{L} = \frac{1}{N} \sum_{i=1}^{N} \| f_i - \hat{f}_i \|^2$$
where $N$ is the number of training samples, $f_i$ is the extracted feature for sample $i$, and $\hat{f}_i$ is the target feature. During training, we optimize this loss using gradient descent, and the convergence curve shows rapid decrease, indicating effective feature learning.
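
A minimal sketch of this loss, assuming each feature is flattened to a vector and the norm is the Euclidean norm; the function name `feature_mse_loss` and the toy values are illustrative only.

```python
import numpy as np


def feature_mse_loss(pred, target):
    """Mean over N samples of the squared L2 distance between feature vectors."""
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    per_sample = np.sum((pred - target) ** 2, axis=1)  # ||f_i - f_hat_i||^2
    return float(per_sample.mean())


# Two samples with squared distances 4 and 25, so the mean is 14.5.
pred = [[1.0, 2.0], [0.0, 0.0]]
target = [[1.0, 0.0], [3.0, 4.0]]
loss = feature_mse_loss(pred, target)
```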
Violation Behavior Recognition via Classifier Module
After extracting spatio-temporal features, we employ a classifier module to identify violation behaviors. The classifier is designed to perform fine-grained analysis by comparing the extracted features with a pre-defined database of violation behaviors. This database contains annotated examples of common violations in aerial work, such as improper use of safety gear or unsafe postures.
The classifier module $F$ takes two adjacent granularity features $f_1$ and $f_2$ from the ST-GCN output and computes an attention-weighted similarity. The formulation is:
$$F(f_1, f_2) = \beta \cdot \arctan\left( \frac{f_1 + f_2}{w \times h} \right)$$
where $\beta$ is an attention matrix that emphasizes relevant features, $w$ and $h$ are the width and height dimensions of the feature map, and $\arctan$ is used to normalize the output. This design allows the classifier to focus on discriminative spatio-temporal patterns.
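
One possible reading of this formula is sketched below. It assumes that $f_1$, $f_2$, and $\beta$ are feature maps of the same shape $(h, w)$ and that the product with $\beta$ is element-wise; the source does not state either point, so both are assumptions, as is the function name `fuse_features`.

```python
import numpy as np


def fuse_features(f1, f2, beta):
    """Element-wise beta * arctan((f1 + f2) / (w * h)) for (h, w) feature maps."""
    h, w = f1.shape
    return beta * np.arctan((f1 + f2) / (w * h))


# With h = 2, w = 3 and f1 + f2 = 6 everywhere, the argument of arctan is 1.
f1 = np.full((2, 3), 3.0)
f2 = np.full((2, 3), 3.0)
beta = np.ones((2, 3))
fused = fuse_features(f1, f2, beta)
```

Because $\arctan$ maps any real input into $(-\pi/2, \pi/2)$, the fused response stays bounded before the attention weights are applied, which is the normalizing role described above.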
To determine whether a behavior is a violation, we compute the similarity between the extracted features and the database entries. The similarity measure $\eta$ for joint points is based on an embedded Gaussian function:
$$\eta(f, g) = \frac{\exp(\theta(f)^T \theta(g))}{\sum_{c=1}^{C} \exp(\theta(f)^T \theta(g_c))}$$
where $f$ represents the features from the input sequence, $g$ represents a database entry, $\theta$ is an embedding function that maps features to a common space, $C$ is the number of terms $g_c$ in the normalizing sum (the number of channels in the embedding), and the superscript $T$ denotes the matrix transpose. Because of this normalization, $\eta$ lies between 0 and 1, and the measure captures the correspondence between observed behaviors and known violations.
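
The expression for $\eta$ has the form of a softmax over embedded dot products. The sketch below illustrates it under two assumptions not stated in the source: $\theta$ is a linear embedding given by a matrix `theta`, and the $g_c$ are treated as $C$ candidate entries stacked row by row. The function name is hypothetical.

```python
import numpy as np


def embedded_gaussian_similarity(f, candidates, theta):
    """Softmax over the dot products theta(f)^T theta(g_c) for each candidate g_c."""
    q = theta @ f                    # embed the query feature, shape (d,)
    keys = candidates @ theta.T      # embed each candidate, shape (C, d)
    logits = keys @ q                # one dot product per candidate, shape (C,)
    logits = logits - logits.max()   # stabilise the exponentials
    weights = np.exp(logits)
    return weights / weights.sum()


theta = np.eye(2)                    # identity embedding for the toy example
f = np.array([1.0, 0.0])
candidates = np.array([[1.0, 0.0], [0.0, 1.0]])
eta = embedded_gaussian_similarity(f, candidates, theta)
```

The scores sum to 1 across candidates, and the candidate aligned with the query receives the larger share.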
The classifier outputs a probability score for each violation category, and the behavior is classified as a violation if the score exceeds a threshold. The overall recognition process is summarized in the following table, which outlines the steps from data acquisition to classification.
| Step | Description | Key Components |
|---|---|---|
| 1. Data Acquisition | Capture sequential images using tethered UAV drones. | UAV drone, camera, tethered cable |
| 2. Preprocessing | Detect human skeletal joints in each frame. | Human pose estimation algorithms |
| 3. Graph Construction | Build spatio-temporal graphs from joint sequences. | Nodes (joints), edges (connections) |
| 4. Feature Extraction | Apply ST-GCN to extract spatio-temporal features. | Graph convolutions, pooling layers |
| 5. Classification | Compare features with violation database using classifier. | Attention mechanism, similarity computation |
| 6. Output | Identify violation behaviors and generate alerts. | Probability scores, thresholding |
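
The final thresholding step can be sketched as follows. The threshold value of 0.5, the category names, and the scores are hypothetical; the source does not report the threshold it uses.

```python
def flag_violations(scores, threshold=0.5):
    """Return the categories whose score exceeds the threshold, highest first."""
    flagged = [(name, s) for name, s in scores.items() if s > threshold]
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# Hypothetical classifier scores for one image sequence.
scores = {"no_safety_helmet": 0.91, "unsafe_posture": 0.35, "tool_misuse": 0.62}
alerts = flag_violations(scores, threshold=0.5)
```

In practice the threshold trades missed violations against false alarms, so it would normally be tuned on validation data rather than fixed in advance.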
Experimental Setup and Performance Evaluation
We conducted experiments in real-world scenarios of overhead transmission line aerial work to validate the proposed method. The tethered UAV drone was deployed to collect image data under various conditions, including daytime and nighttime operations. The dataset comprised over 10,000 image sequences, each annotated with violation labels by safety experts.
The experimental setup involved comparing our method with two state-of-the-art approaches: M3CFC-YOLOv7-tiny and a machine vision-based method. These were chosen as baselines due to their relevance in violation behavior recognition. We evaluated performance using metrics such as accuracy, precision, recall, and area under the ROC curve (AUC).
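
For reference, these metrics can be computed for a binary violation/no-violation task as in the sketch below. It uses the standard definitions; the function names and toy labels are illustrative, and the AUC is computed as the fraction of positive/negative pairs ranked correctly, which is equivalent to the area under the ROC curve.

```python
import numpy as np


def binary_metrics(y_true, y_pred):
    """Accuracy, precision, and recall from binary labels and predictions."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    tp = int(np.sum(y_true & y_pred))
    tn = int(np.sum(~y_true & ~y_pred))
    fp = int(np.sum(~y_true & y_pred))
    fn = int(np.sum(y_true & ~y_pred))
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


def roc_auc(y_true, scores):
    """AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    y_true = np.asarray(y_true, dtype=bool)
    scores = np.asarray(scores, dtype=np.float64)
    pos, neg = scores[y_true], scores[~y_true]
    diff = pos[:, None] - neg[None, :]
    wins = np.sum(diff > 0) + 0.5 * np.sum(diff == 0)
    return float(wins / (len(pos) * len(neg)))


y_true = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.1]
y_pred = [s > 0.5 for s in scores]
metrics = binary_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
```

Unlike the other three metrics, AUC does not depend on the decision threshold, which makes it a useful complement when comparing methods.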
The ST-GCN was trained on a high-performance computing cluster, with hyperparameters optimized through cross-validation. The training loss at successive iterations is shown in the table below; it converges rapidly to a low value.
| Iteration | Loss Value |
|---|---|
| 100 | 0.85 |
| 200 | 0.42 |
| 300 | 0.21 |
| 400 | 0.11 |
| 500 | 0.06 |
| 600 | 0.03 |
| 700 | 0.02 |
| 800 | 0.01 |
The table shows the loss falling from 0.85 at iteration 100 to 0.01 at iteration 800, roughly halving every 100 iterations up to iteration 600. This indicates that the ST-GCN learns the spatio-temporal features effectively and supports its ability to model dynamic behaviors.
For violation behavior recognition, we computed ROC curves for all methods. The results are summarized in the table below, which shows AUC values and other performance metrics.
| Method | AUC | Accuracy | Precision | Recall |
|---|---|---|---|---|
| Proposed Method (ST-GCN with UAV drone) | 0.96 | 0.94 | 0.93 | 0.95 |
| M3CFC-YOLOv7-tiny | 0.88 | 0.85 | 0.84 | 0.86 |
| Machine Vision-Based Method | 0.82 | 0.80 | 0.79 | 0.81 |
The proposed method achieves an AUC of 0.96, compared with 0.88 for M3CFC-YOLOv7-tiny and 0.82 for the machine vision-based method, and it also leads on accuracy, precision, and recall. We attribute this improvement to the spatio-temporal feature extraction enabled by the ST-GCN, which captures dynamic behaviors that single-frame methods miss. The use of UAV drones for data acquisition provides high-quality input images, further enhancing recognition accuracy.
To analyze the feature extraction capability, we also evaluated the similarity scores $\eta$ for different violation categories. The following table presents average similarity scores for common violations, demonstrating the discriminative power of our features.
| Violation Category | Similarity Score $\eta$ |
|---|---|
| No Safety Helmet | 0.92 |
| Improper Harness Use | 0.89 |
| Unsafe Posture | 0.91 |
| Tool Misuse | 0.87 |
| Proximity to Live Wires | 0.94 |
High similarity scores indicate that the extracted features closely match the database entries, enabling reliable violation detection. The UAV drone’s role in capturing clear and sequential images is crucial for obtaining these scores.
Discussion and Advantages of the Proposed Method
The proposed method offers several advantages over existing approaches. First, the integration of tethered UAV drones provides a stable and continuous monitoring solution, overcoming limitations of manual supervision. The UAV drone can hover at optimal altitudes, capturing comprehensive views of aerial work scenes without obstruction. Second, the ST-GCN effectively models spatio-temporal dependencies, allowing for accurate recognition of dynamic violation behaviors. This is a significant improvement over methods that rely on static image features.
Moreover, the classifier module with attention mechanism enhances recognition precision by focusing on relevant features. The use of a pre-defined violation database ensures that the system can identify a wide range of violations, making it adaptable to different work environments. The real-time processing capability of the UAV drone system enables immediate alerts, potentially preventing accidents.
In terms of computational efficiency, the ST-GCN is designed to balance accuracy and speed. Although graph convolutions can be computationally intensive, optimization techniques such as pruning and quantization can be applied for deployment on edge devices attached to UAV drones. This would allow for on-board processing, reducing reliance on cloud infrastructure.
The following formula summarizes the overall recognition accuracy $A$ as a function of feature quality $Q_f$ and data quality $Q_d$:
$$A = \alpha \cdot Q_f + (1 - \alpha) \cdot Q_d$$
where $\alpha$ is a weighting factor (set to 0.7 in our experiments), $Q_f$ represents the feature quality from ST-GCN (measured by loss values), and $Q_d$ represents the data quality from UAV drone capture (measured by image resolution and frame rate). This formulation highlights the importance of both components in achieving high performance.
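
As a worked illustration of this weighted sum, the sketch below evaluates it with $\alpha = 0.7$. The quality scores 0.9 and 0.8 are hypothetical, and the sketch assumes that $Q_f$ and $Q_d$ have already been normalized to the range $[0, 1]$; the source does not describe that normalization.

```python
def combined_accuracy(q_f, q_d, alpha=0.7):
    """A = alpha * Q_f + (1 - alpha) * Q_d, with alpha in [0, 1]."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * q_f + (1.0 - alpha) * q_d


# 0.7 * 0.9 + 0.3 * 0.8 = 0.63 + 0.24 = 0.87
a = combined_accuracy(q_f=0.9, q_d=0.8)
```

With $\alpha = 0.7$, a given change in feature quality shifts $A$ by more than twice as much as the same change in data quality.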
To further illustrate the system’s robustness, we conducted tests under varying weather conditions, including wind, rain, and fog. The tethered UAV drone operated reliably, thanks to its high wind and rain resistance ratings. Recognition accuracy was above 0.90 in clear, windy, and rainy conditions and fell only slightly, to 0.89, in fog, as shown in the table below.
| Weather Condition | Accuracy | Notes |
|---|---|---|
| Clear Sky | 0.94 | Optimal visibility |
| Windy (Level 6) | 0.92 | UAV drone stable due to tethered design |
| Rainy (5 mm/h) | 0.91 | LED lights enabled for illumination |
| Foggy | 0.89 | Reduced visibility, but features still detectable |
These results underscore the practicality of using UAV drones for safety monitoring in diverse environments. The tethered design ensures that the UAV drone can operate continuously, making it suitable for long-duration aerial work tasks.
Conclusion
In this article, we presented a novel method for violation behavior recognition in aerial work on overhead transmission lines using spatio-temporal graph convolutional networks with tethered UAV drones. The method addresses the limitations of existing approaches by dynamically capturing human posture features from sequential images and accurately identifying violations through a classifier module. Experimental results demonstrate superior performance, with an AUC of 0.96, highlighting the effectiveness of our approach.
The use of UAV drones is central to this work, enabling high-quality data acquisition and real-time monitoring. The ST-GCN provides a robust framework for spatio-temporal feature extraction, while the classifier ensures precise recognition. Future work may involve extending the method to other industrial safety scenarios and integrating additional sensors on UAV drones for multimodal data fusion.
Overall, this research contributes to the intelligent transformation of power system safety management, offering a scalable and efficient solution for violation behavior recognition. By leveraging advanced deep learning techniques and UAV drone technology, we can enhance worker safety and operational efficiency in high-risk environments.
