In power grid inspection, multi-UAV systems are increasingly deployed across China. Our research addresses task allocation and path planning for collaborative UAV fleets performing power line and tower inspections. We propose a framework that integrates a consensus-based bundle algorithm (CBBA) with a graph isomorphism network (GIN) within a deep reinforcement learning paradigm. The approach targets two limitations of traditional methods: difficulty capturing complex topological relationships and poor generalization to large-scale scenarios.
The rapid expansion of China’s power infrastructure demands efficient inspection solutions. Traditional manual inspection is constrained by safety hazards, terrain limitations, and low throughput. UAV technology has transformed this landscape, yet the limited endurance of a single UAV necessitates multi-vehicle collaboration. The core difficulty lies in solving task allocation and path optimization simultaneously under dynamic constraints such as battery capacity, communication reliability, and heterogeneous mission requirements.
Problem Formulation
We model the multi-UAV cooperative inspection problem as a combined task allocation and routing optimization. Consider a set of N homogeneous UAVs deployed from a base station. The inspection area contains M power poles (nodes) requiring visitation. Each drone must depart from the base, visit a subset of nodes, and return. The overall objective is to minimize the maximum path length among all drones while ensuring complete coverage. The allocation stage is formulated as maximizing the total task benefit:
$$ \max \sum_{i=1}^{N} \sum_{j=1}^{M} c_{ij}(x_i, p_i) x_{ij} $$
subject to constraints:
$$ \sum_{j=1}^{M} x_{ij} \leq L_m, \quad \forall i $$
$$ \sum_{i=1}^{N} \sum_{j=1}^{M} x_{ij} = M $$
$$ V \leq Q $$
$$ x_{ij} \in \{0,1\} $$
$$ V = \lambda (W + m + m L_m) $$
Here, \(x_{ij}\) is a binary decision variable equal to 1 when drone i is assigned task j, \(c_{ij}\) is the benefit of drone i executing task j given its position \(x_i\) and attitude \(p_i\), \(L_m\) is the maximum task capacity per drone, V is the total energy consumption, Q is the battery capacity, and \(\lambda\) is the energy coefficient.
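To make the constraints concrete, the following minimal sketch checks a candidate assignment matrix and evaluates the benefit objective. The function name `evaluate_assignment` and the example values are hypothetical. The energy constraint \(V \leq Q\) is left out because the text does not define W and m, and the per-task uniqueness check is an added assumption that is stricter than the stated total-coverage constraint.

```python
def evaluate_assignment(x, benefit, max_tasks):
    """Check feasibility of x and return (feasible, total_benefit).

    x[i][j] is 1 when drone i takes task j; benefit[i][j] plays the role of c_ij.
    """
    n_tasks = len(x[0])
    # Binary constraint: every x_ij is 0 or 1.
    binary = all(v in (0, 1) for row in x for v in row)
    # Capacity constraint: each drone carries at most L_m tasks.
    within_capacity = all(sum(row) <= max_tasks for row in x)
    # Coverage constraint from the text: the total number of assignments equals M.
    covers_all = sum(sum(row) for row in x) == n_tasks
    # Assumption (not stated in the text): no task is assigned to two drones.
    no_duplicates = all(sum(row[j] for row in x) <= 1 for j in range(n_tasks))
    total_benefit = sum(
        benefit[i][j] * x[i][j] for i in range(len(x)) for j in range(n_tasks)
    )
    return binary and within_capacity and covers_all and no_duplicates, total_benefit


# Two drones, four tasks, at most two tasks per drone (illustrative values).
x = [[1, 1, 0, 0], [0, 0, 1, 1]]
benefit = [[3.0, 2.0, 1.0, 0.5], [0.5, 1.0, 2.0, 3.0]]
feasible, score = evaluate_assignment(x, benefit, max_tasks=2)
```

In this example the assignment is feasible, while giving three tasks to the first drone would violate the capacity constraint.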
Multi-UAV Task Pre-allocation via CBBA
To handle the combinatorial complexity, we employ the consensus-based bundle algorithm (CBBA) as a distributed auction mechanism. Each UAV maintains a local bundle of assigned tasks:
$$ B_i = \{ b_i^1, b_i^2, \dots, b_i^{L_i} \} $$
where \(L_i\) is the maximum number of tasks for drone i. The algorithm iterates through three phases: bundle construction, communication, and conflict resolution. During communication, drones exchange bids and update their local lists using consensus rules:
$$ \begin{cases}
\text{update: } & z_{kj} = z_{ki}, \quad y_{kj} = y_{ki} \\
\text{reset: } & z_{kj} = \emptyset, \quad y_{kj} = 0 \\
\text{leave: } & z_{kj} = z_{kj}, \quad y_{kj} = y_{kj}
\end{cases} $$
where \(z_{kj}\) is the assignment result and \(y_{kj}\) is the current bid for task k as held by drone j, and \(z_{ki}\), \(y_{ki}\) are the corresponding values received from drone i. Because each drone exchanges only bids and winner lists with its neighbors, this distributed mechanism keeps communication overhead low and scales to large UAV fleets operating in complex environments.
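The three consensus actions can be illustrated with a deliberately simplified rule set. This is a sketch under stated assumptions and not the full CBBA decision table, which also uses timestamps and tie-breaking by agent identifier. The function name `consensus_step` and the trigger conditions for each action are hypothetical.

```python
def consensus_step(z_kj, y_kj, z_ki, y_ki, sender):
    """Return the receiver's new (winner, bid) for task k after one message.

    z_kj, y_kj: receiver's current belief about the winner and winning bid.
    z_ki, y_ki: the sender's belief, as carried by the message.
    Simplified assumption: a strictly higher bid always wins.
    """
    if z_ki is not None and y_ki > y_kj:
        return z_ki, y_ki   # update: adopt the sender's information
    if z_kj == sender and z_ki is None:
        return None, 0.0    # reset: the sender no longer claims the task
    return z_kj, y_kj       # leave: keep the local belief


# The receiver believes uav2 won with bid 0.4; uav1 reports its own bid of 0.9.
updated = consensus_step("uav2", 0.4, "uav1", 0.9, sender="uav1")
# The receiver believes uav1 won, but uav1 reports the task as unassigned.
cleared = consensus_step("uav1", 0.9, None, 0.0, sender="uav1")
# The sender's bid is lower, so the receiver keeps its belief.
kept = consensus_step("uav2", 0.9, "uav1", 0.4, sender="uav1")
```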
Deep Reinforcement Learning Framework for Path Planning
After task allocation, we optimize the visiting sequence of each drone using a deep reinforcement learning architecture centered on graph isomorphism networks (GIN) and attention mechanisms. The framework is illustrated conceptually in the training pipeline shown below.

The input consists of node coordinates and a depot node. GIN generates embeddings for each tower node \(h_{t_i}\) and the depot node \(h_d\). We then compute a global graph embedding via average pooling:
$$ h_g = \frac{1}{M} \sum_{i=1}^{M} h_{t_i} $$
The joint embedding concatenates the global graph embedding and the depot embedding:
$$ h_c = [h_g : h_d] $$
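The embedding pipeline above can be sketched in plain Python. The layer below uses the standard GIN update, in which each node's feature is combined with the sum of its neighbors' features and passed through a small MLP. The fully connected graph, the random weights, the embedding size of 4, and all function names are assumptions made for illustration; the text does not specify them.

```python
import random


def linear(weight, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]


def gin_layer(h, neighbors, w1, w2, eps=0.0):
    """One GIN update: h_v <- MLP((1 + eps) * h_v + sum of neighbor features)."""
    out = []
    for v, h_v in enumerate(h):
        agg = [(1.0 + eps) * a for a in h_v]
        for u in neighbors[v]:
            agg = [a + b for a, b in zip(agg, h[u])]
        hidden = [max(0.0, z) for z in linear(w1, agg)]  # ReLU
        out.append(linear(w2, hidden))
    return out


def mean_pool(vectors):
    """Average pooling over a list of equally sized vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


rng = random.Random(0)
coords = [[rng.random(), rng.random()] for _ in range(6)]  # index 0 is the depot
neighbors = [[u for u in range(6) if u != v] for v in range(6)]  # assumed complete graph
dim = 4
w1 = [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(dim)]
w2 = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(dim)]

h = gin_layer(coords, neighbors, w1, w2)
h_d, h_towers = h[0], h[1:]
h_g = mean_pool(h_towers)  # global graph embedding
h_c = h_g + h_d            # list concatenation, giving a vector of length 2 * dim
```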
An attention-driven policy network computes query, key, and value vectors:
$$ k_i = \theta_k h_{t_i}, \quad v_i = \theta_v h_{t_i}, \quad q = \theta_q h_c $$
Attention scores are:
$$ u_i = \frac{q^T k_i}{\sqrt{d_k}} $$
Softmax normalizes to probabilities:
$$ \omega_i = \frac{e^{u_i}}{\sum_j e^{u_j}} $$
The drone embedding is then:
$$ h_a = \sum_i \omega_i v_i $$
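The attention step from query to drone embedding can be sketched as follows. The projection matrices here are small hand-picked placeholders rather than learned parameters, and the helper names are hypothetical.

```python
import math


def matvec(weight, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]


def softmax(scores):
    """Numerically stable softmax."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def glimpse(h_c, tower_embeddings, theta_q, theta_k, theta_v):
    """Return (h_a, weights) for one attention pass over the tower nodes."""
    q = matvec(theta_q, h_c)
    keys = [matvec(theta_k, h) for h in tower_embeddings]
    values = [matvec(theta_v, h) for h in tower_embeddings]
    d_k = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
    weights = softmax(scores)
    h_a = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(len(values[0]))]
    return h_a, weights


# Placeholder parameters: h_c has length 4, tower embeddings have length 2.
theta_q = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
h_a, weights = glimpse([2.0, 0.0, 0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], theta_q, identity, identity)
```

With these placeholder values the query aligns with the first tower, so that tower receives the larger attention weight.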
Finally, we compute the probability of visiting each remaining node using a second attention mechanism:
$$ k_i' = \theta_{k'} h_i, \quad q' = \theta_{q'} h $$
$$ u_i' = \frac{q'^T k_i'}{\sqrt{d_{k'}}} $$
$$ P_i = C \tanh(u_i') $$
$$ p_i = \frac{e^{P_i}}{\sum_{j=1}^{n} e^{P_j}} $$
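The final selection step can be sketched as clipped logits followed by a softmax restricted to unvisited nodes. The clipping constant of 10 is an assumption borrowed from common practice in attention-based routing models; the text does not give the value of C. The function name is hypothetical.

```python
import math


def selection_probabilities(scores, visited, clip=10.0):
    """Turn raw attention scores u_i' into probabilities over unvisited nodes."""
    logits = [clip * math.tanh(u) for u in scores]  # P_i = C * tanh(u_i')
    candidates = [i for i, done in enumerate(visited) if not done]
    top = max(logits[i] for i in candidates)        # raises if nothing remains
    exps = {i: math.exp(logits[i] - top) for i in candidates}
    total = sum(exps.values())
    # Visited nodes get probability zero; the rest share a softmax.
    return [exps[i] / total if i in exps else 0.0 for i in range(len(scores))]


probs = selection_probabilities([0.5, -0.2, 1.0], visited=[False, True, False])
```

The tanh clipping bounds every logit to the interval from -C to C, which keeps the softmax from saturating early in training.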
Training employs policy gradient with reward defined as the negative of the maximum tour length. The gradient update is:
$$ \nabla_\theta L = \mathbb{E}\left[ \sum_{t} \nabla_\theta \log \pi_\theta(a_t|s_t) R(\lambda) \right] $$
where \(R(\lambda)\) is the discounted return. The complete training procedure is summarized in Table 1.
Table 1: Training Procedure for Policy Network
| Step | Description |
|---|---|
| Input | Policy network \(\pi_\theta\), number of epochs E, instances per epoch B, max steps T, learning rate \(\alpha\), discount \(\gamma\) |
| 1 | Randomly initialize network parameters \(\theta\) |
| 2 | For epoch = 1 to E: |
| 3 | Initialize empty experience buffer P |
| 4 | For instance = 1 to B: |
| 5 | Initialize environment state \(s_0\), t=0 |
| 6 | Initialize empty trajectory p |
| 7 | While t < T: |
| 8 | Sample action \(a_t \sim \pi_\theta(a_t \mid s_t)\) |
| 9 | Obtain reward \(r_t\) and next state \(s_{t+1}\) |
| 10 | Store \((s_t, a_t, r_t, s_{t+1})\) in p |
| 11 | t ← t+1 |
| 12 | Append p to P |
| 13 | Compute policy gradient \(\nabla_\theta L\) from P |
| 14 | Update \(\theta \leftarrow \theta + \alpha \nabla_\theta L\) |
| Output | Optimized policy network \(\pi_{\theta'}\) |
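The loop structure of Table 1 can be illustrated with a toy policy-gradient example. This sketch replaces the GIN and attention policy with a single-parameter softmax over negative distances, handles one drone only, and subtracts a batch-mean baseline for variance reduction. All three choices are assumptions made to keep the example short and are not part of the described method.

```python
import math
import random


def tour_length(points, order):
    """Length of the closed tour depot (index 0) -> order -> depot."""
    path = [0] + order + [0]
    return sum(math.dist(points[a], points[b]) for a, b in zip(path, path[1:]))


def rollout(points, theta, rng):
    """Sample one tour; return (order, sum over steps of d log pi / d theta)."""
    current, remaining = 0, list(range(1, len(points)))
    order, grad_logp = [], 0.0
    while remaining:
        dists = [math.dist(points[current], points[j]) for j in remaining]
        logits = [-theta * d for d in dists]
        top = max(logits)
        exps = [math.exp(l - top) for l in logits]
        total = sum(exps)
        probs = [e / total for e in exps]
        k = rng.choices(range(len(remaining)), weights=probs)[0]
        # Gradient of log softmax(-theta * d)[k] with respect to theta.
        grad_logp += -dists[k] + sum(p * d for p, d in zip(probs, dists))
        current = remaining.pop(k)
        order.append(current)
    return order, grad_logp


def train(epochs=30, batch=64, n_nodes=10, lr=0.5, seed=0):
    """Outer loop over epochs, inner loop over instances, then one update."""
    rng = random.Random(seed)
    theta = 0.0
    for _ in range(epochs):
        samples = []
        for _ in range(batch):
            pts = [(rng.random(), rng.random()) for _ in range(n_nodes + 1)]
            order, g = rollout(pts, theta, rng)
            samples.append((g, -tour_length(pts, order)))  # reward = negative length
        baseline = sum(r for _, r in samples) / batch       # assumed baseline
        grad = sum(g * (r - baseline) for g, r in samples) / batch
        theta += lr * grad                                  # gradient ascent step
    return theta
```

Calling `train()` returns the learned scalar; a larger value means the toy policy prefers nearer nodes more strongly.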
Experimental Results
Training Configuration
We conducted experiments on a system with an Intel i7-12700H CPU and an RTX 4060 GPU. Training datasets were randomly generated with nodes uniformly distributed in the unit square (0,1)². Two scenarios were configured: 3 drones with 30 nodes, and 6 drones with 60 nodes. Hyperparameters were a learning rate of 1×10⁻⁴, a batch size of 512, the Adam optimizer, and 2500 iterations. The training curves show convergence after approximately 2000 iterations. The average path length decreased steadily, with the 6-drone scenario converging more smoothly but requiring more computation per iteration.
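A training instance and its reward can be sketched as follows, assuming the depot is also drawn uniformly from the unit square, which the text does not state. The reward is the negative of the longest route, matching the min-max objective. Function names and the round-robin split are illustrative only.

```python
import math
import random


def make_instance(n_nodes, seed=None):
    """Draw a depot and n_nodes tower positions uniformly from the unit square."""
    rng = random.Random(seed)
    depot = (rng.random(), rng.random())  # assumed: depot sampled like the nodes
    nodes = [(rng.random(), rng.random()) for _ in range(n_nodes)]
    return depot, nodes


def route_length(depot, nodes, route):
    """Length of depot -> nodes in route order -> depot."""
    stops = [depot] + [nodes[j] for j in route] + [depot]
    return sum(math.dist(a, b) for a, b in zip(stops, stops[1:]))


def reward(depot, nodes, routes):
    """Negative of the maximum route length across all drones."""
    return -max(route_length(depot, nodes, r) for r in routes)


# 3 drones and 30 nodes, split round-robin purely for illustration.
depot, nodes = make_instance(30, seed=42)
routes = [list(range(k, 30, 3)) for k in range(3)]
r = reward(depot, nodes, routes)
```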
Comparison with OR-Tools on Random Datasets
We benchmarked our method against OR-Tools, Google's widely used open-source optimization suite, with a maximum runtime limit of 1800 seconds. Tables 2 and 3 present the average path length for 3-drone and 6-drone configurations across different node counts.
Table 2: Random Dataset – 3 UAVs Path Length
| Number of Nodes | OR-Tools | Our Method |
|---|---|---|
| 200 | 5.698 | 4.328 |
| 300 | 6.547 | 5.075 |
| 400 | 7.512 | 5.744 |
| 500 | 8.438 | 6.186 |
Table 3: Random Dataset – 6 UAVs Path Length
| Number of Nodes | OR-Tools | Our Method |
|---|---|---|
| 200 | 5.711 | 3.044 |
| 300 | 7.122 | 3.511 |
| 400 | 7.749 | 3.941 |
| 500 | 9.058 | 4.198 |
Our method yields shorter paths than OR-Tools at every tested node count. For 3 UAVs at 500 nodes, path length was reduced by 26.7% (8.438 vs. 6.186); for 6 UAVs at 500 nodes, the reduction reached 53.7% (9.058 vs. 4.198). These results indicate that the GIN-based approach scales to large multi-UAV operations.
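The reported reductions follow directly from the table entries as (baseline - ours) / baseline. The short check below reproduces them from Tables 2 and 3; the helper name is hypothetical.

```python
def reduction_pct(baseline, ours):
    """Relative path-length reduction in percent."""
    return 100.0 * (baseline - ours) / baseline


# (OR-Tools, our method) per node count, copied from Tables 2 and 3.
table2 = {200: (5.698, 4.328), 300: (6.547, 5.075), 400: (7.512, 5.744), 500: (8.438, 6.186)}
table3 = {200: (5.711, 3.044), 300: (7.122, 3.511), 400: (7.749, 3.941), 500: (9.058, 4.198)}

three_uav = {n: round(reduction_pct(*pair), 1) for n, pair in table2.items()}
six_uav = {n: round(reduction_pct(*pair), 1) for n, pair in table3.items()}
# three_uav[500] is 26.7 and six_uav[500] is 53.7, matching the text.
```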
Real-world Inspection Scenario
We applied K-means clustering to partition a real power inspection area into 16 subregions with varying pole counts. Table 4 compares our method with OR-Tools and a genetic algorithm across selected subregions and the full area (All).
Table 4: Path Length Comparison on Real Scenario
| Method | Region 3 | Region 6 | Region 10 | All |
|---|---|---|---|---|
| OR-Tools | 0.631 | 0.479 | 0.998 | 9.231 |
| Genetic Algorithm | 1.211 | 2.088 | 4.888 | – |
| Our Method | 0.355 | 0.199 | 0.511 | 4.221 |
For the entire inspection area, our method achieved a path length of 4.221, which is 54.3% shorter than OR-Tools (9.231). The genetic algorithm failed to converge for the full area. This real-world validation supports the practical advantage of our approach for UAV power inspection tasks.
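The K-means partitioning used to form the subregions can be sketched with Lloyd's algorithm. This minimal version picks initial centers by random sampling and runs for a fixed iteration budget. Both choices are assumptions, since the text does not describe the initialization or stopping rule.

```python
import math
import random


def kmeans(points, k, iters=50, seed=0):
    """Partition 2-D points into k clusters; return (labels, centers)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # assumed initialization
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each pole goes to its nearest center.
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]
        # Update step: move each center to the mean of its members.
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centers.append(centers[c])  # keep an empty cluster's center
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers


# Two well-separated groups of poles (illustrative coordinates).
poles = [(0.0, 0.0), (0.0, 0.1), (5.0, 5.0), (5.0, 5.1)]
labels, centers = kmeans(poles, k=2)
```

Each resulting cluster can then be treated as an independent routing instance, which is how the subregions in Table 4 are used.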
Runtime Efficiency
Figure 8 shows that as node count increases, our method’s path length grows only moderately, while the path length produced by OR-Tools degrades significantly. Figure 9 illustrates runtime scaling: our method completes within seconds, whereas OR-Tools hits the 1800-second limit for large instances. For real subregions, our runtime stays below 0.5 seconds, as shown in Figure 10. This improvement stems from the forward-pass efficiency of the trained deep network, which avoids iterative combinatorial search at inference time.
Conclusion
We have presented a deep reinforcement learning framework that integrates CBBA for task allocation and GIN-based attention for path optimization in multi-UAV power inspection. The method addresses the limitations of traditional approaches in capturing complex node-topology relationships and scales to large instances. Experimental results show that the approach reduces path length by up to 53.7% on random datasets and by 54.3% on the full real-world area compared to OR-Tools, while cutting computation time from 1800 seconds to under 10 seconds, and to below 0.5 seconds on real subregions. This work offers a scalable and practical solution for collaborative UAV power inspection, with potential extensions to dynamic environments and heterogeneous fleets.
