Multi-UAV Task Allocation and Path Optimization via Deep Reinforcement Learning for Power Inspection

In this work, we present a deep reinforcement learning framework that integrates a consensus-based bundle algorithm with a graph isomorphism network to solve the coupled task allocation and path planning problem for multiple UAVs performing collaborative power line inspection. Our approach addresses two weaknesses of traditional methods: the difficulty of modeling complex topological relationships among inspection nodes, and poor generalization. By designing a joint embedding representation and an attention-driven policy network, we capture the cooperative relationships between nodes and drones and enable end-to-end decision making. Experiments against an OR-Tools baseline show that our method improves both solution quality and computational efficiency, especially in large-scale scenarios, providing a scalable and practical solution for multi-UAV inspection systems deployed in China’s power grid.

1. Introduction

Power line inspection is critical for the stable operation of electrical grids. Traditional manual inspection suffers from low efficiency, safety risks, and terrain limitations. Unmanned aerial vehicles (UAVs) have become the mainstream alternative, but the limited endurance of a single drone necessitates a collaborative vehicle-UAV paradigm. In this setting, task allocation and path planning for multiple UAVs become the core challenge for system-level optimization. This paper develops an intelligent decision-making method for the dynamic and complex inspection environment, in which a fleet of homogeneous UAVs departs from a carrier vehicle, visits all assigned towers, and returns. We propose a hybrid framework that combines the distributed consensus-based bundle algorithm (CBBA) for initial task allocation with a deep reinforcement learning (DRL) planner enhanced by graph isomorphism network (GIN) embeddings and attention mechanisms. The goal is to minimize the maximum path length over all drones while ensuring that all tasks are completed. On both synthetic and real-world datasets, our method produces shorter paths than an OR-Tools baseline in a fraction of the computation time, demonstrating strong generalization and scalability.

2. Problem Formulation

We consider $N$ homogeneous UAVs, denoted by the set $U = \{1,2,\ldots,N\}$, that must inspect $M$ power towers (nodes) distributed over a predefined area. Each drone starts and ends at the same depot (vehicle) location. The problem is to assign each tower to exactly one drone and to plan a tour for each drone so that the mission completion time, measured by the maximum path length over all drones, is minimized. We make the following assumptions: drones fly at constant speed, battery capacity is sufficient for the planned routes, communication between drones and the depot is reliable, and the depot remains stationary during the inspection.

The task allocation model can be formulated as:

$$ \max \sum_{i=1}^{N} \left[ \sum_{j=1}^{M} c_{ij}(x_i, p_i) x_{ij} \right] $$

where $c_{ij}(x_i, p_i)$ is the benefit of assigning tower $j$ to drone $i$ given its position $x_i$ and orientation $p_i$, and $x_{ij}$ is a binary decision variable. The constraints include:

$$ \sum_{j=1}^{M} x_{ij} \leq L_m,\quad \forall i $$
$$ \sum_{i=1}^{N} \sum_{j=1}^{M} x_{ij} = M $$
$$ x_{ij} \in \{0,1\} $$
$$ V = \lambda (W + m + m L_m) \leq Q $$

Here $L_m$ is the maximum number of tasks per drone, $V$ is the total energy consumption, $\lambda$ is the energy coefficient, $W$ is the drone weight, $m$ is the basic equipment mass, and $Q$ is the battery capacity. This model ensures that each task is assigned exactly once and that every assignment respects the drone’s payload and energy limits.
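The constraints above can be checked mechanically for a candidate assignment. The following is a minimal sketch, not the authors' implementation: the function name, argument names, and the toy numbers in the usage note are hypothetical, and the per-tower check is added here only because the text requires each tower to be served by exactly one drone.

```python
def is_feasible(x, max_tasks, energy_coeff, drone_weight, equip_mass, battery_capacity):
    """Check an N x M binary assignment matrix x against the stated constraints.

    x[i][j] == 1 means tower j is assigned to drone i. All argument names are
    illustrative; they map to L_m, lambda, W, m and Q in the text.
    """
    n_towers = len(x[0])
    # x_ij must be binary.
    if any(v not in (0, 1) for row in x for v in row):
        return False
    # Each drone carries at most L_m tasks.
    if any(sum(row) > max_tasks for row in x):
        return False
    # The total number of assignments equals M.
    if sum(sum(row) for row in x) != n_towers:
        return False
    # Assumption from the prose: every tower is served by exactly one drone.
    if any(sum(row[j] for row in x) != 1 for j in range(n_towers)):
        return False
    # Energy bound V = lambda * (W + m + m * L_m) <= Q.
    energy = energy_coeff * (drone_weight + equip_mass + equip_mass * max_tasks)
    return energy <= battery_capacity
```

For example, with two drones and four towers, `is_feasible([[1, 1, 0, 0], [0, 0, 1, 1]], 2, 1.0, 5.0, 1.0, 10.0)` passes every check, while an assignment that gives one tower to both drones is rejected even when the total count still equals $M$.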

3. Consensus-Based Bundle Algorithm for Pre-Allocation

To obtain an initial feasible task allocation that respects communication and computational constraints, we employ a distributed CBBA. Each drone $i$ maintains its own bundle of tasks $B_i$, path sequence $P_i$, and communication timestamps $S_i$. The algorithm iterates over two phases: bundle construction through greedy insertion, and conflict resolution via consensus among neighbors. When drone $j$ receives information about task $k$ from a neighboring drone $i$, it takes one of three actions:

$$ \begin{cases} \text{update: } z_{kj}=z_{ki},\ y_{kj}=y_{ki} \\ \text{reset: } z_{kj}=\emptyset,\ y_{kj}=0 \\ \text{leave: } z_{kj}=z_{kj},\ y_{kj}=y_{kj} \end{cases} $$

where $z_{kj}$ is the allocation result of drone $j$ for task $k$, and $y_{kj}$ is the corresponding bid. Through repeated rounds of communication, the algorithm converges to a conflict-free assignment. This distributed approach reduces the computational burden on a central unit and is well suited to large UAV fleets.
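To make the update and leave actions concrete, the sketch below merges a neighbor's beliefs into a drone's local view by keeping the higher bid for each task. It is a deliberately simplified illustration under stated assumptions: it omits the reset action and the timestamps $S_i$ that full CBBA uses, it assumes a fully connected network, and the function names are hypothetical.

```python
def consensus_step(z_j, y_j, z_i, y_i):
    """Merge neighbour i's winners (z_i) and bids (y_i) into drone j's view.

    Simplified rule: adopt the neighbour's entry when its bid is strictly
    higher (update), otherwise keep the local entry (leave). Ties are left
    unresolved here; a full implementation needs a tie-breaking rule.
    """
    new_z, new_y = list(z_j), list(y_j)
    for k in range(len(z_j)):
        if y_i[k] > y_j[k]:
            new_z[k], new_y[k] = z_i[k], y_i[k]  # update
    return new_z, new_y


def run_consensus(winners, bids, max_rounds=100):
    """Repeat pairwise merges over all drones until no local view changes."""
    n = len(winners)
    for _ in range(max_rounds):
        changed = False
        for j in range(n):
            for i in range(n):
                if i == j:
                    continue
                z, y = consensus_step(winners[j], bids[j], winners[i], bids[i])
                if z != winners[j] or y != bids[j]:
                    winners[j], bids[j] = z, y
                    changed = True
        if not changed:
            break
    return winners, bids


# Two drones, three tasks: each drone initially knows only its own bids.
winners, bids = run_consensus(
    [[0, 0, None], [None, 1, 1]],
    [[0.9, 0.4, 0.0], [0.0, 0.6, 0.8]],
)
```

In this toy run both drones end with the same winner list, so task 0 goes to drone 0 and tasks 1 and 2 go to drone 1.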

4. Deep Reinforcement Learning for Path Optimization

Once tasks are pre-assigned, the path for each drone is optimized using a DRL framework. The core of our architecture is a Graph Isomorphism Network (GIN) that extracts node features from the inspection graph. The GIN module encodes each tower node $i$ into an embedding $h_{ti}$, and the depot node into $h_d$. A global graph embedding $h_g$ is obtained by averaging all tower embeddings:

$$ h_g = \frac{1}{M} \sum_{i=1}^{M} h_{ti} $$

The joint embedding is formed by concatenating the global graph embedding and the depot embedding:

$$ h_c = [h_g : h_d] $$

We then compute the drone embedding $h_a$ with an attention mechanism in which the keys and values come from the tower embeddings and the query is derived from the joint embedding:

$$ k_i = \theta_k h_{ti},\quad v_i = \theta_v h_{ti},\quad q = \theta_q h_c $$
$$ u_i = \frac{q^T k_i}{\sqrt{d_k}},\quad \omega_i = \frac{e^{u_i}}{\sum_j e^{u_j}} $$
$$ h_a = \sum_i \omega_i v_i $$
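As a rough illustration of the pooling and attention equations above, the following pure-Python sketch computes $h_g$, $h_c$, the weights $\omega_i$, and $h_a$ for small vectors. The function names and the list-of-lists weight matrices are hypothetical, $[h_g : h_d]$ is read as concatenation, and the tower embeddings are taken as given rather than produced by a GIN.

```python
import math


def matvec(mat, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in mat]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def drone_embedding(tower_embs, depot_emb, theta_k, theta_v, theta_q):
    """Return (h_a, omega) for the single-head attention described above."""
    n, dim = len(tower_embs), len(tower_embs[0])
    # h_g: mean of all tower embeddings.
    h_g = [sum(h[d] for h in tower_embs) / n for d in range(dim)]
    # h_c: concatenation of h_g and h_d (assumed reading of [h_g : h_d]).
    h_c = h_g + list(depot_emb)
    q = matvec(theta_q, h_c)
    keys = [matvec(theta_k, h) for h in tower_embs]
    values = [matvec(theta_v, h) for h in tower_embs]
    d_k = len(q)
    # u_i = q^T k_i / sqrt(d_k), then omega = softmax(u).
    u = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
    omega = softmax(u)
    # h_a: attention-weighted sum of the values.
    h_a = [sum(w * v[d] for w, v in zip(omega, values)) for d in range(len(values[0]))]
    return h_a, omega


identity = [[1.0, 0.0], [0.0, 1.0]]
theta_q = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
h_a, omega = drone_embedding([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0], identity, identity, theta_q)
```

With these symmetric toy inputs both towers receive the same score, so the weights are equal and $h_a$ is the plain average of the two value vectors.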

The attention weights $\omega_i$ dynamically capture the relevance of each tower to the current drone state. Next, a second attention layer computes the probability of visiting each remaining tower:

$$ k_i' = \theta_{k'} h_{ti},\quad q' = \theta_{q'} h_a $$
$$ u_i' = \frac{{q'}^T k_i'}{\sqrt{d_{k'}}},\quad P_i = C \cdot \tanh(u_i') $$
$$ p_i = \frac{e^{P_i}}{\sum_{j} e^{P_j}} $$

where $p_i$ is the probability that the drone selects tower $i$ as the next destination, and the constant $C$ confines the logits $P_i$ to the interval $(-C, C)$. The policy is trained with a standard policy-gradient objective in which the reward is the negative of the maximum path length across all drones. The policy gradient is:

$$ \nabla_\theta L = \mathbb{E} \left[ \sum_{t} \nabla_\theta \log \pi_\theta(a_t|s_t) R(\lambda) \right] $$
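The selection probabilities and the reward that enters this gradient can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the authors' code: the function names are hypothetical, the default `clip=10.0` stands in for the unspecified constant $C$, and masking visited towers is inferred from the phrase "each remaining tower".

```python
import math


def next_tower_probs(logits, visited, clip=10.0):
    """Turn raw compatibilities u'_i into probabilities over unvisited towers.

    Applies P_i = C * tanh(u'_i), masks visited towers, then a softmax.
    Raises ValueError if every tower has already been visited.
    """
    clipped = [clip * math.tanh(u) for u in logits]
    top = max(c for c, done in zip(clipped, visited) if not done)
    exps = [0.0 if done else math.exp(c - top) for c, done in zip(clipped, visited)]
    total = sum(exps)
    return [e / total for e in exps]


def tour_length(depot, towers, order):
    """Euclidean length of depot -> towers in `order` -> depot."""
    points = [depot] + [towers[i] for i in order] + [depot]
    return sum(math.dist(a, b) for a, b in zip(points, points[1:]))


def episode_reward(depot, towers, tours):
    """Reward = negative of the longest tour over all drones."""
    return -max(tour_length(depot, towers, order) for order in tours)


probs = next_tower_probs([0.0, 0.0, 5.0], [False, False, True])
towers = [(1.0, 0.0), (1.0, 1.0), (0.0, 1.0), (0.0, 0.5)]
reward = episode_reward((0.0, 0.0), towers, [[0, 1, 2], [3]])
```

In the toy instance, the first drone flies around the unit square (length 4) and the second makes a round trip of length 1, so the reward is determined by the first drone alone. This is why a min-max objective pushes the policy toward balanced tours.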

The training algorithm is summarized in the table below.

**Training Procedure for the Policy Network**
Input	Policy network $\pi_\theta$, episodes $E$, batch size $B$, max steps $T$, learning rate $\alpha$, discount factor $\gamma$
Output	Optimized policy $\pi_{\theta'}$
1	Randomly initialize $\theta$
2	For epoch = 1 to $E$ do
3	    Initialize buffer $P$ as an empty list
4	    For instance = 1 to $B$ do
5	        Initialize state $s_0$, $t=0$
6	        Initialize trajectory $p$ as empty
7	        While $t < T$ do
8	            Sample action $a_t \sim \pi_\theta(a_t|s_t)$
9	            Obtain reward $r_t$ and next state $s_{t+1}$
10	            Store $(s_t, a_t, r_t, s_{t+1})$ in $p$
11	            $t \leftarrow t+1$
12	        End while
13	        Append $p$ to $P$
14	    End for
15	    Compute gradient $\nabla_\theta L$ from $P$
16	    Update parameters: $\theta \leftarrow \theta + \alpha \nabla_\theta L$
17	End for

The training environment uses randomly generated tower positions uniformly distributed in $(0,1)^2$. We consider two configurations: 3 UAVs with 30 towers and 6 UAVs with 60 towers. Hyperparameters include a learning rate of $1\times10^{-4}$, a batch size of 512, the Adam optimizer, and 2500 training iterations. The training curves show that both the average path length and the loss converge after about 2000 iterations, indicating that training is stable.

5. Experimental Results

We evaluate our method on both random synthetic datasets and real-world inspection scenarios. The baseline is the OR-Tools solver with a time limit of 1800 seconds. The number of towers varies from 200 to 500, with the fleet size fixed at 3 or 6 drones. The following tables compare the path lengths (normalized units) achieved by our method and by OR-Tools.

**Comparison of Path Length for 3 UAVs (Random Data)**
Method	200 towers	300 towers	400 towers	500 towers
OR-Tools	5.698	6.547	7.512	8.438
Our method	4.328	5.075	5.744	6.186

**Comparison of Path Length for 6 UAVs (Random Data)**
Method	200 towers	300 towers	400 towers	500 towers
OR-Tools	5.711	7.122	7.749	9.058
Our method	3.044	3.511	3.941	4.198

For the 3-drone case with 500 towers, our method reduces the path length by approximately 26.7% compared to OR-Tools. For 6 drones and 500 towers, the reduction reaches 53.7%. These results demonstrate that our DRL framework scales much better with problem size and consistently outperforms the classical solver.
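The quoted reductions follow directly from the two tables above. The short script below recomputes them as a sanity check; the helper name `reduction_pct` and the dictionary layout are illustrative choices, while the numbers are copied from the tables.

```python
def reduction_pct(baseline, ours):
    """Relative path-length reduction of `ours` versus `baseline`, in percent."""
    return 100.0 * (baseline - ours) / baseline


# Path lengths from the random-data tables, keyed by number of towers.
or_tools = {
    3: {200: 5.698, 300: 6.547, 400: 7.512, 500: 8.438},
    6: {200: 5.711, 300: 7.122, 400: 7.749, 500: 9.058},
}
ours = {
    3: {200: 4.328, 300: 5.075, 400: 5.744, 500: 6.186},
    6: {200: 3.044, 300: 3.511, 400: 3.941, 500: 4.198},
}

for n_uavs in (3, 6):
    for n_towers in (200, 300, 400, 500):
        pct = reduction_pct(or_tools[n_uavs][n_towers], ours[n_uavs][n_towers])
        print(f"{n_uavs} UAVs, {n_towers} towers: {pct:.1f}% shorter")
```

The 500-tower rows reproduce the 26.7% and 53.7% figures quoted in the text.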

We also test on a real-world inspection region partitioned into 16 subareas via K-means clustering. Table 4 shows the path lengths for selected representative areas and the entire region (ALL).

**Path Length Comparison on Real-World Data**
Method	Area 3	Area 6	Area 10	Entire region (ALL)
OR-Tools	0.631	0.479	0.998	9.231
Genetic algorithm	1.211	2.088	4.888	–
Our method	0.355	0.199	0.511	4.221

Our method achieves a path length of 4.221 for the entire area, which is 54.3% shorter than OR-Tools (9.231). The improvement highlights the ability of our framework to avoid local optima by jointly learning task allocation and routing. The computation time of our method stays below 0.5 seconds in every area, whereas OR-Tools often hits the 1800-second limit, a gap of more than three orders of magnitude in those cases. This speedup comes from the feedforward nature of the trained neural network, which eliminates iterative search at test time. The scalability experiments (Figures 8 and 9 in the original study) confirm that as the number of towers increases, our method maintains stable solution quality while OR-Tools degrades quickly.

6. Conclusion

In this paper, we proposed a deep reinforcement learning framework that integrates CBBA and GIN for multi-UAV task allocation and path optimization in power line inspection. The GIN encoder captures topological relationships among towers, and the attention-driven policy network enables adaptive decision making. Experimental results on random and real-world datasets show that our method outperforms OR-Tools in both solution quality (up to 54.3% reduction in path length) and computational efficiency (under 0.5 seconds, compared with a solver that often reaches its 1800-second limit). The proposed approach offers a scalable and practical solution for China’s smart grid inspection, where a UAV fleet must patrol thousands of towers in a coordinated manner. Future work will focus on dynamic environment adaptation, such as handling unexpected failures or recharging, and on extending the method to heterogeneous drone fleets. The integration of incremental learning and edge computing could further enhance the real-time applicability of this method in the field.