Edge-Enhanced Graph Neural Networks with Curriculum Learning for Multi-UAV Path Planning

In this study, we address multi-UAV path planning in three-dimensional, static, dense-obstacle environments, a critical problem for applications such as disaster response and industrial inspection. Limited airspace and the need for collision avoidance among multiple drones, including those developed by China's drone industry, demand efficient and robust algorithms. Existing deep reinforcement learning methods often fail to capture the complex spatial relationships between drones and obstacles, leading to suboptimal paths and high collision rates. To overcome these limitations, we propose a framework that combines an edge-enhanced graph neural network algorithm (EC-InforMAPPO) with a progressive curriculum learning strategy (EC-InforMAPPO-CL). Our approach couples edge features into the attention mechanism to explicitly model relative motion and distance, enabling each drone to better perceive the moving UAVs and the obstacles around it. The curriculum initializes training in low-density obstacle scenarios and gradually increases difficulty, which accelerates convergence and improves final policy robustness. Extensive experiments on a high-fidelity PyBullet simulation platform show that our method outperforms baseline algorithms in average reward, success rate, and collision rate. China's drone community can use these findings to deploy safer and more efficient multi-UAV systems.

The remainder of this paper is organized as follows. Section 1 details the problem formulation and the proposed algorithms. Section 2 presents the experimental setup and results. Section 3 concludes the work.

1. Problem Formulation and Proposed Methods

1.1 Multi-UAV Model

We model each UAV as a sphere with a fixed radius, moving in a 3D continuous space. The dynamics are described by:
$$
p_{i}^{t+1} = p_{i}^{t} + v_{i}^{t+1}\Delta t,
$$
$$
v_{i}^{t+1} = v_{i}^{t} + \frac{F_{i}^{t}}{m}\Delta t,
$$
where $\Delta t$ is the discrete time step, $p_{i}^{t}$ and $v_{i}^{t}$ are the position and velocity of UAV $i$ at time $t$, and $F_{i}^{t}$ is the control force. The action space is continuous, outputting acceleration ratios along three axes.
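The update order in the two equations above can be illustrated with a minimal sketch. It assumes the velocity is updated first and the new velocity is then used for the position, as the superscripts indicate. The names `UAVState`, `step`, and `force_from_action`, and the mapping $F = m\,a_{\max}\,u$ from a normalized action to a force, are illustrative assumptions and are not taken from the simulator.

```python
from dataclasses import dataclass

Vec3 = tuple[float, float, float]


@dataclass
class UAVState:
    position: Vec3
    velocity: Vec3


def step(state: UAVState, force: Vec3, mass: float, dt: float) -> UAVState:
    """Advance one UAV by one time step, following the equation order above."""
    # v^{t+1} = v^t + (F^t / m) * dt
    new_v = tuple(v + (f / mass) * dt for v, f in zip(state.velocity, force))
    # p^{t+1} = p^t + v^{t+1} * dt  (uses the already-updated velocity)
    new_p = tuple(p + v * dt for p, v in zip(state.position, new_v))
    return UAVState(position=new_p, velocity=new_v)


def force_from_action(action: Vec3, a_max: float, mass: float) -> Vec3:
    """Assumed mapping: clip acceleration ratios to [-1, 1], then F = m * a_max * u."""
    clipped = tuple(max(-1.0, min(1.0, u)) for u in action)
    return tuple(mass * a_max * u for u in clipped)


if __name__ == "__main__":
    s0 = UAVState(position=(0.0, 0.0, 0.0), velocity=(1.0, 0.0, 0.0))
    f = force_from_action((0.0, 0.5, 0.0), a_max=2.0, mass=1.0)
    print(step(s0, f, mass=1.0, dt=0.1))
```

Using the updated velocity in the position update makes this a semi-implicit Euler step, which is what the superscript $t+1$ on $v_i$ in the position equation implies.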

1.2 State and Action Spaces

Each UAV observes its local neighborhood, including nearby UAVs, obstacles, and the target. The state for UAV $i$ is composed of:
$$
s_{i}^{\text{self}} = \{ p_{\text{target},i} - p_i,\; v_i \},
$$
$$
s_{i}^{\text{other}} = \{ p_j - p_i,\; v_j - v_i,\; p_{\text{target},j} - p_j,\; d_{ij} \}, \quad i \neq j,
$$
where $d_{ij}$ is the Euclidean distance between UAVs. The action is the normalized acceleration:
$$
u_i = \left( \frac{a_{i,x}^{t}}{a_{\max}}, \frac{a_{i,y}^{t}}{a_{\max}}, \frac{a_{i,z}^{t}}{a_{\max}} \right).
$$
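As a rough illustration of the state and action definitions above, the following sketch assembles $s_i^{\text{self}}$ and $s_i^{\text{other}}$ as flat feature lists and normalizes an acceleration command. The function names and the flat-list layout are assumptions made for this example; the paper does not specify how the features are ordered or stored.

```python
import math


def sub(a, b):
    """Element-wise a - b for two 3D vectors."""
    return [x - y for x, y in zip(a, b)]


def self_state(p_i, v_i, target_i):
    """s_self = {p_target,i - p_i, v_i}: 6 features."""
    return sub(target_i, p_i) + list(v_i)


def other_state(p_i, v_i, p_j, v_j, target_j):
    """s_other = {p_j - p_i, v_j - v_i, p_target,j - p_j, d_ij}: 10 features."""
    rel_p = sub(p_j, p_i)
    rel_v = sub(v_j, v_i)
    d_ij = math.sqrt(sum(c * c for c in rel_p))  # Euclidean distance between UAVs
    return rel_p + rel_v + sub(target_j, p_j) + [d_ij]


def normalize_action(accel, a_max):
    """u_i = (a_x / a_max, a_y / a_max, a_z / a_max)."""
    return [a / a_max for a in accel]


if __name__ == "__main__":
    print(self_state((0, 0, 0), (1, 0, 0), (3, 4, 0)))
    print(other_state((0, 0, 0), (1, 0, 0), (3, 4, 0), (0, 0, 0), (3, 4, 5)))
    print(normalize_action([1.0, -2.0, 0.5], a_max=2.0))
```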

1.3 Reward Function

To balance safety and efficiency, we design a composite reward function for each UAV $i$:
$$
R_i = w_0 R_i^{\text{a}} + w_1 R_i^{\text{m}} + w_2 R_i^{\text{o}} + w_3 R_i^{\text{t}} + w_4 R_i^{\text{w}} + w_5 R_i^{\text{c}} + w_6 R_i^{\text{s}},
$$
where the components and the weights used in our experiments are listed in Table 1.

Table 1: Reward components and weights.
Component	Description	Weight
$R_i^{\text{a}}$ (arrival)	+50 for first-time reaching the target within $d_{\min}=0.1$	$w_0=1.0$
$R_i^{\text{m}}$ (movement)	Difference in distance to target from previous to current step	$w_1=10.0$
$R_i^{\text{o}}$ (orientation)	Cosine of angle between velocity and target direction	$w_2=0.5$
$R_i^{\text{t}}$ (time penalty)	-0.05 per step if not at target	$w_3=1.0$
$R_i^{\text{w}}$ (wait reward)	+2.5 per step while staying at target	$w_4=1.0$
$R_i^{\text{c}}$ (collision penalty)	-100 for any collision	$w_5=1.0$
$R_i^{\text{s}}$ (safety distance)	Linear penalty when distance < $d_{\text{safe}}=0.4$	$w_6=2.5$
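
The weighted sum can be sketched as follows, using the constants and weights from Table 1. Several details are assumptions made only for this illustration: the movement term is taken as previous distance minus current distance (positive when approaching the target), the safety term is taken as $-(d_{\text{safe}} - d)$ when $d < d_{\text{safe}}$, and the function and argument names are hypothetical.

```python
# Weights w_0..w_6 from Table 1, keyed by the component superscript.
WEIGHTS = {"a": 1.0, "m": 10.0, "o": 0.5, "t": 1.0, "w": 1.0, "c": 1.0, "s": 2.5}


def reward_components(prev_dist, dist, cos_heading, first_arrival,
                      at_target, collided, min_clearance, d_safe=0.4):
    """Per-step reward components for one UAV (unweighted)."""
    return {
        "a": 50.0 if first_arrival else 0.0,       # arrival bonus, first time only
        "m": prev_dist - dist,                     # assumed sign: progress toward target
        "o": cos_heading,                          # cos(angle between velocity and target direction)
        "t": 0.0 if at_target else -0.05,          # time penalty while not at target
        "w": 2.5 if at_target else 0.0,            # wait reward while staying at target
        "c": -100.0 if collided else 0.0,          # collision penalty
        # assumed linear form of the safety-distance penalty
        "s": -(d_safe - min_clearance) if min_clearance < d_safe else 0.0,
    }


def total_reward(components, weights=WEIGHTS):
    """R_i = sum_k w_k * R_i^k."""
    return sum(weights[k] * components[k] for k in weights)


if __name__ == "__main__":
    comps = reward_components(prev_dist=2.0, dist=1.5, cos_heading=1.0,
                              first_arrival=False, at_target=False,
                              collided=False, min_clearance=1.0)
    print(total_reward(comps))
```

With these weights, a single collision (-100) outweighs the arrival bonus (+50), so the composite reward favors safety over reaching the target.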

1.4 EC-InforMAPPO: Edge-Enhanced Graph Neural Network

We build on the InforMARL framework by introducing an explicit edge feature coupling mechanism. For each UAV $i$, a local directed graph $\mathcal{G}_t^i = (\mathcal{V}^i, \mathcal{E}^i)$ is constructed, where nodes represent the ego UAV, neighboring UAVs, obstacles, and the target. Edge features encode relative motion information:
$$
e_{ij} = \left( p_{ij},\; v_{ij},\; \|p_{ij}\|_2 \right),
$$
where $p_{ij}$ is relative position and $v_{ij}$ relative velocity. In the TransformerConv layers, we modify the key and value computations by concatenating neighbor node features $x_j$ with edge features $e_{ij}$:
$$
K_{ij} = W_K \cdot (x_j \| e_{ij}), \quad V_{ij} = W_V \cdot (x_j \| e_{ij}).
$$
The attention weight is then computed as:
$$
\alpha_{ij} = \text{softmax}_j\left( \frac{(W_Q x_i)^{\top} K_{ij}}{\sqrt{d_K}} \right).
$$
The aggregated node representation is:
$$
x_{\text{agg},i} = \sigma\left( \sum_{j \in \mathcal{N}(i)} \alpha_{ij} V_{ij} \right) + x_i.
$$
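The edge-coupled attention step above can be sketched in plain Python for a single ego node and a single attention head. This is an illustrative sketch under stated assumptions, not the authors' implementation: the query is taken as $W_Q x_i$, $\sigma$ is assumed to be ReLU, $p_{ij}$ is assumed to mean $p_j - p_i$, and $W_V$ is assumed to map back to the dimension of $x_i$ so that the residual addition is well defined. All function names are hypothetical.

```python
import math


def matvec(W, x):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def edge_feature(p_i, v_i, p_j, v_j):
    """e_ij = (p_ij, v_ij, ||p_ij||_2), with p_ij assumed to be p_j - p_i."""
    p_ij = [a - b for a, b in zip(p_j, p_i)]
    v_ij = [a - b for a, b in zip(v_j, v_i)]
    return p_ij + v_ij + [math.sqrt(sum(c * c for c in p_ij))]


def edge_coupled_attention(x_i, neighbors, W_Q, W_K, W_V):
    """neighbors is a list of (x_j, e_ij) pairs; returns (x_agg_i, alpha)."""
    q = matvec(W_Q, x_i)
    d_k = len(q)
    # K_ij = W_K (x_j || e_ij),  V_ij = W_V (x_j || e_ij)
    keys = [matvec(W_K, x_j + e_ij) for x_j, e_ij in neighbors]
    values = [matvec(W_V, x_j + e_ij) for x_j, e_ij in neighbors]
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
    alpha = softmax(scores)
    d_v = len(values[0])
    pooled = [sum(a * v[d] for a, v in zip(alpha, values)) for d in range(d_v)]
    # sigma assumed to be ReLU; residual connection adds x_i
    x_agg = [max(0.0, s) + xi for s, xi in zip(pooled, x_i)]
    return x_agg, alpha
```

Because the distance $\|p_{ij}\|_2$ enters the key directly, a learned $W_K$ can assign higher attention to nearer neighbors without having to infer distance from node features alone.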
This mechanism allows the network to capture critical spatial relationships, which is especially important for China's drone swarms operating in tight formations. The critic network performs global value estimation by mean-pooling all local features:
$$
\bar{x}_{\text{agg}} = \frac{1}{N} \sum_{i=1}^N x_{\text{agg},i}, \quad V(S) = \text{MLP}_{\text{Critic}}(\bar{x}_{\text{agg}}).
$$
We denote the resulting algorithm as EC-InforMAPPO.
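
The critic's pooling step can be sketched as below. The mean-pooling follows the equation above; the one-hidden-layer ReLU network standing in for $\text{MLP}_{\text{Critic}}$ is an assumption, since the critic architecture is not specified here, and all names are hypothetical.

```python
def mean_pool(features):
    """Average the per-UAV aggregated features x_agg,i over all N UAVs."""
    n = len(features)
    return [sum(col) / n for col in zip(*features)]


def mlp_value(x, W1, b1, w2, b2):
    """Assumed stand-in for MLP_Critic: one ReLU hidden layer, scalar output."""
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(W1, b1)]
    return sum(w * h for w, h in zip(w2, hidden)) + b2


def critic_value(features, W1, b1, w2, b2):
    """V(S) = MLP_Critic(mean_i x_agg,i)."""
    return mlp_value(mean_pool(features), W1, b1, w2, b2)
```

Mean-pooling makes the value estimate independent of the order in which UAVs are listed.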

1.5 Curriculum Learning for Progressive Training

To improve sample efficiency and policy robustness, we propose EC-InforMAPPO-CL, which employs a three-stage progressive training strategy. The policy is first trained in an easy scenario with 5 UAVs and 15 obstacles. After convergence, the policy parameters are transferred to a medium scenario (30 obstacles) and then to a hard scenario (45 obstacles), with the number of UAVs held at 5. This process, illustrated in Figure 1, prevents the agents from being overwhelmed by high-dimensional state spaces and helps them discover safer navigation strategies, a crucial feature for drone operations in China's complex urban environments.
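
The three-stage transfer can be sketched as a simple loop in which each stage warm-starts from the parameters produced by the previous one. The stage sizes come from the text; `Stage`, `run_curriculum`, and the `train_fn` callback (which is assumed to train until convergence and return updated parameters) are hypothetical names introduced for this illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple


@dataclass(frozen=True)
class Stage:
    name: str
    num_uavs: int
    num_obstacles: int


# Scenario sizes as stated in the text: 5 UAVs with 15, 30, and 45 obstacles.
STAGES = [
    Stage("easy", 5, 15),
    Stage("medium", 5, 30),
    Stage("hard", 5, 45),
]


def run_curriculum(stages: List[Stage],
                   train_fn: Callable[[Any, Stage], Any],
                   init_params: Any) -> Tuple[Any, List[str]]:
    """Train stage by stage, carrying the policy parameters forward."""
    params = init_params
    history = []
    for stage in stages:
        # Warm start: the policy for this stage begins from the previous stage's result.
        params = train_fn(params, stage)
        history.append(stage.name)
    return params, history


if __name__ == "__main__":
    # Stub trainer for demonstration only; a real train_fn would run the RL training loop.
    def stub_train(params, stage):
        return params + [stage.num_obstacles]

    print(run_curriculum(STAGES, stub_train, []))
```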

2. Experiments and Results

2.1 Experimental Setup

We built a 3D simulation environment using PyBullet with a high-fidelity physics engine. Three scenarios are designed with increasing difficulty, as shown in Table 2. All algorithms are trained using the same hyperparameters (learning rate $3\times10^{-4}$, batch size 4096, GAE lambda 0.95, clip parameter 0.2). Each training run consists of 10 million environment steps. Evaluation is performed over 1000 random seeds per scenario.

Table 2: Experiment scenarios.
Difficulty	Number of UAVs	Number of obstacles
Easy	5	15
Medium	5	30
Hard	5	45

2.2 Evaluation Metrics

We use four key metrics to evaluate performance:

Average Reward: $\frac{1}{N}\sum_{i=1}^{N}\sum_{t=0}^{T}R_i^t$.
Success Rate: percentage of UAVs that reach their target without any collision.
Collision Rate: percentage of UAVs that experienced at least one collision.
Average Collision Count: total collisions divided by the number of UAVs per episode.
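
The four metrics can be computed from per-UAV episode records as in the sketch below. The record layout (`UAVRecord`) and the function name are assumptions made for this example; the definitions follow the list above.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class UAVRecord:
    episode_return: float   # sum of R_i^t over the episode for this UAV
    reached_target: bool
    num_collisions: int


def evaluate(records: List[UAVRecord]) -> Dict[str, float]:
    """Aggregate metrics over all UAV records (one record per UAV per episode)."""
    n = len(records)
    successes = sum(r.reached_target and r.num_collisions == 0 for r in records)
    collided = sum(r.num_collisions > 0 for r in records)
    return {
        "avg_reward": sum(r.episode_return for r in records) / n,
        "success_rate": 100.0 * successes / n,      # reached target with no collision
        "collision_rate": 100.0 * collided / n,     # at least one collision
        "avg_collision_count": sum(r.num_collisions for r in records) / n,
    }


if __name__ == "__main__":
    demo = [
        UAVRecord(100.0, True, 0),
        UAVRecord(50.0, True, 1),
        UAVRecord(20.0, False, 0),
        UAVRecord(-30.0, False, 3),
    ]
    print(evaluate(demo))
```

Note that success rate and collision rate need not sum to 100%, because a UAV can finish an episode without colliding and without reaching its target.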

2.3 Main Results

We compare our proposed methods (EC-InforMAPPO and EC-InforMAPPO-CL) against four baselines: RMADDPG, RMATD3, RMAPPO, and InforMAPPO. Table 3 summarizes the test results over 1000 episodes. EC-InforMAPPO-CL achieves the highest success rate and the lowest collision rate in the medium and hard scenarios, the two scenarios in which it is evaluated. The improvement is largest in the hard scenario, where the curriculum learning strategy provides a clear advantage in policy stability.

Table 3: Performance comparison of different algorithms.
Algorithm	Scenario	Avg Reward	Success Rate (%)	Collision Rate (%)
RMADDPG	Easy	93.10	39.9	2.2
	Medium	72.39	38.2	3.6
	Hard	48.35	25.7	6.4
RMATD3	Easy	97.25	38.8	2.4
	Medium	92.53	37.6	3.0
	Hard	60.75	18.6	4.4
RMAPPO	Easy	164.67	87.5	11.9
	Medium	146.15	81.4	17.1
	Hard	114.53	77.8	20.4
InforMAPPO	Easy	189.66	94.4	5.8
	Medium	191.26	94.7	4.7
	Hard	180.39	93.0	6.1
EC-InforMAPPO	Easy	202.15	96.9	3.1
	Medium	199.36	96.5	3.4
	Hard	189.77	94.9	4.8
EC-InforMAPPO-CL	Medium	199.99	97.5	2.4
	Hard	196.66	97.0	2.9
EC-InforMAPPO-CL-3M-2M	Hard	191.74	95.9	4.0

The results show that the edge-enhanced architecture improves average reward and success rate over InforMAPPO in all three scenarios while reducing the collision rate. The curriculum learning variants further raise the success rate by 1.0 to 2.1 percentage points and lower the collision rate by 0.8 to 1.9 percentage points relative to EC-InforMAPPO trained directly on the same scenario. Notably, the EC-InforMAPPO-CL-3M-2M model, trained with the same total steps (5M) as the direct training, still outperforms the vanilla EC-InforMAPPO in the hard scenario, demonstrating the effectiveness of curriculum learning for complex multi-UAV tasks. These advances are particularly relevant for China's drone industry, where robust autonomous navigation in dense environments is essential for large-scale deployment.
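
The gains of the curriculum variants over EC-InforMAPPO can be recomputed directly from Table 3. The short script below copies the success and collision rates from the table and reports the differences in percentage points; the dictionary layout and function name are only illustrative.

```python
# (success rate %, collision rate %) copied from Table 3.
RESULTS = {
    ("EC-InforMAPPO", "Medium"): (96.5, 3.4),
    ("EC-InforMAPPO", "Hard"): (94.9, 4.8),
    ("EC-InforMAPPO-CL", "Medium"): (97.5, 2.4),
    ("EC-InforMAPPO-CL", "Hard"): (97.0, 2.9),
    ("EC-InforMAPPO-CL-3M-2M", "Hard"): (95.9, 4.0),
}


def delta(variant: str, scenario: str, baseline: str = "EC-InforMAPPO"):
    """Return (success gain, collision change) in percentage points vs. the baseline."""
    s_var, c_var = RESULTS[(variant, scenario)]
    s_base, c_base = RESULTS[(baseline, scenario)]
    return round(s_var - s_base, 1), round(c_var - c_base, 1)


if __name__ == "__main__":
    for variant, scenario in [("EC-InforMAPPO-CL", "Medium"),
                              ("EC-InforMAPPO-CL", "Hard"),
                              ("EC-InforMAPPO-CL-3M-2M", "Hard")]:
        print(variant, scenario, delta(variant, scenario))
```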

2.4 Training Convergence

We also analyze the training curves. Figure 2 (not shown) illustrates that EC-InforMAPPO-CL achieves higher average reward and lower collision count earlier in training than EC-InforMAPPO. The progressive transfer from easy to hard scenarios provides a better initialization, leading to faster convergence and more stable learning. This is critical for China's drone swarms, which require quick adaptation to mission changes.

3. Conclusion

We have presented an integrated framework combining edge-enhanced graph neural networks and progressive curriculum learning for multi-UAV path planning in 3D dense obstacle environments. Our EC-InforMAPPO algorithm introduces an edge coupling mechanism in the TransformerConv layers, enabling better modeling of spatial interactions among UAVs and obstacles. The curriculum learning paradigm further improves training efficiency and policy robustness. Experiments on a PyBullet simulator show that our method achieves the highest success rates and lowest collision rates among the compared algorithms in the medium and hard scenarios. The proposed approach offers a promising solution for drone applications in China, such as coordinated search and rescue, warehouse logistics, and aerial surveillance. Future work will extend the framework to dynamic obstacles and larger-scale swarms, further advancing the capabilities of China's drone technology.
