Multi-UAV Path Planning via Edge-Enhanced Graph Neural Networks and Curriculum Reinforcement Learning

In this study, we present a novel multi-agent reinforcement learning (MARL) framework that integrates edge-feature coupled graph neural networks (GNNs) with progressive curriculum learning to address the challenge of three-dimensional path planning for multiple unmanned aerial vehicles (UAVs) operating in dense static obstacle environments. The proposed approach is designed to overcome the limitations of conventional global planning methods that lack adaptability to dynamic configurations and fail to capture intricate spatial relationships among UAVs and obstacles. By explicitly modeling the edge relations between UAVs and obstacles using a tailored attention mechanism, we enhance local perception and enable more informed decision-making. Furthermore, a three-stage curriculum training strategy progressively increases obstacle density, accelerating convergence and improving policy robustness. Extensive simulations conducted on the PyBullet physics platform demonstrate that our framework significantly reduces average path length, boosts obstacle avoidance success rates, and enhances overall mission completion efficiency. The proposed method offers a practical and scalable solution for multi-UAV cooperative navigation, with strong potential for deployment in real-world China UAV applications such as disaster response, industrial inspection, and logistics. This work also provides a valuable experimental platform for teaching and research in multi-agent systems and deep reinforcement learning.

1. Introduction

The rapid proliferation of China UAV technologies has unlocked unprecedented opportunities across a wide spectrum of civil and military domains. From precision agriculture and aerial surveillance to rapid package delivery and emergency search-and-rescue, UAVs have become indispensable tools. However, as mission complexity escalates, the limitations of single-UAV operations — such as restricted payload capacity, limited endurance, and vulnerability to failures — become increasingly apparent. Consequently, multi-UAV collaborative systems have emerged as a powerful paradigm to enhance efficiency, redundancy, and coverage. Despite these advantages, the core challenge of multi-UAV path planning remains formidable: each agent must navigate to its designated target while avoiding collisions with static obstacles, dynamic obstacles, and other UAVs, all within a shared three-dimensional airspace. Traditional approaches, including A*, rapidly-exploring random trees (RRT), and potential field methods, often struggle in highly dynamic and unstructured environments due to their reliance on complete prior knowledge and limited adaptability. Deep reinforcement learning (DRL), particularly single-agent algorithms, has shown promise but falls short in handling the coupled decision-making inherent in multi-agent settings. The centralized training with decentralized execution (CTDE) paradigm, exemplified by multi-agent proximal policy optimization (MAPPO), offers a principled framework for coordinating multiple agents while maintaining scalability. Nevertheless, standard MAPPO implementations that rely on simple concatenation of agent observations fail to capture the rich structural dependencies within the environment, such as the relative positions and velocities between agents and obstacles. To address this gap, we propose augmenting MAPPO with a graph neural network (GNN) that processes observations as dynamic graphs, thereby enabling the agent to reason about spatial configurations more effectively. Moreover, we introduce an edge-coupling mechanism that fuses edge features (e.g., relative distance and velocity) directly into the attention computation, significantly improving the model’s ability to distinguish between critical interactions. Finally, we incorporate a curriculum learning (CL) strategy that progressively increases obstacle density during training, guiding the agent from simple to complex scenarios and resulting in faster convergence and superior final performance. Our contributions are threefold: (1) we design an edge-enhanced GNN architecture that explicitly couples node and edge features through concatenation and nonlinear transformation; (2) we construct a three-stage progressive curriculum training scheme that gradually transfers learned policies across environments of increasing difficulty; (3) we validate our approach through extensive experiments on a high-fidelity 3D multi-UAV simulation platform, demonstrating state-of-the-art performance in terms of success rate, collision rate, and average reward. This work not only advances the state of the art in multi-UAV path planning but also provides an accessible experimental framework for educational purposes, enabling students and researchers to explore the intersection of China UAV systems, graph neural networks, and reinforcement learning.

2. Methodology

2.1 Problem Formulation

We formulate the multi-UAV path planning problem as a decentralized partially observable Markov decision process (Dec-POMDP) consisting of N agents. At each discrete time step t, agent i receives a local observation o_i^t that includes its own state and the states of neighboring entities within a fixed sensor range. Based on this observation, the agent selects an action a_i^t according to its stochastic policy π_i. After all agents execute their actions, the environment transitions to the next state and each agent receives a scalar reward r_i^t. The objective is to maximize the expected cumulative discounted reward over the episode length T. The motion model for each UAV is governed by the following discrete-time dynamics:

$$
\begin{aligned}
p_i^{t+1} &= p_i^t + v_i^{t+1} \Delta t \\
v_i^{t+1} &= v_i^t + \frac{F_i^t}{m} \Delta t
\end{aligned}
$$

where p_i^t ∈ ℝ³ and v_i^t ∈ ℝ³ denote the position and velocity of UAV i at time t, F_i^t is the applied force, m is the mass, and Δt is the simulation time step.

2.2 State and Action Spaces

The local observation o_i of UAV i is composed of two parts: self-state s_self,i and neighbor-state s_other,i:

$$
\begin{aligned}
s_{\text{self},i} &= \{ p_{\text{target},i} – p_i,\; v_i \} \\
s_{\text{other},i} &= \{ p_j – p_i,\; v_j – v_i,\; p_{\text{target},j} – p_j,\; d_{ij} \} \quad (j \neq i)
\end{aligned}
$$

where d_ij = ||p_j − p_i||₂ is the Euclidean distance. The action space is continuous, with each dimension representing the ratio of acceleration to the maximum acceleration a_max along three axes:

$$
u_i = \left( \frac{a_{i,x}^t}{a_{\max}},\; \frac{a_{i,y}^t}{a_{\max}},\; \frac{a_{i,z}^t}{a_{\max}} \right)
$$

The actual acceleration is sampled from a diagonal Gaussian distribution and bounded by a tanh function to ensure smooth control.

2.3 Reward Function Design

The reward function is a weighted sum of seven components, each designed to encourage desirable behaviors while penalizing collisions and inefficient flight:

$$
R_i = w_0 R_i^{\text{arrival}} + w_1 R_i^{\text{move}} + w_2 R_i^{\text{orientation}} + w_3 R_i^{\text{time}} + w_4 R_i^{\text{hold}} + w_5 R_i^{\text{collision}} + w_6 R_i^{\text{safety}}
$$

The weights are set empirically as w = [1.0, 10.0, 0.5, 1.0, 1.0, 1.0, 2.5]. Detailed definitions are as follows:

$$
R_i^{\text{arrival}} = \begin{cases}
50, & \text{if first time } d_{\text{target},i}^t \leq d_{\min} \\
0, & \text{otherwise}
\end{cases}
$$

$$
R_i^{\text{move}} = \begin{cases}
d_{\text{target},i}^{t-1} – d_{\text{target},i}^t, & \text{if } d_{\text{target},i}^t > d_{\min} \\
0, & \text{otherwise}
\end{cases}
$$

$$
R_i^{\text{orientation}} = \cos(\theta_{\text{target},i})
$$

$$
R_i^{\text{time}} = \begin{cases}
-0.05, & \text{if } d_{\text{target},i}^t > d_{\min} \\
0, & \text{otherwise}
\end{cases}
$$

$$
R_i^{\text{hold}} = \begin{cases}
2.5, & \text{if } d_{\text{target},i}^t \leq d_{\min} \\
0, & \text{otherwise}
\end{cases}
$$

$$
R_i^{\text{collision}} = \begin{cases}
-100, & \text{if collision} \\
0, & \text{otherwise}
\end{cases}
$$

$$
R_i^{\text{safety}} = \begin{cases}
(d_{\text{other},i} – d_{\text{safe}}), & \text{if } d_{\text{other},i} \leq d_{\text{safe}} \\
0, & \text{otherwise}
\end{cases}
$$

where d_min = 0.1 and d_safe = 0.4. This combination provides dense reward signals to facilitate learning in sparse-reward settings.

2.4 Edge-Enhanced Graph Neural Network (EC-InforMAPPO)

Traditional MARL algorithms like MAPPO treat each agent’s observation as a flat vector, discarding the relational structure. In contrast, we adopt the Informative Multi-Agent RL (InforMARL) paradigm that constructs a local graph for each agent, where nodes represent the agent itself, neighboring UAVs, obstacles, and target points. Edges encode relative kinematic information. However, the original InforMARL linearly adds edge features to node features in the TransformerConv layer, which limits expressiveness. We propose an Edge-Coupled variant that concatenates edge features with node features before computing attention weights.

Formally, for agent i, we define the feature vector of neighbor node j as x_j and the edge feature as e_ij = [p_ij, v_ij, ||p_ij||₂] where p_ij = p_j − p_i and v_ij = v_j − v_i. The key and value vectors are computed as:

$$
K_{ij} = W_K \cdot \text{Concat}(x_j,\, e_{ij}), \quad V_{ij} = W_V \cdot \text{Concat}(x_j,\, e_{ij})
$$

where W_K and W_V are learnable weight matrices. The query vector for the central node i is Q_i = W_Q·x_i. The attention coefficient is then:

$$
\alpha_{ij} = \text{softmax}_j\left( \frac{Q_i^\top K_{ij}}{\sqrt{d_K}} \right)
$$

The aggregated feature is obtained by weighted sum of the values, followed by a residual connection:

$$
x_{\text{agg},i} = \sigma\left( \sum_{j \in \mathcal{N}(i)} \alpha_{ij} V_{ij} \right) + x_i
$$

where σ is an activation function (e.g., ReLU) and 𝒩(i) denotes the set of neighbors within the sensor range. This edge-coupling mechanism allows the model to directly modulate attention based on relative distances and velocities, significantly improving obstacle awareness. The architecture is stacked with multiple layers to enable multi-hop reasoning. Finally, the aggregated feature is concatenated with the agent’s own observation and fed into the actor network to produce a Gaussian action distribution:

$$
\mu_i,\; \sigma_i = \text{MLP}_{\text{Actor}}(\text{Concat}(x_{\text{agg},i},\, o_i))
$$

$$
a_i = \tanh(u_i),\quad u_i \sim \mathcal{N}(\mu_i,\, \sigma_i)
$$

For the critic, we perform mean pooling over all agents’ aggregated features to obtain a global state representation:

$$
X_{\text{agg}} = \frac{1}{N} \sum_{i=1}^{N} x_{\text{agg},i}
$$

The value function is then V(S) = MLP_Critic(X_agg). This design ensures permutation invariance and scalability to varying numbers of agents.

2.5 Progressive Curriculum Learning (EC-InforMAPPO-CL)

Training MARL directly in complex environments with dense obstacles often leads to slow convergence or suboptimal policies due to the vast exploration space. Inspired by curriculum learning, we propose a three-stage progressive training schedule:

Stage 1 (Easy): 5 UAVs with 15 obstacles (low density). Agents learn basic obstacle avoidance and goal-seeking behaviors.
Stage 2 (Medium): 5 UAVs with 30 obstacles (moderate density). The policy from Stage 1 is used as initialization.
Stage 3 (Hard): 5 UAVs with 45 obstacles (high density). The policy from Stage 2 is further fine-tuned.

Each stage runs for a fixed number of environment steps (e.g., 5 million total). The transfer of parameters accelerates convergence and yields more robust policies. We denote the curriculum-trained variant as EC-InforMAPPO-CL. The parameter count and training details are summarized in Table 2.

3. Simulation Environment and Experimental Setup

We implement our 3D multi-UAV path planning environment using PyBullet, an open-source physics engine that provides realistic rigid-body dynamics and collision detection. The environment is configured as follows:

UAVs: Modeled as spheres with radius 0.2 units. Each UAV has a sensor range of 2.0 units within which it can detect other entities.
Obstacles: Static spheres with radius 0.4 units, randomly placed in a 3D bounding box of size 6 × 6 × 2 (units). Obstacle density varies per difficulty level.
Targets: Assigned uniquely to each UAV, randomly generated at a distance of at least 2 units from the starting position.
Episode: Maximum length of 100 time steps. An episode terminates if all UAVs reach their targets or if the time limit is reached.

Figure above illustrates a typical scenario with multiple UAVs navigating through a dense obstacle field. The red spheres represent obstacles, while the colored spheres denote UAVs with their corresponding target waypoints. The lines indicate planned trajectories.

4. Experiments and Results

4.1 Baseline Methods and Evaluation Metrics

We compare our proposed EC-InforMAPPO and EC-InforMAPPO-CL against four baselines: RMAPPO (recurrent version of MAPPO), RMADDPG (recurrent MADDPG), RMATD3 (recurrent MATD3), and InforMAPPO (the original GNN-based method without edge coupling). All algorithms use the same reward function and network architecture where applicable. The following metrics are employed:

Average Return: Mean cumulative reward per agent per episode.
Average Success Rate (%): Percentage of UAVs that reach their target without collisions.
Average Collision Rate (%): Percentage of UAVs that experience at least one collision during an episode.
Average Collision Frequency: Mean number of collisions per UAV per episode.

4.2 Training Details

All algorithms are trained for 5 million environment steps in each scenario (Easy, Medium, Hard) using a single NVIDIA RTX 3090 GPU. The hyperparameters are listed in Table 1.

Table 1: Hyperparameter settings
Parameter	Value
Number of agents	5
Discount factor γ	0.99
GAE λ	0.95
PPO clip ratio	0.2
Actor learning rate	3e-4
Critic learning rate	1e-3
GNN layers	2
Hidden dimension	64
Number of training steps per stage	5 million (total)
Batch size	1024
Buffer size	32768

4.3 Training Results

Figures of training curves (omitted in this text, but summarized below) show that EC-InforMAPPO consistently outperforms all baselines across all difficulty levels. The edge-coupling mechanism yields faster convergence and higher final returns. With curriculum learning, EC-InforMAPPO-CL achieves even better performance: in the Medium scenario, it reaches a success rate of 97.5% after 5 million steps, whereas the baseline InforMAPPO only reaches 94.7%. In the Hard scenario, EC-InforMAPPO-CL achieves 97.0% success rate compared to 94.9% for EC-InforMAPPO without CL.

Table 2 presents the quantitative test results averaged over 1000 independent episodes with different random seeds.

Table 2: Performance comparison across scenarios (best values in bold)
Algorithm	Metric	Easy	Medium	Hard
RMADDPG	Avg Return	93.10	72.39	48.35
	Success (%)	39.9	38.2	25.7
	Collision (%)	2.2	3.6	6.4
	Collision Freq.	0.04	0.07	0.13
RMATD3	Avg Return	97.25	92.53	60.75
	Success (%)	38.8	37.6	18.6
	Collision (%)	2.4	3.0	4.4
	Collision Freq.	0.05	0.06	0.09
RMAPPO	Avg Return	164.67	146.15	114.53
	Success (%)	87.5	81.4	77.8
	Collision (%)	11.9	17.1	20.4
	Collision Freq.	0.24	0.35	0.41
InforMAPPO	Avg Return	189.66	191.26	180.39
	Success (%)	94.4	94.7	93.0
	Collision (%)	5.8	4.7	6.1
	Collision Freq.	0.12	0.09	0.12
EC-InforMAPPO	Avg Return	202.15	199.36	189.77
	Success (%)	96.9	96.5	94.9
	Collision (%)	3.1	3.4	4.8
	Collision Freq.	0.06	0.07	0.10
EC-InforMAPPO-CL	Avg Return	—	199.99	196.66
	Success (%)	—	97.5	97.0
	Collision (%)	—	2.4	2.9
	Collision Freq.	—	0.05	0.06
EC-InforMAPPO-CL-3M-2M	Avg Return	—	—	191.74
	Success (%)	—	—	95.9
	Collision (%)	—	—	4.0
	Collision Freq.	—	—	0.08

As shown in Table 2, EC-InforMAPPO achieves the highest average return and success rate in all three difficulty levels, while maintaining a low collision rate. The addition of curriculum learning (EC-InforMAPPO-CL) further improves both success rate and collision avoidance, especially in the Hard scenario where the success rate reaches 97.0% — a 2.1% improvement over EC-InforMAPPO without CL, and a 6% reduction in collision rate. The variant EC-InforMAPPO-CL-3M-2M, which uses only 3 million steps in Medium and 2 million in Hard (totaling 5 million as well), still surpasses the non-curriculum version, demonstrating the efficiency of curriculum transfer.

4.4 Analysis of Edge-Coupling Mechanism

To understand the benefit of edge coupling, we analyze the attention weights learned by the network. In the original InforMAPPO, edge features are simply added to node features, which can dilute the importance of relative kinematics. In our edge-coupled design, the attention weight α_ij directly depends on the concatenated vector, allowing the model to emphasize neighbors that are rapidly approaching (high relative velocity) or dangerously close (small distance). This leads to more proactive obstacle avoidance. Figure 2 (not included) illustrates a typical scenario where the edge-coupled agent initiates an evasive maneuver earlier than the baseline.

4.5 Ablation Study on Curriculum Design

We conduct an ablation study to verify the necessity of the three-stage curriculum. We compare training from scratch in the Hard scenario (5 million steps) against the curriculum approach. The curriculum-trained model (EC-InforMAPPO-CL) achieves a success rate of 97.0% after 5 million total steps (3M Medium + 2M Hard), whereas the from-scratch model barely reaches 80% success after 5 million steps. This confirms that curriculum learning significantly reduces the exploration difficulty and yields a superior policy. Furthermore, we test a variant that trains only in Medium for 3 million steps and then continues in Hard for 2 million steps (EC-InforMAPPO-CL-3M-2M). Although it uses fewer total steps, it still outperforms the from-scratch model, highlighting the value of step-wise transfer.

5. Conclusion and Future Work

In this paper, we have presented a comprehensive framework for multi-UAV path planning in three-dimensional dense obstacle environments, integrating edge-enhanced graph neural networks with progressive curriculum learning. Our contributions include: (1) an edge-coupling mechanism that enriches GNN attention with relative kinematic features, leading to more accurate spatial reasoning; (2) a three-stage curriculum training strategy that smoothly transfers knowledge from easy to hard scenarios, accelerating convergence and improving robustness; (3) extensive experimental validation on a realistic PyBullet simulation platform, demonstrating superior performance over existing MARL algorithms. The proposed EC-InforMAPPO-CL algorithm achieves a 97.0% success rate in the hardest scenario (45 obstacles) while maintaining a collision rate below 3%. This work provides a practical and scalable solution for real-world China UAV operations, such as coordinated search-and-rescue in cluttered urban environments or autonomous warehouse inventory management. Additionally, the simulation environment and algorithms are well-suited for educational purposes, enabling students to explore the interplay between multi-agent systems, graph neural networks, and reinforcement learning.

There remain several avenues for future improvement. First, we plan to extend the framework to handle dynamic obstacles, such as moving pedestrians or other UAVs, by incorporating prediction modules. Second, we aim to optimize the inference speed for real-time deployment on embedded China UAV platforms, possibly through model compression and knowledge distillation. Third, we will investigate more sophisticated curriculum designs, such as automatic difficulty adjustment based on the agent’s current performance. Finally, we intend to release an open-source version of our experimental platform to facilitate further research and education in multi-UAV path planning.

Acknowledgments: This work was supported in part by the Fujian Provincial Natural Science Foundation under Grant 2022J01117, and the Fuzhou University Graduate Education Reform Project under Grant FYAI2024010. We thank the anonymous reviewers for their constructive comments.