In the rapidly evolving landscape of the low-altitude economy, the large-scale deployment of unmanned aerial vehicles (UAV drones) for logistics, urban air mobility, and emergency response demands efficient and safe cooperative trajectory planning under stringent spatio-temporal constraints. This paper addresses the challenge of coordinating multiple UAV drones operating within a shared cruising altitude layer, where each UAV drone must follow a predetermined flight schedule (takeoff time, origin, and destination) while avoiding static ground obstacles and other airborne UAV drones. We propose an enhanced multi-agent deep deterministic policy gradient (MADDPG) algorithm guided by artificial potential fields, termed APF-MADDPG, to solve the coupled spatio-temporal coordination problem. The key contributions are threefold. First, we integrate artificial potential field theory into the reward function to transform sparse terminal rewards into dense, continuous feedback, thereby mitigating the exploration bottleneck in high-dimensional, collision-prone environments. Second, we design a prioritized experience replay mechanism coupled with a policy rollback protection strategy that monitors the sliding-window success rate and reverts unstable policies, supporting robust convergence in non-stationary multi-agent settings. Third, we model the hard temporal constraint arising from asynchronous takeoff times and couple it with the two-dimensional spatial domain, enabling feasible and conflict-free trajectories for multiple UAV drones. Extensive experiments under multiple random seeds, across scenarios ranging from 5 to 10 UAV drones in both fixed and fully random obstacle configurations, demonstrate that our algorithm achieves a global success rate between 92.1% and 97.4%. Compared with the baseline MADDPG, the number of episodes needed to converge is reduced by approximately 59.3% and the steady-state average reward is 42.5% higher. Against state-of-the-art purely data-driven algorithms and a control barrier function-based safe reinforcement learning baseline, our method exhibits superior performance in avoiding local deadlocks and minimizing trajectory redundancy. The results confirm that the proposed framework provides a reliable solution for high-efficiency cooperative operations of multiple UAV drones in dense urban airspace with unknown static obstacles and stringent flight schedules.
The problem of multi-UAV drone cooperative path planning under flight schedule constraints is formulated as a decentralized partially observable Markov decision process (Dec-POMDP). At each discrete time step \( t \), each UAV drone \( i \) obtains a local observation \( o_i^t \) and selects an action \( a_i^t \) according to its policy \( \pi_i \). The joint action leads to a reward \( r_i^t \) and a transition to the next state. The objective is to maximize the expected cumulative discounted reward over the task horizon. The environment is characterized by a two-dimensional workspace \( \mathcal{W} \subset \mathbb{R}^2 \) containing \( K \) static obstacles and \( N \) UAV drones. Each UAV drone \( i \) is assigned a mission tuple \( \mathcal{T}_i = (p_i^{\text{start}}, p_i^{\text{goal}}, t_i^{\text{dep}}, v_i^{\text{max}}) \), where \( p_i^{\text{start}} \) and \( p_i^{\text{goal}} \) are the origin and destination coordinates, \( t_i^{\text{dep}} \) is the scheduled departure time, and \( v_i^{\text{max}} \) is the maximum speed. A feasible solution must satisfy the following hard constraints:
1. Time constraint: Each UAV drone must take off exactly at its scheduled departure time. Before \( t_i^{\text{dep}} \), the drone remains stationary on the ground and is not subject to collision detection. This introduces a spatio-temporal coupling: at different times, different sets of UAV drones occupy the airspace.
2. Static obstacle avoidance: At any time, the Euclidean distance between the center of UAV drone \( i \) and the center of any static obstacle \( k \) must be at least the safety threshold \( \delta_{\text{obs}} \):
$$ \| (x_i^t, y_i^t) - (x_k^{\text{obs}}, y_k^{\text{obs}}) \|_2 \geq \delta_{\text{obs}}. $$
3. Dynamic collision avoidance: For any pair of UAV drones \( i \) and \( j \) that are both airborne, the inter-drone distance must be at least \( \delta_{\text{safe}} \):
$$ \| (x_i^t, y_i^t) - (x_j^t, y_j^t) \|_2 \geq \delta_{\text{safe}}. $$
4. Task completion: Each UAV drone must reach its goal region (a circle of radius \( \zeta_{\text{range}} \) centered at \( p_i^{\text{goal}} \)) within a maximum number of steps \( T_{\text{max}} \).
These constraints are mutually coupled: the asynchronous takeoff times cause a dynamically changing set of airborne UAV drones, making the collision avoidance problem time-dependent and highly non-stationary. Traditional reinforcement learning approaches suffer from sparse rewards and poor exploration in such high-dimensional, temporally constrained environments. To address these challenges, we introduce the APF-MADDPG algorithm.
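The hard constraints above can be checked for a single time step with a short sketch. The function name `check_step`, the dictionary layout, and any numeric thresholds passed to it are illustrative assumptions, not part of the paper's implementation; drones whose departure time has not yet arrived are skipped, as the time constraint describes.

```python
import math

def dist(p, q):
    """Euclidean distance between two 2-D points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def check_step(t, positions, departures, obstacles, delta_obs, delta_safe):
    """Return the list of constraint violations at time t.

    positions:  {drone_id: (x, y)}
    departures: {drone_id: scheduled departure time}
    obstacles:  list of (x, y) obstacle centers
    Drones with t < departure time are treated as grounded and ignored.
    """
    airborne = sorted(i for i in positions if t >= departures[i])
    violations = []
    # Static obstacle avoidance: distance to every obstacle center >= delta_obs.
    for i in airborne:
        for k, center in enumerate(obstacles):
            if dist(positions[i], center) < delta_obs:
                violations.append(("obstacle", i, k))
    # Dynamic collision avoidance: every airborne pair separated by >= delta_safe.
    for a, i in enumerate(airborne):
        for j in airborne[a + 1:]:
            if dist(positions[i], positions[j]) < delta_safe:
                violations.append(("drone", i, j))
    return violations
```

Because the airborne set is recomputed from the departure times at every call, the same pair of positions can be feasible at one time and infeasible at another, which is the spatio-temporal coupling described above.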
Methodology: APF-MADDPG
The algorithm adopts the centralized training with decentralized execution (CTDE) paradigm. Each UAV drone \( i \) maintains an actor network \( \mu_i \) (policy) and a critic network \( Q_i \) (value function). During training, the critic for each agent has access to the global state (observations and actions of all agents), which stabilizes learning in the non-stationary multi-agent setting. The actor only uses its local observation during execution, enabling real-time decentralized decision-making.
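The difference between what the actor and the critic see under CTDE can be illustrated with a minimal sketch. The helper names `actor_input` and `critic_input` and the concatenation order are assumptions made for illustration; the paper does not specify how the joint input is laid out.

```python
def actor_input(observations, i):
    """Decentralized execution: actor i sees only its own local observation."""
    return list(observations[i])

def critic_input(observations, actions):
    """Centralized training: a critic sees all observations and all actions.

    The order (all observations, then all actions) is an assumed layout.
    """
    joint = []
    for obs in observations:
        joint.extend(obs)
    for act in actions:
        joint.extend(act)
    return joint
```

With N agents, observation size d_o, and two-dimensional actions, the critic input under this layout has N * (d_o + 2) entries, while each actor input keeps size d_o.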
State and Action Space
For each UAV drone \( i \), the local observation vector \( o_i \) is composed of four parts: self-state, teammate information, target information, and obstacle information, all normalized to the workspace dimensions \( l_w \) and \( l_l \). The self-state includes the normalized position and velocity:
$$ o_i^{\text{UAV}} = \left[ \frac{x_i}{l_w}, \frac{y_i}{l_l}, \frac{v_{x,i}}{v_{\text{max}}}, \frac{v_{y,i}}{v_{\text{max}}} \right]. $$
The teammate observation encodes the relative positions of other airborne UAV drones:
$$ o_i^{\text{team}} = \left[ \frac{x_i - x_j}{l_w}, \frac{y_i - y_j}{l_l} \right], \quad \forall j \neq i. $$
The target observation includes the distance to the goal and the bearing angle:
$$ o_i^{\text{target}} = \left[ \frac{\| p_i^{\text{goal}} - (x_i, y_i) \|_2}{l_w + l_l}, \arctan\left( \frac{y_i^{\text{goal}} - y_i}{x_i^{\text{goal}} - x_i} \right) \right]. $$
The obstacle observation gives the distance to the nearest static obstacle:
$$ o_i^{\text{obs}} = \left[ \frac{\| (x_k^{\text{obs}}, y_k^{\text{obs}}) - (x_i, y_i) \|_2}{l_w + l_l} \right]. $$
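The four observation components can be assembled as in the following sketch. The function name `build_observation` and its argument layout are hypothetical. One deliberate deviation is labeled in the code: `math.atan2` replaces the arctangent of the ratio so that the bearing stays defined when the goal lies directly above or below the drone.

```python
import math

def build_observation(i, pos, vel, goal, airborne, obstacles, l_w, l_l, v_max):
    """Assemble the local observation of drone i as a flat list.

    pos, vel:  {drone_id: (x, y)} positions and velocities
    goal:      (x, y) goal of drone i
    airborne:  ids of drones currently in the air
    obstacles: list of (x, y) obstacle centers
    """
    x, y = pos[i]
    vx, vy = vel[i]
    own = [x / l_w, y / l_l, vx / v_max, vy / v_max]

    # Relative positions of the other airborne drones.
    team = []
    for j in sorted(airborne):
        if j != i:
            team += [(x - pos[j][0]) / l_w, (y - pos[j][1]) / l_l]

    gx, gy = goal
    # Assumption: atan2 instead of arctan(dy / dx), to avoid division by zero
    # and to make the bearing unique in all four quadrants.
    target = [math.hypot(gx - x, gy - y) / (l_w + l_l),
              math.atan2(gy - y, gx - x)]

    # Distance to the nearest static obstacle center.
    nearest = min(math.hypot(ox - x, oy - y) for ox, oy in obstacles)
    return own + team + target + [nearest / (l_w + l_l)]
```

The length of the teammate part varies with the number of airborne drones in this sketch; a fixed-size network input would need padding or masking, a detail the text leaves open.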
The action space is continuous: each UAV drone outputs an acceleration vector \( a_i^t = [u_{x,i}^t, u_{y,i}^t]^T \in [-1, 1]^2 \). The velocity is updated according to:
$$ \begin{pmatrix} v_{x,i}^{t+1} \\ v_{y,i}^{t+1} \end{pmatrix} = \begin{pmatrix} v_{x,i}^{t} + u_{x,i}^t \Delta t \\ v_{y,i}^{t} + u_{y,i}^t \Delta t \end{pmatrix}, $$
where \( \Delta t = 0.5 \) s. The acceleration magnitude is constrained by the maximum thrust, which is implicitly enforced by the action bounds. To encourage exploration, an Ornstein-Uhlenbeck noise process is added to the action during training: \( \tilde{a}_i^t = \mu_i(o_i^t) + \mathcal{N}_t^{\text{OU}} \), where the noise variance decays exponentially from 0.5 to 0.05 over the course of training.
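A minimal sketch of the velocity update and the decaying exploration noise follows. Only the update rule, \( \Delta t = 0.5 \) s, and the 0.5 to 0.05 decay range come from the text. The speed clipping, the Ornstein-Uhlenbeck parameters `theta` and `mu`, and the use of the decayed value as the noise scale `sigma` are assumptions.

```python
import math
import random

DT = 0.5  # decision interval in seconds, from the text

def step_velocity(v, u, dt=DT):
    """Velocity update from the equation above: v <- v + u * dt."""
    return (v[0] + u[0] * dt, v[1] + u[1] * dt)

def clip_speed(v, v_max):
    """Rescale v so that its magnitude does not exceed v_max (assumed step)."""
    speed = math.hypot(v[0], v[1])
    if speed <= v_max or speed == 0.0:
        return v
    return (v[0] * v_max / speed, v[1] * v_max / speed)

def noise_scale(episode, total_episodes, start=0.5, end=0.05):
    """Exponential decay from start to end over the course of training."""
    frac = min(max(episode / total_episodes, 0.0), 1.0)
    return start * (end / start) ** frac

def ou_step(x, sigma, rng=random, theta=0.15, mu=0.0, dt=DT):
    """One Euler step of an Ornstein-Uhlenbeck process (theta, mu assumed)."""
    return x + theta * (mu - x) * dt + sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
```

During training, each action component would be perturbed by its own noise state updated with `ou_step`, using `noise_scale(episode, total_episodes)` as `sigma`.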
Reward Function with Artificial Potential Fields
To overcome the sparse reward problem, we design a composite reward function that integrates artificial potential field (APF) concepts. The reward for UAV drone \( i \) at time step \( t \) is:
$$ r_i^t = r_{\text{att}}^i + r_{\text{rep}}^i + r_{\text{task}}^i. $$
The attractive reward \( r_{\text{att}}^i \) encourages the drone to move toward its goal. Let \( \Delta d_i^t = \| p_i^{\text{goal}} - (x_i^{t-1}, y_i^{t-1}) \|_2 - \| p_i^{\text{goal}} - (x_i^t, y_i^t) \|_2 \) be the reduction in distance to the goal. Then:
$$ r_{\text{att}}^i = \begin{cases} \lambda_1 \cdot \Delta d_i^t, & \Delta d_i^t \leq d_1 \\ \lambda_2 \cdot \Delta d_i^t, & d_1 < \Delta d_i^t \leq d_2 \\ \lambda_3 \cdot \Delta d_i^t, & \Delta d_i^t > d_2 \end{cases} $$
where \( \lambda_1 > \lambda_2 > \lambda_3 \) are decreasing gain coefficients, and \( d_1 < d_2 \) are thresholds on the per-step progress \( \Delta d_i^t \). As defined, the gain applied to progress shrinks as the step toward the goal grows, which tempers aggressive approach maneuvers and helps avoid overshooting.
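The piecewise attractive term can be written directly from the case definition above. The gain values and thresholds in this sketch are placeholders chosen only to satisfy \( \lambda_1 > \lambda_2 > \lambda_3 \); the paper does not report them here.

```python
def attractive_reward(delta_d, lambdas=(1.5, 1.0, 0.5), d1=1.0, d2=2.0):
    """Piecewise-linear attractive reward on the per-step progress delta_d.

    lambdas, d1 and d2 are placeholder values, not taken from the paper.
    A negative delta_d (moving away from the goal) yields a negative reward.
    """
    lam1, lam2, lam3 = lambdas
    if delta_d <= d1:
        return lam1 * delta_d
    if delta_d <= d2:
        return lam2 * delta_d
    return lam3 * delta_d
```

With a maximum speed of 5 m/s and a 0.5 s decision interval, the per-step progress cannot exceed 2.5 m, which is why the placeholder thresholds sit at 1 m and 2 m.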
The repulsive reward \( r_{\text{rep}}^i \) penalizes proximity to static obstacles and other UAV drones. Let \( d_{i,\text{min}}^t \) be the minimum distance to any obstacle or airborne neighboring drone. Then:
$$ r_{\text{rep}}^i = \begin{cases} -\eta \left( \frac{1}{d_{i,\text{min}}^t} - \frac{1}{d_0} \right)^2, & d_{i,\text{min}}^t \leq d_0 \\ 0, & d_{i,\text{min}}^t > d_0 \end{cases} $$
where \( \eta \) is a repulsion gain and \( d_0 = 8 \) m is the APF influence radius. This provides a continuous negative gradient as the drone approaches obstacles, serving as an early warning signal that encourages avoidance before a collision occurs.
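As a sketch of the repulsive term, the function below evaluates the penalty for a given minimum distance. The influence radius \( d_0 = 8 \) m is from the text; the gain `eta = 1.0` is a placeholder, and the small lower bound on the distance is an added safeguard against division by zero.

```python
def repulsive_reward(d_min, eta=1.0, d0=8.0):
    """APF-style repulsive penalty; zero outside the influence radius d0.

    eta = 1.0 is a placeholder gain; d0 = 8 m comes from the text.
    """
    if d_min > d0:
        return 0.0
    d_min = max(d_min, 1e-6)  # added safeguard, not part of the formula
    return -eta * (1.0 / d_min - 1.0 / d0) ** 2
```

For example, under these placeholder settings the penalty at 4 m is -0.015625 and at 2 m is -0.140625, so halving the distance makes the penalty nine times larger.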
The task reward combines sparse terminal signals with a per-step penalty and a team bonus; the terms whose conditions hold at a given step are summed:
$$ r_{\text{task}}^i = \begin{cases} R_{\text{goal}}, & \| p_i^{\text{goal}} - (x_i^t, y_i^t) \|_2 < \zeta_{\text{range}} \\ R_{\text{crash}}, & \text{collision detected} \\ R_{\text{boundary}}, & \text{out of bounds} \\ -1, & \text{every step (time penalty)} \\ +20, & \text{for each teammate that reaches its goal} \end{cases} $$
Typical values: \( R_{\text{goal}} = +100 \), \( R_{\text{crash}} = -200 \), \( R_{\text{boundary}} = -200 \), \( \zeta_{\text{range}} = 5 \) m. The team bonus fosters cooperation.
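One way to read the task reward is as a sum of the terms whose conditions hold at the current step. The sketch below makes that additive reading explicit; it is an assumption, as is the interpretation of the team bonus as counting teammates that arrive at this step. The constants are the typical values quoted above.

```python
R_GOAL, R_CRASH, R_BOUNDARY = 100.0, -200.0, -200.0  # typical values from the text
STEP_PENALTY, TEAM_BONUS = -1.0, 20.0                 # from the case definition

def task_reward(reached_goal, collided, out_of_bounds, teammates_arrived):
    """Sum the task-reward terms whose conditions hold (assumed additive).

    teammates_arrived: number of teammates reaching their goal at this step.
    """
    reward = STEP_PENALTY
    if reached_goal:
        reward += R_GOAL
    if collided:
        reward += R_CRASH
    if out_of_bounds:
        reward += R_BOUNDARY
    return reward + TEAM_BONUS * teammates_arrived
```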
Training Stability Mechanisms
We employ two additional techniques to stabilize training:
1. Prioritized Experience Replay (PER): Instead of uniform sampling from the replay buffer, we sample transitions with probability proportional to their temporal-difference (TD) error. The sampling probability of the \( k \)-th experience is:
$$ P(k) = \frac{p_k^{\alpha}}{\sum_{j} p_j^{\alpha}}, \quad p_k = |\delta_k| + \epsilon, $$
where \( \alpha = 0.4 \) controls the prioritization strength, and \( \epsilon = 10^{-5} \) is a small constant to avoid zero probability. This focuses learning on transitions that are most surprising to the current value function, accelerating convergence.
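The sampling rule can be sketched in a few lines of standard-library Python. The function names are hypothetical, and the linear-time normalization is chosen for clarity only; large buffers are commonly backed by a sum-tree instead.

```python
import random

def per_probabilities(td_errors, alpha=0.4, eps=1e-5):
    """P(k) = p_k^alpha / sum_j p_j^alpha with p_k = |delta_k| + eps."""
    scaled = [(abs(d) + eps) ** alpha for d in td_errors]
    total = sum(scaled)
    return [s / total for s in scaled]

def sample_indices(td_errors, batch_size, rng=random):
    """Draw buffer indices (with replacement) according to P(k)."""
    probs = per_probabilities(td_errors)
    return rng.choices(range(len(td_errors)), weights=probs, k=batch_size)
```

Setting `alpha=0.0` recovers uniform sampling, while larger values concentrate the mini-batch on transitions with large TD error.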
2. Policy Rollback Protection: Multi-agent training can suffer from catastrophic forgetting or sudden performance collapse due to non-stationarity. We monitor a sliding window success rate \( S_t \) (over the last 100 episodes) and compare it to the historical best \( S_{\text{best}} \). A rollback is triggered if:
$$ R_t = \begin{cases} \text{True}, & S_{\text{best}} > \xi \text{ and } S_t < k \cdot S_{\text{best}} \\ \text{False}, & \text{otherwise} \end{cases} $$
where \( \xi = 0.85 \) is the activation threshold, so that rollback is armed only once the best windowed success rate has exceeded \( \xi \), and \( k = 0.7 \) is the collapse ratio. When triggered, the actor and critic networks revert to the previously saved best-performing parameters, and the exploration noise is reset to a higher value to re-explore the policy space. This mechanism suppresses divergent fluctuations and supports robust convergence.
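The monitoring logic can be sketched as a small class. The class name and structure are hypothetical. The sketch arms the trigger on \( S_{\text{best}} > \xi \), which is an assumption about the intended reading: with \( \xi = 0.85 \) and \( k = 0.7 \), a window rate above \( \xi \) can never fall below \( k \cdot S_{\text{best}} \leq 0.7 \). Clearing the window after a trigger is a further assumption, made so that one collapse does not fire on every subsequent episode.

```python
from collections import deque

class RollbackMonitor:
    """Track a sliding-window success rate and flag policy collapses."""

    def __init__(self, window=100, xi=0.85, k=0.7):
        self.outcomes = deque(maxlen=window)
        self.xi, self.k = xi, k
        self.best = 0.0

    def update(self, success):
        """Record one episode outcome; return True if a rollback should trigger."""
        self.outcomes.append(1.0 if success else 0.0)
        if len(self.outcomes) < self.outcomes.maxlen:
            return False  # wait until the window is full
        rate = sum(self.outcomes) / len(self.outcomes)
        if rate > self.best:
            self.best = rate  # the best actor/critic weights would be saved here
            return False
        collapsed = self.best > self.xi and rate < self.k * self.best
        if collapsed:
            self.outcomes.clear()  # assumed: refill the window before re-testing
        return collapsed
```

In a training loop, a `True` return value would be followed by reloading the saved parameters and raising the exploration noise, as described above.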
Experimental Setup
All experiments are conducted in a custom Python 3.12 simulation environment. The workspace is a 100 m × 100 m square. We consider three main configurations: (1) 5 UAV drones with 3 fixed circular obstacles; (2) 8 UAV drones with 5 randomly placed obstacles; (3) 10 UAV drones with 5 randomly placed obstacles. The origins and destinations for the 5-UAV scenario are listed in the table below. For larger scales, OD pairs are generated with similar spatial diversity. The takeoff times are randomized to simulate an asynchronous flight schedule (e.g., delays from 0 to 50 seconds). Each UAV drone has a maximum speed of 5 m/s and a decision interval of 0.5 s.
| UAV drone ID | Start (x, y) | Goal (x, y) |
|---|---|---|
| 1 | (92, 14) | (8, 91) |
| 2 | (67, 15) | (70, 92) |
| 3 | (36, 11) | (68, 70) |
| 4 | (21, 54) | (87, 85) |
| 5 | (80, 8) | (25, 85) |

The three fixed circular obstacles of the 5-UAV scenario are:

| Obstacle ID | Center (x, y) | Radius (m) |
|---|---|---|
| Obs 1 | (67, 40) | 4 |
| Obs 2 | (37, 30) | 4 |
| Obs 3 | (81, 62) | 4 |
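The 5-UAV scenario can be captured as plain data for reproduction. In the sketch below, the origins, destinations, obstacle geometry, maximum speed, decision interval, and goal radius come from the text and tables; the `Mission` class, the variable names, and in particular the departure times are hypothetical, since the schedule is randomized per seed.

```python
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class Mission:
    start: tuple
    goal: tuple
    t_dep: float        # scheduled departure time in s (placeholder values below)
    v_max: float = 5.0  # maximum speed in m/s, from the text

# Origins and destinations from the table; departure times are hypothetical.
MISSIONS = {
    1: Mission((92, 14), (8, 91), 0.0),
    2: Mission((67, 15), (70, 92), 10.0),
    3: Mission((36, 11), (68, 70), 20.0),
    4: Mission((21, 54), (87, 85), 30.0),
    5: Mission((80, 8), (25, 85), 40.0),
}
OBSTACLES = [((67, 40), 4.0), ((37, 30), 4.0), ((81, 62), 4.0)]  # (center, radius)

def min_steps(m, dt=0.5, goal_radius=5.0):
    """Lower bound on flight steps: straight line to the goal region at v_max."""
    d = math.dist(m.start, m.goal) - goal_radius
    return math.ceil(max(d, 0.0) / (m.v_max * dt))
```

The bound ignores acceleration limits and detours around obstacles, so it only gives a floor on the number of steps any policy needs for a given origin-destination pair.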
All neural networks have two hidden layers with 256 and 128 units, respectively, using ReLU activations. The critic learning rate is \( 5 \times 10^{-4} \), the actor learning rate is \( 1 \times 10^{-4} \), and the discount factor \( \gamma = 0.95 \). The replay buffer holds up to 50,000 transitions, and mini-batch size is 1024. Target networks are updated with soft update coefficient \( \tau = 0.01 \). Each training run consists of 20,000 episodes, with each episode lasting at most 300 steps. For statistical significance, every algorithm is trained with five different random seeds, each with a different takeoff schedule. We report mean and dispersion metrics.
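For reference, the reported settings can be collected in one configuration, together with a sketch of the soft target update. The dictionary keys are hypothetical names; the update rule shown is the standard Polyak-averaging form, assumed here because the text gives only the coefficient \( \tau \).

```python
# Values from the text; key names are illustrative.
HYPERPARAMS = {
    "hidden_units": (256, 128),
    "critic_lr": 5e-4,
    "actor_lr": 1e-4,
    "gamma": 0.95,
    "buffer_size": 50_000,
    "batch_size": 1024,
    "tau": 0.01,
    "episodes": 20_000,
    "max_steps": 300,
}

def soft_update(target, online, tau=HYPERPARAMS["tau"]):
    """Polyak averaging: target <- tau * online + (1 - tau) * target."""
    return [tau * w + (1.0 - tau) * t for t, w in zip(target, online)]
```

With \( \tau = 0.01 \), each update moves the target parameters 1% of the way toward the online parameters, so the targets change slowly relative to the networks being trained.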
Results and Analysis
We first evaluate the learning capability of the proposed APF-MADDPG algorithm on the 5-UAV fixed-obstacle scenario. Figure 1 shows the convergence curves of cumulative episode reward and task success rate averaged over the five seeds. The reward rises steeply in the first 7,000 episodes and stabilizes around 4,550. The success rate surpasses 90% after 8,000 episodes and eventually converges to approximately 98%. The few transient dips in the reward coincide with the activation of the rollback mechanism, which quickly restores performance.
The trajectories generated by the trained policy are illustrated in the accompanying snapshot figure, which captures the positions of the five UAV drones at various time steps. The asynchronous takeoff times are respected: earlier-departing UAV drones start moving first, while later ones remain stationary on the ground. The spatio-temporal coupling is successfully managed; no collisions occur, and all UAV drones reach their goals within the allotted steps.

To test generalization and scalability, we evaluate the trained APF-MADDPG model on scenarios with randomly placed obstacles. For each scale (5, 8, 10 UAV drones), we perform 100 test episodes with entirely different obstacle layouts. The results are summarized in the next table.
| Number of UAV drones | Global Success Rate (%) | Average Inference Time per Step (ms) |
|---|---|---|
| 5 | 97.4 | 1.62 |
| 8 | 94.8 | 2.07 |
| 10 | 92.1 | 4.31 |
The global success rate remains above 92% even with 10 UAV drones, indicating that the algorithm learns a generalizable avoidance strategy that does not overfit to specific obstacle configurations. The inference time grows from 1.62 ms to 4.31 ms per step as the fleet grows from 5 to 10 UAV drones, because the algorithm must process observations for all UAV drones sequentially.
Comparative Study with Baselines
We compare APF-MADDPG against the following algorithms: (1) IDDPG – fully independent DDPG, treating other UAV drones as moving obstacles; (2) APF-IDDPG – IDDPG with the same APF reward function; (3) MADDPG – standard multi-agent DDPG without APF reward or rollback; (4) MAPPO – state-of-the-art multi-agent PPO; (5) MATD3 – multi-agent twin delayed DDPG; (6) CBF-MADDPG – MADDPG with a control barrier function safety filter appended (a representative safe RL baseline). All algorithms use identical network architecture and hyperparameters where applicable. The 5-UAV fixed-obstacle scenario is used for the comparison, with five random seeds each.
The average reward convergence curves are shown in the figure below (not displayed due to text format). The key numerical results aggregated over 100 independent test episodes (all algorithms trained to convergence) are given in the following two tables: one for mean performance and one for dispersion metrics (interquartile range IQR and coefficient of variation CV).
| Metric | APF-MADDPG | MADDPG | APF-IDDPG | IDDPG | MATD3 | MAPPO | CBF-MADDPG |
|---|---|---|---|---|---|---|---|
| Average Reward | 4556.2 | 3198.3 | 3351.9 | 2863.5 | 3012.4 | 2786.1 | 3120.7 |
| Task Completion Rate | 0.986 | 0.661 | 0.564 | 0.347 | 0.542 | 0.410 | 0.247 |
| Global Success Rate | 0.978 | 0.085 | 0.060 | 0.005 | 0.040 | 0.064 | 0.000 |
| Trajectory Elongation Rate | 0.041 | 0.074 | 0.118 | 0.085 | 0.20 | 0.22 | 0.10 |
| Average Flight Steps | 43.8 | 36.1 | 47.3 | 47.6 | 45.1 | 40.6 | 33.0 |
| Collision Rate | 0.007 | 0.348 | 0.459 | 0.632 | 0.453 | 0.551 | 0.740 |
| Threat Frequency (proximity events) | 4.19 | 1.71 | 3.88 | 1.52 | 3.41 | 4.62 | 1.52 |

Dispersion metrics for the same test episodes:

| Metric | Statistical Measure | APF-MADDPG | MADDPG | APF-IDDPG | IDDPG | MATD3 | MAPPO | CBF-MADDPG |
|---|---|---|---|---|---|---|---|---|
| Average Reward | IQR / CV (%) | 72.1 / 1.18 | 294.5 / 6.51 | 297.3 / 6.96 | 327.2 / 9.79 | 215.6 / 7.87 | 308.4 / 10.22 | 267.3 / 8.14 |
| Task Completion Rate | IQR / CV (%) | 0.006 / 1.46 | 0.161 / 17.02 | 0.121 / 18.31 | 0.123 / 28.58 | 0.214 / 18.62 | 0.221 / 23.70 | 0.153 / 15.68 |
| Global Success Rate | IQR / CV (%) | 0.032 / 1.89 | 0.075 / 63.57 | 0.077 / 81.12 | 0.009 / — | 0.275 / 37.17 | 0.247 / 27.52 | 0.208 / 15.32 |
| Average Flight Steps | IQR / CV (%) | 2.53 / 4.46 | 2.91 / 6.17 | 6.62 / 11.12 | 7.86 / 12.76 | 3.93 / 15.57 | 2.77 / 12.42 | 4.88 / 8.10 |
| Collision Rate | IQR / CV (%) | 0.012 / 123.6 | 0.096 / 20.85 | 0.115 / 19.98 | 0.117 / 14.05 | 0.025 / 32.74 | 0.125 / 27.28 | 0.210 / 14.25 |
The results show a clear advantage for APF-MADDPG. The global success rate (all UAV drones reaching their goals without collisions) is 0.978, whereas the next best is 0.085 from baseline MADDPG. The average reward is 42.5% higher than MADDPG, and the collision rate is reduced to near zero (0.007). The trajectory elongation rate is only 0.041, indicating efficiently planned paths with little detour. In the dispersion metrics, APF-MADDPG has the smallest IQR on four of the five metrics (the exception is global success rate, where IDDPG's IQR of 0.009 accompanies a mean success rate of only 0.005) and a CV between 1.18% and 4.46% on every metric except collision rate, whose CV of 123.6% is inflated by the near-zero mean of 0.007. Together these indicate robust and reproducible performance.
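The two dispersion measures can be computed as in the sketch below. The quartile convention (the default of `statistics.quantiles`) and the use of the sample standard deviation are assumptions, since the paper does not state which variants were used; the sample values in the usage note are invented for illustration and are not experimental data.

```python
import statistics

def iqr(samples):
    """Interquartile range Q3 - Q1 (default 'exclusive' quartile method)."""
    q1, _, q3 = statistics.quantiles(samples, n=4)
    return q3 - q1

def cv_percent(samples):
    """Coefficient of variation in percent: 100 * sample std / mean."""
    return 100.0 * statistics.stdev(samples) / statistics.mean(samples)
```

Because the CV divides by the mean, a metric whose mean is close to zero can show a very large CV even when its absolute spread is tiny: the invented sample `[0.0, 0.0, 0.0, 0.028]` has a mean of 0.007 and a CV of about 200%.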
In contrast, the purely data-driven SOTA algorithms (MAPPO, MATD3) suffer from high variability and low success rates, as the asynchronous takeoff introduces severe non-stationarity that their value networks cannot handle. The CBF-MADDPG baseline, which enforces hard safety constraints via quadratic programming, actually leads to a high collision rate (0.740) because the safety filter often results in deadlocks when multiple UAV drones negotiate narrow corridors; the agents get stuck and eventually violate time limits or collide. The APF-based continuous reward provides smoother and more effective guidance, preventing deadlocks while preserving safety.
Furthermore, we evaluate the convergence speed. APF-MADDPG reaches a stable 95% success rate after 4,643 episodes on average, compared to 11,396 episodes for plain MADDPG, a 59.3% reduction. IDDPG and APF-IDDPG never reach the 95% threshold within 20,000 episodes. This acceleration is attributed to the dense reward signals from the potential field, which guide exploration from the very beginning.
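As a check on the two headline percentages, the relative reward gain and the reduction in episodes follow from the reported values:

$$ \frac{4556.2 - 3198.3}{3198.3} = \frac{1357.9}{3198.3} \approx 0.425, \qquad \frac{11396 - 4643}{11396} = \frac{6753}{11396} \approx 0.593. $$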
Conclusion
This paper proposes APF-MADDPG, a potential field-guided multi-agent deep reinforcement learning algorithm for spatio-temporal cooperative path planning of multiple UAV drones operating under flight schedule constraints. By integrating artificial potential fields into the reward function, the algorithm overcomes the sparse reward problem and enables efficient exploration. The prioritized experience replay and policy rollback mechanisms support stable convergence even in highly non-stationary environments. Extensive experiments on scenarios with 5–10 UAV drones and random static obstacles show that the algorithm achieves high success rates (92–97%), low collision rates, and low trajectory elongation, outperforming both classical MADDPG and modern SOTA baselines (MAPPO, MATD3) as well as a control barrier function-based safe RL method. The robustness is confirmed by low variability across random seeds. Future work will extend the framework to full 3D environments, moving obstacles, and realistic communication constraints in large-scale low-altitude traffic management systems.
