Multi-UAV Formation Obstacle Avoidance Control Based on PPO Algorithm

In recent years, the rapid advancement of unmanned aerial vehicle (UAV) technology has expanded its applications across various fields, particularly in complex environments such as forests, urban canyons, and disaster zones. Among these applications, drone formation operations have garnered significant attention due to their potential for collaborative tasks like surveillance, search and rescue, and environmental monitoring. However, enabling a drone formation to navigate autonomously through dense obstacles while maintaining formation integrity poses substantial challenges, including high-dimensional state spaces, dynamic interactions, and the need for real-time decision-making. Traditional methods, such as artificial potential fields, often struggle with local minima and lack adaptability in unpredictable settings. To address these issues, we propose an end-to-end motion planning approach for multi-UAV formations shuttling through forest-like environments, leveraging a chained training framework with an enhanced Proximal Policy Optimization (PPO) algorithm. This method integrates heuristic information to guide learning, ensuring efficient obstacle avoidance and formation stability in continuous 3D spaces. By framing the problem within a Markov Decision Process (MDP) and employing deep reinforcement learning (DRL), our approach enables UAVs to learn robust policies through simulation, reducing training difficulty and improving scalability. In this paper, we detail our methodology, experimental setup, and results, demonstrating the superiority of our PPO-based method over conventional techniques in terms of success rate and formation coherence. The contributions of this work lie in the novel chained training paradigm, the design of a comprehensive reward function, and the incorporation of heuristic cues, which collectively enhance the autonomy and reliability of drone formations in cluttered environments.

The core problem we address involves coordinating a drone formation of nine UAVs to traverse a densely forested 3D environment while avoiding both static obstacles (trees) and dynamic obstacles (other UAVs) and reaching designated target points. The environment is simulated using the Gymnasium framework, featuring randomly generated trees that serve as static barriers. Each UAV must navigate from its initial position to a target location, which is defined relative to a formation center. Specifically, the target for the formation center is set within a region 5–6 meters ahead along the x-axis and within ±2 meters in the y and z axes. The individual UAV targets are then derived based on a preset formation pattern, ensuring no overlaps. This setup mimics real-world scenarios where drone formations must adapt to unstructured terrains. The challenges include the continuous action space, partial observability, and the need for coordinated movements to prevent collisions and maintain formation shape. Our goal is to develop a control strategy that minimizes path length, avoids collisions, and preserves formation stability, all while operating in a fully decentralized manner where each UAV relies on local observations. This problem is inherently complex due to the multi-agent nature; simultaneous training of all UAVs often leads to non-stationarity and convergence issues. Thus, we introduce a chained training approach where UAVs are trained sequentially, with each agent learning to adapt to the fixed policies of others, thereby stabilizing the learning process. The following sections elaborate on our MDP formulation, algorithmic design, and experimental validation.
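
The per-episode target generation described above can be sketched as follows. This is an illustrative helper, not the authors' code: `sample_targets` and its signature are assumptions, but the sampling region (5–6 m ahead on x, ±2 m on y and z) and the offset-based derivation of individual UAV targets follow the text.

```python
import numpy as np

def sample_targets(center_pos, formation_offsets, rng=None):
    """Sample a formation-center target and derive per-UAV targets.

    center_pos: (3,) current formation center.
    formation_offsets: (N, 3) preset offsets of each UAV from the center.
    """
    rng = rng or np.random.default_rng()
    center_target = center_pos + np.array([
        rng.uniform(5.0, 6.0),   # 5-6 m ahead along the x-axis
        rng.uniform(-2.0, 2.0),  # within +/-2 m laterally
        rng.uniform(-2.0, 2.0),  # within +/-2 m vertically
    ])
    # Each UAV's target is the center target shifted by its preset offset,
    # so a valid formation pattern guarantees no overlapping targets.
    return center_target, center_target + formation_offsets
```

Because the offsets are fixed per formation pattern, the derived targets inherit whatever minimum spacing the pattern enforces.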

Our methodology centers on a chained PPO training framework tailored for multi-UAV formations. PPO is a policy-gradient algorithm known for its stability and efficiency in continuous control tasks. However, directly applying PPO to a multi-agent drone formation in a 3D environment can be problematic due to the high dimensionality and inter-agent dependencies. To mitigate this, we employ a distributed training scheme where only one UAV is trained at a time, while the others follow predefined or previously learned policies. This chain-like progression reduces the action space and environmental volatility, allowing the learning agent to converge more reliably. The process begins with all UAVs using a baseline artificial potential field method for obstacle avoidance. We then select the first UAV (e.g., the formation leader) for training, with its PPO policy network updated based on local observations. Once trained, this UAV’s policy is fixed, and the next UAV in the formation chain is trained, considering the now-static behavior of the first. This cycle repeats until all UAVs have been trained, and if necessary, multiple rounds of training are conducted to refine inter-agent coordination. The chained training flow ensures that each UAV learns to cooperate with its peers, ultimately achieving a cohesive drone formation strategy. Below, we formalize the MDP components, including state space, action space, reward function, and heuristic enhancements.
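
The chained progression above can be expressed as a small control loop. This is a schematic sketch only: `train_ppo` and `baseline_policy` are placeholders for the per-agent PPO training routine and the artificial potential field baseline, not a real API.

```python
def chained_training(num_uavs, rounds, train_ppo, baseline_policy):
    """Train one UAV at a time while its peers follow fixed policies.

    train_ppo(agent_id, peer_policies) -> trained policy for that agent.
    """
    # Every UAV starts from the baseline (e.g. artificial potential fields).
    policies = [baseline_policy] * num_uavs
    for _ in range(rounds):
        for i in range(num_uavs):
            frozen = list(policies)              # peers stay fixed this stage
            policies[i] = train_ppo(agent_id=i, peer_policies=frozen)
    return policies
```

Running more than one round lets later agents re-adapt to peers that were still using the baseline when they were first trained, which is the fine-tuning step the text describes.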

The state space for each UAV is designed to capture local observations relevant to navigation and formation keeping. Let UAVi denote the currently training drone. Its state vector Si is composed of several features: the direction vector to its target, relative positions of nearest neighbors, proximity to obstacles, its current velocity, and its deviation from the desired formation position. Mathematically, we define:

$$ \mathbf{S}_i = \left[ \mathbf{dir}_i, \mathbf{pos}_1, \mathbf{pos}_2, \mathbf{obs}_1, \dots, \mathbf{obs}_k, \mathbf{speed}_i, \mathbf{formation}_i \right] $$

Here, diri is a 3D vector pointing from UAVi's current position to its target, encouraging goal-directed movement. pos1 and pos2 are vectors to the two nearest UAVs in the formation, each contributing 3 dimensions, to account for dynamic obstacles. The obstacle terms obsj represent static trees within a 1.5-meter radius, with each obstacle described by its relative position (3 dimensions); we include up to k obstacles, where k is set to 6 based on typical density, yielding 18 dimensions. speedi is the UAV's current velocity vector (3 dimensions), and formationi is the offset from its ideal formation position (3 dimensions). This results in a total state dimensionality of 3 (diri) + 6 (neighbors) + 18 (obstacles) + 3 (speed) + 3 (formation) = 33 dimensions. This compact representation balances informativeness and computational efficiency, enabling the PPO network to learn effective policies. The state design emphasizes local interactions, which is crucial for scalability in large drone formations.
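
The 33-dimensional observation can be assembled as below. This is an illustrative sketch under stated assumptions: the function name and the zero-padding of absent obstacle slots are our choices, but the feature layout (direction, two nearest peers, up to six trees within 1.5 m, velocity, formation offset) follows the text.

```python
import numpy as np

def build_state(pos, vel, target, peer_pos, tree_pos, desired_pos,
                k=6, radius=1.5):
    """Assemble the 33-D local observation for one UAV."""
    direction = target - pos                       # 3: vector to target
    # Relative positions of the two nearest peer UAVs (3 dims each).
    peer_rel = peer_pos - pos
    nearest2 = peer_rel[np.argsort(np.linalg.norm(peer_rel, axis=1))[:2]]
    # Up to k trees within the sensing radius, nearest first, zero-padded.
    tree_rel = tree_pos - pos
    d = np.linalg.norm(tree_rel, axis=1)
    in_range = tree_rel[d <= radius][np.argsort(d[d <= radius])][:k]
    obs = np.zeros((k, 3))
    obs[:len(in_range)] = in_range
    formation = desired_pos - pos                  # 3: formation deviation
    return np.concatenate([direction, nearest2.ravel(),
                           obs.ravel(), vel, formation])
```

Zero-padding keeps the state dimensionality fixed at 33 regardless of how many trees fall inside the sensing radius.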

The action space for each UAV is continuous, corresponding to velocity adjustments in 3D space. At each time step, UAVi selects an action ai = [Vx, Vy, Vz], where each component represents the velocity in meters per second along the respective axis. The velocities are constrained to |V| ≤ 0.4 m/s to reflect realistic UAV dynamics and ensure smooth movements. This continuous action space allows for fine-grained control, essential for precise navigation through tight gaps in a forest environment. The policy network, a neural network with two hidden layers of 64 units each, maps the state Si to a multivariate Gaussian distribution over actions, from which a specific velocity vector is sampled. During training, we use the PPO clipping mechanism to update the policy parameters, preventing large deviations that could destabilize learning. The action space is integral to enabling the drone formation to maneuver flexibly while adhering to physical limits.
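
A minimal sketch of the Gaussian action head described above, using the stated architecture (two hidden layers of 64 units) and the ±0.4 m/s per-axis bound. The random weights and fixed log-std are placeholders; a real implementation would learn both via the PPO update.

```python
import numpy as np

class GaussianPolicy:
    """Illustrative 33 -> 64 -> 64 -> 3 MLP with a diagonal Gaussian head."""

    def __init__(self, state_dim=33, action_dim=3, hidden=64, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0.0, 0.1, (state_dim, hidden))
        self.W2 = rng.normal(0.0, 0.1, (hidden, hidden))
        self.W3 = rng.normal(0.0, 0.1, (hidden, action_dim))
        self.log_std = np.full(action_dim, -1.0)  # fixed here; learned in PPO
        self.rng = rng

    def act(self, state, v_max=0.4):
        h = np.maximum(state @ self.W1, 0.0)      # ReLU, hidden layer 1
        h = np.maximum(h @ self.W2, 0.0)          # ReLU, hidden layer 2
        mean = h @ self.W3                        # mean velocity command
        action = mean + np.exp(self.log_std) * self.rng.standard_normal(3)
        return np.clip(action, -v_max, v_max)     # enforce the 0.4 m/s bound
```

Sampling from the Gaussian provides the exploration noise PPO relies on; clipping after sampling keeps every command within the physical velocity limits.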

The reward function is a critical component that shapes the learning behavior. We formulate it as a sum of three terms: collision penalty, path cost, and formation stability reward. Each term is designed to align with the objectives of safe, efficient, and coordinated navigation for the drone formation. The collision penalty addresses both static and dynamic obstacles. For static obstacles (trees), we compute a penalty based on the minimum distance between the UAV’s trajectory and any tree during a time step. Similarly, for dynamic obstacles (other UAVs), we penalize close approaches. Let traj denote the UAV’s path over a step, discretized into n sample points. The penalties are defined as:

$$ R_{\text{obs}} = \sum_{j=1}^{n} \left( f_{\text{static}}(d_j^{\text{tree}}) + f_{\text{dynamic}}(d_j^{\text{UAV}}) \right) $$

where djtree and djUAV are distances to the nearest tree and UAV, respectively, and the functions f assign negative rewards based on proximity thresholds. For example:

$$ f_{\text{static}}(d) = \begin{cases}
0 & \text{if } d > 0.35 \, \text{m} \\
-10 & \text{if } 0.22 \, \text{m} \leq d \leq 0.35 \, \text{m} \\
-50 & \text{if } d < 0.22 \, \text{m}
\end{cases} $$
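
The piecewise penalty translates directly into code; this sketch mirrors the thresholds in the definition above.

```python
def f_static(d):
    """Graded tree-proximity penalty (distance d in meters)."""
    if d > 0.35:        # safely clear of the trunk: no penalty
        return 0.0
    if d >= 0.22:       # inside the warning band: mild penalty
        return -10.0
    return -50.0        # closer than 0.22 m: treated as a collision
```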

This graded penalty encourages the UAV to maintain a safe buffer from obstacles. The path cost incentivizes shorter trajectories by penalizing the magnitude of the action vector:

$$ R_{\text{pc}} = -\alpha \| \mathbf{a}_i \| $$

where α is a positive scaling factor (set to 0.1 in our experiments). This term discourages unnecessary detours, promoting efficiency in the drone formation's movement. The formation stability reward is designed to maintain the preset formation shape. We employ a virtual leader-follower model, where a virtual leader moves toward the formation target at a constant speed of 0.4 m/s. Each UAV has a desired position relative to this leader. The stability reward combines two parts: a dense reward for staying close to the desired position, and a sparse reward for matching the leader's velocity. Specifically:

$$ R_{\text{fc}} = \frac{\beta}{\| \mathbf{p}_{\text{cur}} - \mathbf{p}_{\text{des}} \| + 0.01} $$

where pcur is the UAV’s current position, pdes is its desired formation position, and β is a weight (set to 0.05). Additionally, if the UAV’s velocity aligns closely with the leader’s (difference < 0.05 m/s), we add a bonus Rvs = 1.5. The total reward for a step is then:

$$ R_{\text{total}} = R_{\text{obs}} + R_{\text{pc}} + R_{\text{fc}} + R_{\text{vs}} $$

This composite reward guides the UAV to balance obstacle avoidance, path shortness, and formation fidelity, which are essential for a robust drone formation.
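
The per-step reward composition can be sketched as below. The obstacle term is passed in precomputed (it requires the discretized trajectory), while the other three terms follow the definitions above with the stated weights; the function name and signature are illustrative.

```python
import numpy as np

def step_reward(action, cur_pos, des_pos, vel, leader_vel,
                alpha=0.1, beta=0.05, v_bonus=1.5, r_obs=0.0):
    """Combine the four reward terms for one time step."""
    r_pc = -alpha * np.linalg.norm(action)                    # path cost
    r_fc = beta / (np.linalg.norm(cur_pos - des_pos) + 0.01)  # formation keeping
    # Sparse bonus when the UAV matches the virtual leader's velocity.
    r_vs = v_bonus if np.linalg.norm(vel - leader_vel) < 0.05 else 0.0
    return r_obs + r_pc + r_fc + r_vs
```

Note the 0.01 in the denominator caps the dense formation reward at β/0.01 = 5 when the UAV sits exactly on its desired position, preventing an unbounded reward.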

To accelerate learning in the sparse reward setting of 3D navigation, we incorporate heuristic information as an additional guidance signal. Specifically, we add a heuristic velocity vector to the action output by the policy network. This vector points toward the UAV’s target but is scaled based on the formation’s progress to prevent any UAV from lagging or rushing ahead. Formally, the heuristic velocity for UAVi is:

$$ \mathbf{V}_h^i = \lambda \cdot \frac{\mathbf{p}_{\text{tar}}^i - \mathbf{p}_{\text{cur}}^i}{\| \mathbf{P}_{\text{tar}} - \mathbf{P}_{\text{cur}} \|} $$

where ptari and pcuri are the target and current positions of UAVi, Ptar and Pcur are matrices of all UAVs’ target and current positions, and λ is a small positive coefficient (set to 0.05). The denominator normalizes by the overall formation displacement, so UAVs that are behind receive a stronger push, while those ahead are moderated. This heuristic is added directly to the policy’s action during training but is removed during evaluation to test the learned policy’s autonomy. It serves as a curriculum, easing exploration early on and gradually fading as the policy improves. This approach proved crucial for stabilizing training and achieving convergence in our drone formation experiments.
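
The heuristic term maps to a few lines of code. A small epsilon in the denominator (our addition, not in the formula) guards against division by zero once the whole formation reaches its targets.

```python
import numpy as np

def heuristic_velocity(p_cur_i, p_tar_i, P_cur, P_tar, lam=0.05):
    """Heuristic push toward UAV i's target, normalized by formation progress.

    P_cur, P_tar: (N, 3) stacked current / target positions of all UAVs.
    The Frobenius norm of the stacked displacement shrinks as the formation
    closes in, so a UAV still far from its own target gets a relatively
    stronger push than one that is nearly there.
    """
    denom = np.linalg.norm(P_tar - P_cur) + 1e-8  # avoid divide-by-zero at goal
    return lam * (p_tar_i - p_cur_i) / denom
```

During evaluation this vector is simply omitted, leaving the raw policy action, which matches the curriculum role described above.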

We implemented our method in a custom Gymnasium environment simulating a 3D forest with nine UAVs and 40 randomly placed trees. Each tree is modeled as a cylinder with a radius of 0.2 meters, and UAVs have a collision radius of 0.1 meters. The environment is continuous in space and time, with each time step corresponding to 0.1 seconds. The UAVs start in a predefined formation (e.g., a 3×3 grid) and must reach their target positions within a maximum episode length of 500 steps. The target for the formation center is randomly sampled within the specified region each episode, and the individual UAV targets are computed accordingly. For training, we use the PPO algorithm with a learning rate of 3×10⁻⁴, a discount factor γ = 0.99, and a clipping epsilon of 0.2. The policy and value networks are multilayer perceptrons with two hidden layers of 64 units each, using ReLU activations. Training proceeds in a chained manner: we first train UAV1 for 1 million steps while others use artificial potential fields, then freeze its policy and train UAV2, and so on. After all UAVs are trained once, we perform additional rounds of training (up to three) where all UAVs use their learned policies, fine-tuning for coordination. This process is summarized in Table 1, which outlines the training parameters.

Table 1: Training Parameters for the Chained PPO Approach

| Parameter | Value |
| --- | --- |
| Number of UAVs | 9 |
| State dimensions | 33 |
| Action dimensions | 3 (Vx, Vy, Vz) |
| Action bounds | ±0.4 m/s per axis |
| PPO learning rate | 3×10⁻⁴ |
| Discount factor (γ) | 0.99 |
| Clipping epsilon | 0.2 |
| Training steps per UAV | 1 million |
| Heuristic coefficient (λ) | 0.05 |
| Reward weights (α, β) | 0.1, 0.05 |
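
The clipped surrogate objective at the heart of the PPO update, with the epsilon from Table 1, can be sketched as follows. This is the standard PPO clipping rule, not the authors' code; `ratio` denotes the new-to-old policy probability ratio and `adv` the advantage estimates.

```python
import numpy as np

def ppo_clip_objective(ratio, adv, eps=0.2):
    """Clipped surrogate objective (to be maximized) over a batch."""
    unclipped = ratio * adv
    # Clipping the ratio to [1 - eps, 1 + eps] removes the incentive to
    # move the policy far from the one that collected the data.
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv
    return np.mean(np.minimum(unclipped, clipped))
```

Taking the elementwise minimum makes the objective a pessimistic bound: large policy updates are only rewarded when they do not overshoot the trust region.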

Our experimental results demonstrate the effectiveness of the proposed method. During training, we monitored the average episode reward and the average steps per episode. As shown in Figure 1 (simulated plot), the reward increased steadily over time, converging around -50 after approximately 2 million total steps, indicating that the UAVs learned to balance the reward components. The average steps per episode initially decreased as the UAVs learned to take shorter paths, then slightly increased as they prioritized formation stability, settling at about 52 steps. This trend reflects the trade-off between efficiency and coordination in the drone formation. To evaluate performance, we conducted 50 randomized test episodes with different tree layouts and target locations. Our PPO-based method achieved a success rate of 96% (defined as no collisions and all UAVs reaching targets), compared to 86% for a baseline artificial potential field method. Moreover, the average path cost (total distance traveled by all UAVs) was 52.25 meters for our method versus 51.84 meters for the baseline, indicating a slight sacrifice in path optimality for improved safety and formation keeping. These results are summarized in Table 2.

Table 2: Performance Comparison Between PPO and Artificial Potential Field (APF) Methods

| Metric | PPO-Based Method | APF Baseline |
| --- | --- | --- |
| Success Rate (no collisions) | 96% | 86% |
| Average Path Cost (meters) | 52.25 | 51.84 |
| Formation Stability Score* | 0.89 | 0.72 |
| Average Episode Steps | 52 | 48 |

*Formation stability score is computed as the average inverse distance to desired formation positions over an episode (higher is better).

The superiority of our approach is further evident in qualitative observations. The UAVs exhibit smooth, coordinated movements, often weaving through tight spaces without breaking formation. In contrast, the APF method frequently led to oscillatory behaviors or deadlocks near obstacles, causing collisions or prolonged detours. The learned policies also generalize well to unseen obstacle configurations, highlighting the robustness of the DRL approach. For instance, in one test, the drone formation successfully navigated a dense cluster of trees by slightly adjusting its shape, demonstrating emergent cooperation. This adaptability is crucial for real-world deployments where environments are unpredictable. Additionally, the chained training proved efficient, reducing the sample complexity compared to simultaneous multi-agent PPO. Each UAV required about 1 million steps to train, totaling 9 million steps for the full formation, which is feasible in simulation. The heuristic information played a key role in early training, as without it, the UAVs often remained stationary to avoid collision penalties, failing to explore goal-directed behaviors. Ablation studies confirmed that removing the heuristic increased training time by 30% and lowered the final success rate to 88%. Thus, our integrated design effectively addresses the challenges of multi-UAV navigation in complex 3D spaces.

In conclusion, we have presented a novel method for multi-UAV formation obstacle avoidance using a chained PPO algorithm enhanced with heuristic information. Our approach tackles the difficulties of training in continuous, high-dimensional environments by sequentially training each UAV, incorporating a comprehensive reward function, and providing guided exploration. The results show that our method achieves a high success rate (96%) in navigating dense forests while maintaining formation stability, outperforming traditional artificial potential field methods. The slight increase in path cost is a reasonable trade-off for improved safety and coordination, making it suitable for applications where drone formation integrity is critical, such as surveillance or payload delivery in cluttered areas. Future work could extend this framework to larger formations, dynamic obstacles like birds or other vehicles, and real-world hardware implementation. Additionally, investigating meta-learning or transfer learning could reduce training time for new environments. Overall, this study contributes to the advancement of autonomous drone formation navigation, offering a scalable and robust solution for complex obstacle-rich settings. The integration of deep reinforcement learning with strategic training paradigms holds promise for the next generation of intelligent UAV systems.
