As a researcher in the field of autonomous drone systems, I have been fascinated by the potential of formation drone light shows, where multiple unmanned aerial vehicles (UAVs) operate in synchronized patterns to create dazzling aerial displays. These shows often require drones to navigate through complex environments, such as outdoor venues with static and dynamic obstacles, while maintaining precise formations. In this article, I present a novel approach based on the Proximal Policy Optimization (PPO) algorithm to address the challenge of multi-UAV formation obstacle avoidance, with a focus on applications in formation drone light shows. My goal is to develop an end-to-end motion planning method that ensures drones can avoid obstacles, stay in formation, and reach target positions efficiently, all critical for successful light shows.
The core problem involves coordinating multiple drones in a 3D continuous space, where they must avoid obstacles like trees or other drones and maintain a stable formation. For formation drone light shows, this is particularly important because any collision or deviation can disrupt the visual spectacle. Traditional methods, such as artificial potential fields, often struggle with the high dimensionality and dynamic nature of such environments. To overcome this, I propose a chain training framework using PPO, a reinforcement learning algorithm, which allows drones to learn adaptive policies through trial and error. This method incorporates heuristic information to guide learning and is designed to handle the complexities of real-world drone operations.

In the context of formation drone light shows, the environment is simulated as a 3D virtual forest with randomly generated trees as static obstacles. A swarm of nine drones must navigate through this space to reach target positions relative to a formation center, which is essential for creating cohesive light patterns. The target points are set within a range of 5–6 meters ahead of the formation center, with offsets up to 2 meters in vertical and horizontal directions. This setup mimics the dynamic requirements of a formation drone light show, where drones need to adjust their positions while avoiding obstacles and staying aligned with the overall display. The challenge lies in training drones to make real-time decisions without collisions, ensuring the show’s continuity and safety.
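As a rough illustration of this setup, the target sampling can be sketched as follows. The function name, the choice of +x as the "ahead" axis, and the RNG handling are assumptions for illustration, not part of the original method; only the 5–6 m forward range and the up-to-2 m vertical/horizontal offsets come from the text.

```python
import numpy as np

def sample_target(formation_center, rng=None):
    """Sample a target point 5-6 m ahead of the formation center,
    with up to 2 m of horizontal and vertical offset (illustrative sketch)."""
    if rng is None:
        rng = np.random.default_rng()
    ahead = rng.uniform(5.0, 6.0)        # forward distance along the assumed +x axis
    lateral = rng.uniform(-2.0, 2.0)     # horizontal offset
    vertical = rng.uniform(-2.0, 2.0)    # vertical offset
    return formation_center + np.array([ahead, lateral, vertical])
```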
To model this problem, I define a Markov Decision Process (MDP) with local observation states, action spaces, and reward functions tailored for formation drone light shows. The local observation state for each drone includes: a direction vector to its target, positions of neighboring drones, information about nearby static obstacles, its current velocity, and its deviation from the expected formation position. Mathematically, for drone \(i\), the state \(S_i\) is represented as:
$$S_i = \{\text{dir}_i, \text{pos}_1, \text{pos}_2, \text{obs}_1, \dots, \text{speed}_i, \text{formation}_i\}$$
where \(\text{dir}_i\) is the 3D vector pointing to the target, \(\text{pos}_1\) and \(\text{pos}_2\) are vectors to the two nearest drones, \(\text{obs}_1, \dots\) represent obstacle data within 1.5 meters, \(\text{speed}_i\) is the velocity, and \(\text{formation}_i\) is the relative position in the formation. This state design captures the essential information for obstacle avoidance and formation maintenance in a formation drone light show.
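A minimal sketch of assembling this observation, assuming every position is a NumPy 3-vector; the helper name and the exact container layout are illustrative, but the components mirror \(S_i\) above (target direction, two nearest neighbors, obstacles within 1.5 m, velocity, formation deviation):

```python
import numpy as np

def build_observation(drone_pos, target, neighbor_positions, obstacles,
                      velocity, expected_formation_pos):
    """Assemble the local observation S_i for one drone (sketch)."""
    dir_to_target = target - drone_pos
    # Relative vectors to the two nearest neighboring drones
    rel = sorted((p - drone_pos for p in neighbor_positions),
                 key=lambda v: np.linalg.norm(v))[:2]
    # Static obstacles inside the 1.5 m sensing radius, as relative vectors
    obs = [o - drone_pos for o in obstacles
           if np.linalg.norm(o - drone_pos) < 1.5]
    # Deviation from the drone's expected slot in the formation
    formation_dev = expected_formation_pos - drone_pos
    return dir_to_target, rel, obs, velocity, formation_dev
```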
The action space consists of continuous velocity commands in 3D: \(V_x, V_y, V_z\), with each component constrained to \( |v| \leq 0.4 \, \text{m/s} \). At each time step, the drone selects an action based on its policy, updating its position accordingly. This continuous control is crucial for smooth movements in formation drone light shows, where abrupt changes can break the visual harmony.
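The per-axis velocity constraint and position update can be sketched as below; the 0.4 m/s bound is from the text, while the time step `DT` is an assumed value, since the article does not specify one:

```python
import numpy as np

V_MAX = 0.4  # m/s per axis, from the action-space constraint |v| <= 0.4
DT = 0.1     # assumed simulation time step in seconds (not given in the text)

def apply_action(position, action):
    """Clip each velocity component to |v| <= V_MAX and integrate one step."""
    v = np.clip(np.asarray(action, dtype=float), -V_MAX, V_MAX)
    return position + v * DT, v
```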
The reward function is designed to balance multiple objectives: avoiding collisions, minimizing path cost, and maintaining formation stability. For a formation drone light show, this ensures that drones not only reach their targets safely but also stay synchronized. The total reward \(R\) for each step is a sum of collision penalty \(R_{\text{obs}}\), path cost \(R_{\text{pc}}\), and formation rewards \(R_{\text{fc}}\) and \(R_{\text{vs}}\). The collision penalty is computed by sampling points along the drone’s trajectory and checking distances to obstacles. If \( \text{traj}_i \) is the position at sample \(i\), and \( \text{tree} \) and \( \text{UAV}_S \) represent static and dynamic obstacles, respectively:
$$R_{\text{obs}} = \sum_{i=1}^{k} \left( R_{\text{obs}_i} + R_{\text{obst}_i} \right)$$
where:
$$R_{\text{obs}_i} = \begin{cases}
0 & \text{if } \| \text{traj}_i - \text{tree} \| \in (0.35, \infty) \\
-10 & \text{if } \| \text{traj}_i - \text{tree} \| \in [0.22, 0.35] \\
-50 & \text{if } \| \text{traj}_i - \text{tree} \| \in (0, 0.22)
\end{cases}$$
and similarly for dynamic obstacles. The path cost encourages efficient movement:
$$R_{\text{pc}} = -m \cdot \| \text{action} \|, \quad m > 0$$
Formation rewards promote stability: \(R_{\text{fc}} = \frac{n}{ \| \text{cur}_{\text{pos}} - \text{pre}_{\text{pos}} \| + 0.01 }\) for \(n \in (0,1)\), where \( \text{cur}_{\text{pos}} \) is the current position and \( \text{pre}_{\text{pos}} \) is the expected position in the formation. Additionally, a sparse reward \(R_{\text{vs}} = 1.5\) is given if the drone’s velocity matches the leader’s within a threshold, which is vital for synchronized movements in a formation drone light show.
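Putting the reward components together, a per-step implementation might look like the following sketch. The concrete values of `m`, `n`, and the velocity-sync tolerance are assumptions, since the text only constrains \(m > 0\) and \(n \in (0,1)\); only static obstacles are shown, with the dynamic-obstacle term following the same pattern:

```python
import numpy as np

def collision_penalty(traj_point, tree_pos):
    """Piecewise static-obstacle penalty R_obs_i from the case analysis above."""
    d = np.linalg.norm(traj_point - tree_pos)
    if d > 0.35:
        return 0.0
    if d >= 0.22:
        return -10.0
    return -50.0

def step_reward(traj_points, trees, action, cur_pos, pre_pos,
                v_drone, v_leader, m=0.1, n=0.5, sync_tol=0.05):
    """Total per-step reward: collision + path cost + formation + velocity sync.
    m, n, sync_tol are illustrative values, not from the article."""
    r_obs = sum(collision_penalty(p, t) for p in traj_points for t in trees)
    r_pc = -m * np.linalg.norm(action)
    r_fc = n / (np.linalg.norm(cur_pos - pre_pos) + 0.01)
    r_vs = 1.5 if np.linalg.norm(v_drone - v_leader) < sync_tol else 0.0
    return r_obs + r_pc + r_fc + r_vs
```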
Heuristic information is added to accelerate learning, especially important for formation drone light shows where drones must coordinate closely. A heuristic velocity \(V_h^i\) is computed as:
$$V_h^i = l \cdot \frac{ \text{tar\_pos}_i - \text{cur\_pos}_i }{ \| \text{tar\_pos}_i - \text{cur\_pos}_i \| }$$
where \(l\) is a small positive coefficient. This guides drones toward their targets while adjusting based on the overall formation progress, ensuring that lagging drones catch up and leaders slow down, which enhances the cohesion of the formation drone light show.
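A sketch of this heuristic term, written so the vector points from the drone toward its target (the default value of `l` is an assumption; the text only says it is a small positive coefficient):

```python
import numpy as np

def heuristic_velocity(cur_pos, tar_pos, l=0.05):
    """Unit direction from the drone toward its target, scaled by l (sketch)."""
    diff = tar_pos - cur_pos
    norm = np.linalg.norm(diff)
    if norm < 1e-8:          # already at the target: no heuristic push
        return np.zeros(3)
    return l * diff / norm
```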
The training framework uses a chain-based PPO approach to handle the multi-agent complexity. In each training round, only one drone’s policy is updated using PPO, while others follow fixed strategies (e.g., pre-trained policies or artificial potential fields). This stabilizes the environment and reduces the action space dimensionality. The chain training proceeds sequentially through the drones, starting with a base method and iteratively refining policies. For a formation drone light show, this allows drones to adapt to each other’s behaviors, leading to robust collective performance. The PPO algorithm optimizes a policy network \(\pi_\theta(a|s)\) and a value network \(V_\phi(s)\) by maximizing a clipped objective:
$$L^{\text{CLIP}}(\theta) = \mathbb{E}_t \left[ \min \left( r_t(\theta) \hat{A}_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) \hat{A}_t \right) \right]$$
where \(r_t(\theta)\) is the probability ratio, \(\hat{A}_t\) is the advantage estimate, and \(\epsilon\) is a clipping parameter. This ensures stable policy updates, critical for learning smooth motions in formation drone light shows.
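A NumPy sketch of the clipped surrogate, negated so it can be minimized; in a real training loop this would operate on autograd tensors (e.g. PyTorch) with log-probabilities from the policy network:

```python
import numpy as np

def ppo_clip_loss(log_prob_new, log_prob_old, advantage, eps=0.2):
    """Clipped PPO surrogate L^CLIP, negated for use as a minimization loss."""
    ratio = np.exp(log_prob_new - log_prob_old)   # r_t(theta)
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantage
    # Pessimistic (elementwise minimum) bound, averaged over the batch
    return -np.mean(np.minimum(unclipped, clipped))
```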
To evaluate the method, I conducted simulation experiments using a Gymnasium-based environment with nine drones and 40 randomly placed trees. The drones were trained to reach target points while avoiding collisions and maintaining formation. Performance was measured by success rate (no collisions) and average path cost. For comparison, I also implemented a traditional artificial potential field method. The results are summarized in the table below, highlighting the effectiveness of the PPO-based approach for formation drone light shows.
| Algorithm | Success Rate (No Collision) (%) | Average Path Cost |
|---|---|---|
| PPO-based Reinforcement Learning | 96 | 52.25 |
| Artificial Potential Field | 86 | 51.84 |
The PPO method achieved a higher success rate, demonstrating better obstacle avoidance, which is essential for reliable formation drone light shows. Although its path cost was slightly higher due to formation constraints, this trade-off yields smoother and more synchronized movements. During training, the average reward converged to around -50, and the average steps per episode first decreased as drones learned to shorten their paths, then increased slightly as they prioritized formation stability. These trends are tracked with the following episode averages:
$$ \text{Average Reward} = \frac{1}{N} \sum_{i=1}^{N} R_i, \quad \text{where } R_i \text{ is the cumulative reward per episode}$$
$$ \text{Average Steps} = \frac{1}{N} \sum_{i=1}^{N} T_i, \quad \text{where } T_i \text{ is the number of steps per episode}$$
For formation drone light shows, maintaining formation stability is crucial. The virtual leader-follower model used in this method ensures that drones adhere to a reference trajectory. The formation error \(E_f\) is defined as the average deviation from expected positions:
$$E_f = \frac{1}{M} \sum_{j=1}^{M} \| \text{cur}_{\text{pos}_j} - \text{pre}_{\text{pos}_j} \|$$
where \(M\) is the number of drones. In experiments, \(E_f\) remained below 0.1 meters for the PPO method, indicating tight formation control, whereas the artificial potential field method showed larger deviations. This precision is vital for creating intricate patterns in formation drone light shows.
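The formation error \(E_f\) is a straightforward average over the swarm; a minimal sketch, assuming current and expected positions are given as \(M \times 3\) arrays:

```python
import numpy as np

def formation_error(cur_positions, expected_positions):
    """Average deviation E_f of drones from their expected formation slots."""
    cur = np.asarray(cur_positions, dtype=float)
    exp = np.asarray(expected_positions, dtype=float)
    return float(np.mean(np.linalg.norm(cur - exp, axis=1)))
```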
Additionally, the chain training framework allows for scalable deployment. As the number of drones increases, the method can be extended by adding more training rounds, ensuring that each drone’s policy is optimized within the collective context. This scalability makes it suitable for large-scale formation drone light shows involving hundreds of drones. The heuristic information further enhances learning efficiency by reducing exploration time, as drones are guided toward productive behaviors early in training.
In conclusion, the PPO-based chain training method offers a robust solution for multi-UAV formation obstacle avoidance, with direct applications to formation drone light shows. By integrating local observations, continuous actions, and multi-objective rewards, drones learn to navigate complex environments while staying in formation. The simulation results confirm that this approach outperforms traditional methods in success rate and formation stability, making it a promising tool for autonomous drone displays. Future work could explore real-world testing and integration with vision-based sensors to further enhance the adaptability of formation drone light shows in dynamic settings.
The implications for formation drone light shows are significant: with this method, show designers can program more complex and safe aerial choreographies, pushing the boundaries of what is possible in entertainment and artistic expression. The use of reinforcement learning also opens doors to adaptive shows that respond to environmental changes in real time, ensuring that formation drone light shows remain captivating and collision-free even in unpredictable conditions.
