Intelligent Obstacle Avoidance Control for Formation Drone Light Shows in Unknown Environments

A formation drone light show is a demanding exercise in coordinated aerial robotics: hundreds of drones move in precise, dynamic patterns to create displays in the night sky. Keeping such a swarm safe and reliable in real-world, potentially cluttered environments without prior mapping is a significant challenge. The problem extends beyond path planning. It requires real-time, decentralized decision-making in which each drone avoids static obstacles and collisions with its neighbors while preserving the overall formation. Traditional methods often fall short in these complex, unknown settings. We address this challenge with a hierarchical control framework that combines the exploratory power of deep reinforcement learning with the reactive simplicity of artificial potential fields and consensus control, tailored to the demands of a large-scale formation drone light show.

1. Introduction and Problem Formulation

Executing a flawless formation drone light show in an unfamiliar venue—such as over a new city square or within a natural canyon—introduces the fundamental problem of obstacle avoidance. Unlike pre-programmed shows in controlled arenas, these environments may contain unexpected static obstacles like trees, flagpoles, or architectural features. The drones, typically modeled as agents with simplified dynamics, must navigate cooperatively. We consider a swarm of n+1 drones, with one designated as the leader and n as followers. Their motion in a 2D plane (extendable to 3D) is governed by:

$$
\begin{aligned}
\dot{x}_i &= v_i, \\
\dot{v}_i &= u_i, \quad i=1,2,\dots,n+1.
\end{aligned}
$$

where $x_i$ and $v_i$ are the position and velocity vectors, and $u_i$ is the control acceleration input. For a fixed-wing style drone used in endurance-focused shows, constraints include minimum/maximum speed and turn rate: $0 < v_{\text{min}} \leq \|v_i\|_2 \leq v_{\text{max}}, \quad |\dot{\theta}_i| < \dot{\theta}_{\text{max}}$.
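As a rough illustration of how these dynamics and constraints might be simulated, the sketch below advances one drone by a single discrete time step. The function name `step`, the semi-implicit Euler scheme, and the way the limits are enforced (clipping the heading change and the speed) are assumptions made for this example and are not specified in the text.

```python
import numpy as np

def step(x, v, u, dt, v_min, v_max, turn_max):
    """Advance x' = v, v' = u by one step of length dt (2D, hypothetical helper).

    The heading change is clipped to turn_max * dt and the speed to
    [v_min, v_max], which is one simple way to respect the stated limits.
    """
    v_raw = v + dt * u  # unconstrained velocity update
    heading = np.arctan2(v[1], v[0])
    heading_raw = np.arctan2(v_raw[1], v_raw[0])
    # Wrap the heading difference into [-pi, pi) before clipping it.
    dtheta = (heading_raw - heading + np.pi) % (2 * np.pi) - np.pi
    dtheta = np.clip(dtheta, -turn_max * dt, turn_max * dt)
    speed = np.clip(np.linalg.norm(v_raw), v_min, v_max)
    theta = heading + dtheta
    v_new = speed * np.array([np.cos(theta), np.sin(theta)])
    x_new = x + dt * v_new  # position uses the updated velocity
    return x_new, v_new
```

Note that clipping yields a non-strict bound on the turn rate, whereas the constraint above is strict; a slightly smaller `turn_max` can be passed if the strict inequality matters.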

The environment contains $m$ unknown, static circular obstacles defined by their center $(x_p, y_p)$ and radius $r_p$. The primary control objectives for a successful formation drone light show are:

Leader Goal Reaching & Obstacle Avoidance: The leader must navigate to a target position while avoiding all obstacles: $\min\|x_l - x_p\|_2 > r_p$.
Formation Maintenance & Collision Avoidance: All drones must maintain a desired geometric formation relative to the leader and avoid collisions with each other and obstacles: $\min\|x_i - x_j\|_2 > r_{\text{safe}}, \min\|x_i - x_p\|_2 > r_p$.
Asymptotic Stability: The swarm must achieve: $\lim_{t\to\infty}(x_l(t) - x_{\text{target}}(t)) = 0$, $\lim_{t\to\infty}(v_i(t) - v_j(t)) = 0$, and $\lim_{t\to\infty}(r_i(t) - r'_i(t)) = 0$, where $r'_i$ is the desired formation offset.

Communication is assumed to be centralized for leader-follower commands but limited locally among followers for proximity sensing, represented by an undirected graph $\mathcal{G}=(\mathcal{V},\mathcal{E})$ with a Laplacian matrix $L$.
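For readers less familiar with graph Laplacians, the following minimal sketch builds $L = D - A$ for an undirected graph with unit edge weights and checks connectivity through the second-smallest eigenvalue. The helper name `laplacian` and the four-follower ring topology are illustrative assumptions, not the topology used in this work.

```python
import numpy as np

def laplacian(n, edges):
    """Return L = D - A for an undirected graph on n nodes with unit weights."""
    A = np.zeros((n, n))
    for i, j in edges:
        A[i, j] = A[j, i] = 1.0  # undirected: symmetric adjacency
    return np.diag(A.sum(axis=1)) - A

# Hypothetical ring of four followers: 0-1-2-3-0.
L = laplacian(4, [(0, 1), (1, 2), (2, 3), (3, 0)])
eigenvalues = np.linalg.eigvalsh(L)  # ascending order for symmetric matrices
is_connected = eigenvalues[1] > 1e-9  # second-smallest eigenvalue is positive
```

Each row of $L$ sums to zero, and the graph is connected exactly when the second-smallest eigenvalue is positive.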

2. Related Work and Proposed Architecture

Existing methods for swarm navigation include optimization-based planners, heuristic algorithms, geometric guidance, artificial potential fields (APF), and machine learning. APF methods are reactive and simple but prone to local minima (e.g., drones getting stuck). Classical planning methods require prior maps, unsuitable for unknown formation drone light show environments. Reinforcement Learning (RL), particularly Deep Deterministic Policy Gradient (DDPG), offers a model-free approach for continuous control but suffers from training inefficiency and overestimation bias in complex multi-obstacle settings.

We propose a hierarchical, centralized-training-decentralized-execution architecture. The leader employs an enhanced RL policy for intelligent, global-aware navigation. The followers use a distributed control law based on APF and consensus theory, ensuring they adapt the formation locally to avoid obstacles and each other while following the leader’s trajectory. This synergy aims to create a resilient and adaptive system for a formation drone light show.

3. Greedy-Enhanced Deep Reinforcement Learning for the Leader

The leader’s role is to find a safe, smooth path to the target. We formulate this as a Partially Observable Markov Decision Process (POMDP) where the state is derived from a processed perception of the environment.

3.1 State, Action, and Reward Design

The state vector for the leader is designed to be informative yet compact:

$$
s_t = \left( \frac{d_{\text{target}}}{k_r},\ \tanh(\|F_{\text{rep}}\|_2),\ \theta_{\text{target}},\ \theta_{F},\ \theta_{\text{heading}} \right)^T.
$$

Here, $d_{\text{target}}$ is the distance to the goal, scaled by the constant $k_r$, and $F_{\text{rep}}$ is the total repulsive force from nearby obstacles, calculated using an APF. The angles $\theta_{\text{target}}$, $\theta_{F}$, and $\theta_{\text{heading}}$ are the bearing to the target, the direction of the resultant repulsive force, and the current heading, respectively. The $\tanh$ normalization bounds the force magnitude and stabilizes learning. The action $a_t = \dot{\theta}$ is the heading angular rate, a continuous output. Velocity magnitude is managed separately by a simple APF-inspired controller for efficiency.
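One possible way to assemble this state vector is sketched below. The text does not give the exact form of the repulsive field, so the sketch assumes a classical Khatib-style repulsive potential; the function names, the influence range `rho0`, the gain `eta`, and the value of `k_r` are all hypothetical.

```python
import numpy as np

def repulsive_force(pos, obstacles, rho0=50.0, eta=1.0):
    """Sum of repulsive forces from circular obstacles given as (cx, cy, r).

    Assumed form: eta * (1/rho - 1/rho0) / rho**2 along the outward normal,
    active only when the surface distance rho is below rho0.
    """
    total = np.zeros(2)
    for cx, cy, r in obstacles:
        diff = pos - np.array([cx, cy])
        dist = max(np.linalg.norm(diff), 1e-9)  # guard against division by zero
        rho = max(dist - r, 1e-6)               # distance to the obstacle surface
        if rho < rho0:
            total += eta * (1.0 / rho - 1.0 / rho0) / rho**2 * diff / dist
    return total

def leader_state(pos, heading, target, obstacles, k_r=1000.0):
    """Build s_t = (d/k_r, tanh(|F_rep|), theta_target, theta_F, theta_heading)."""
    f_rep = repulsive_force(pos, obstacles)
    to_target = target - pos
    return np.array([
        np.linalg.norm(to_target) / k_r,
        np.tanh(np.linalg.norm(f_rep)),
        np.arctan2(to_target[1], to_target[0]),
        np.arctan2(f_rep[1], f_rep[0]),  # evaluates to 0 when no force acts
        heading,
    ])
```

With no obstacle inside the influence range, the second and fourth components are zero, so the policy sees only goal-related information.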

The reward function $R_t$ is critical for guiding learning. We use a shaped reward combining:

Action Penalty $R_a$: Penalizes extreme turns for smooth formation drone light show paths.
Obstacle Force $R_F$: A bounded penalty based on the magnitude of repulsive forces.
Heading Penalty $R_{\theta}$: Encourages pointing toward the goal when no obstacles are near.
Progress Reward $R_r$: Encourages decreasing distance and increasing velocity toward the goal. $R_r = e^{-d_{\text{target}}/200} + \dot{d}_{\text{target}}/5$.
Terminal Reward $R_{\text{end}}$: A large positive reward for reaching close to the target, scaled by final proximity, and a large negative reward for collision or timeout.

3.2 Greedy-DDPG Algorithm

Standard DDPG uses an actor network $\mu(s|\theta^\mu)$ to output actions and a critic network $Q(s,a|\theta^Q)$ to evaluate them, both updated off-policy by gradient descent on mini-batches sampled from a replay buffer. Exploration is driven by adding Ornstein-Uhlenbeck noise $N_t$ to the actions: $a_t = \mu(s_t|\theta^\mu) + N_t$. This can be inefficient early in training.

We introduce a Greedy-DDPG modification to accelerate learning and improve policy quality. Instead of executing a single noisy action, the agent generates a candidate set of $K$ actions: $A_t = \{\mu(s_t|\theta^\mu) + N_t^{(k)}\}_{k=1}^K$. The critic network evaluates each: $Q^{(k)} = Q(s_t, a_t^{(k)}|\theta^Q)$. The key is the greedy selection rule:

$$
a_t =
\begin{cases}
\arg\max_{a \in A_t} Q(s_t, a|\theta^Q), & \text{with probability } \epsilon_1,\\
\arg\min_{a \in A_t} Q(s_t, a|\theta^Q), & \text{with probability } \epsilon_2(t),\\
\text{a randomly chosen } a \in A_t, & \text{otherwise}.
\end{cases}
$$

The probability $\epsilon_2(t)$ decays over episodes. This mechanism serves dual purposes: 1) Early on, selecting the action deemed “best” or “worst” by the immature critic provides stronger, more varied gradients for faster critic convergence. 2) Later, it helps mitigate the overestimation bias common in DDPG by occasionally exploring actions the critic undervalues. This leads to more robust policy learning for the leader in a formation drone light show.
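The selection rule can be sketched as follows. This is an illustrative reading rather than the training code: the critic is passed in as a plain callable, Gaussian noise stands in for Ornstein-Uhlenbeck noise, the decay of $\epsilon_2$ is assumed to be multiplicative per episode, and the candidate count `K`, the noise scale, and the action bound `a_max` are hypothetical values.

```python
import numpy as np

def epsilon2(episode, eps2_0=0.1, decay=0.95):
    """Assumed schedule: eps2 shrinks by a constant factor each episode."""
    return eps2_0 * decay ** episode

def greedy_select(actor_action, critic, state, episode, rng,
                  K=8, eps1=0.3, noise_std=0.1, a_max=1.0):
    """Pick one action from K noisy candidates around the actor output."""
    # Gaussian perturbations stand in for OU noise in this sketch.
    noise = noise_std * rng.standard_normal(K)
    candidates = np.clip(actor_action + noise, -a_max, a_max)
    q_values = np.array([critic(state, a) for a in candidates])
    u = rng.random()
    if u < eps1:                        # exploit: highest critic value
        return candidates[np.argmax(q_values)]
    if u < eps1 + epsilon2(episode):    # probe: lowest critic value
        return candidates[np.argmin(q_values)]
    return candidates[rng.integers(K)]  # otherwise: uniform random candidate
```

A call might look like `greedy_select(0.0, critic, state, episode=3, rng=np.random.default_rng(0))`, where `critic(state, action)` returns a scalar estimate of $Q$.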

Table 1: Hyperparameters for Greedy-DDPG Training
Parameter	Value
Actor Learning Rate	0.001
Critic Learning Rate	0.001
Replay Buffer Size	50,000
Mini-batch Size	128
Discount Factor ($\gamma$)	0.99
Greedy Selection Prob. $\epsilon_1$	0.3
Initial Greedy Prob. $\epsilon_2(0)$	0.1
$\epsilon_2$ Decay Rate	0.95 per episode
Target Network Update Rate ($\tau$)	0.001

4. Consensus-Based Formation Control for Followers

While the leader learns a high-level strategy, followers must react in real-time. We design a control law that combines consensus-based formation tracking with APF-based reactive obstacle/colleague avoidance. For follower $i$, the control input is:

$$
u_i = -\sum_{j \in \mathcal{N}_i} b_{ij}[\gamma_0 (\hat{x}_i - \hat{x}_j) + \gamma_1 (\hat{v}_i - \hat{v}_j)] - h_i [\gamma_0 \hat{x}_i + \gamma_1 \hat{v}_i] + f_i^{\text{APF}}.
$$

where $\hat{x}_i = x_i - x_i^{\text{des}}(t)$ and $\hat{v}_i = v_i - v^{\text{des}}$ are the formation tracking errors relative to the leader’s commanded state, $b_{ij}$ and $h_i$ are communication link weights, and $\gamma_0, \gamma_1 > 0$ are control gains. The term $f_i^{\text{APF}}$ is the repulsive force from obstacles and from other drones within a safety radius, which are treated as moving obstacles in the APF calculation. In open space, this formulation drives the followers to consensus on the desired formation. When obstacles or other drones are detected, the APF term $f_i^{\text{APF}}$ temporarily dominates the consensus pull, so the formation deforms safely and then elastically returns to its shape, a vital behavior for a formation drone light show navigating tight spaces.
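A direct transcription of this control law into code might look like the sketch below. The array layout, the function name `follower_control`, and the default gain values are assumptions for illustration; the APF term is taken as a precomputed input.

```python
import numpy as np

def follower_control(i, x_hat, v_hat, B, h, f_apf, gamma0=1.0, gamma1=2.0):
    """Control input u_i for follower i.

    x_hat, v_hat: (n, 2) arrays of position and velocity tracking errors.
    B: (n, n) matrix of link weights b_ij (zero where there is no link).
    h: (n,) weights of each follower's link to the leader.
    f_apf: (2,) repulsive force acting on follower i.
    """
    u = np.zeros(2)
    for j in np.flatnonzero(B[i]):  # neighbours of follower i
        u -= B[i, j] * (gamma0 * (x_hat[i] - x_hat[j])
                        + gamma1 * (v_hat[i] - v_hat[j]))
    u -= h[i] * (gamma0 * x_hat[i] + gamma1 * v_hat[i])  # leader tracking term
    return u + f_apf
```

When all tracking errors and the APF force are zero the input vanishes, which matches the open-space behavior described above.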

Figure: A formation drone light show forming a complex geometric pattern in the night sky, illustrating the need for precise, collision-free coordination.

Theorem (Formation Stability): For the follower system under the control law above with $\gamma_0, \gamma_1 > 0$, and assuming the leader’s trajectory is bounded and twice differentiable, the follower tracking errors $\hat{x}_i, \hat{v}_i$ are globally uniformly ultimately bounded. Furthermore, in the absence of obstacle forces ($f_i^{\text{APF}} = 0$), the system asymptotically achieves the desired formation, i.e., $\hat{x}_i \to 0, \hat{v}_i \to 0$ as $t \to \infty$.

Proof Sketch: Consider the Lyapunov candidate function for the networked system:
$$ V = \frac{1}{2} \sum_i (\gamma_0 \hat{x}_i^T \hat{x}_i + \hat{v}_i^T \hat{v}_i) + \frac{1}{2} \sum_{i,j} b_{ij} \gamma_0 (\hat{x}_i - \hat{x}_j)^T(\hat{x}_i - \hat{x}_j). $$
Taking its derivative along the system trajectories and substituting the control law yields $\dot{V} \leq -\gamma_1 \sum_i \hat{v}_i^T \hat{v}_i + \sum_i \hat{v}_i^T f_i^{\text{APF}}$. The first term is negative definite in the velocity errors. The second term is bounded because the APF force is bounded by design for distances greater than the safety radius. Standard Lyapunov and ultimate boundedness arguments then establish the theorem.
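The boundedness step can be made slightly more explicit under one added assumption, namely a uniform bound $\|f_i^{\text{APF}}\|_2 \leq f_{\max}$ (the symbol $f_{\max}$ is introduced here for illustration). Applying Young's inequality to each cross term in the bound on $\dot{V}$ above gives

$$
\hat{v}_i^T f_i^{\text{APF}} \leq \frac{\gamma_1}{2}\|\hat{v}_i\|_2^2 + \frac{1}{2\gamma_1}\|f_i^{\text{APF}}\|_2^2,
\qquad\text{so}\qquad
\dot{V} \leq -\frac{\gamma_1}{2}\sum_i \|\hat{v}_i\|_2^2 + \frac{n f_{\max}^2}{2\gamma_1}.
$$

Hence $\dot{V} < 0$ whenever $\sum_i \|\hat{v}_i\|_2^2 > n f_{\max}^2 / \gamma_1^2$. This directly bounds only the velocity errors; extending the conclusion to the position errors needs the further arguments that the proof sketch refers to.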

5. Experimental Simulation and Results

We validate our framework in a simulated 2D environment with circular obstacles. The leader is trained using Greedy-DDPG over many randomized episodes with varying start/goal positions and obstacle configurations.

5.1 Leader Training Performance

Training curves show that our Greedy-DDPG converges approximately 5.9% faster (in wall-clock time) to a high-performance policy than standard DDPG, reaching a target average episode reward of 8000 in 170 episodes. More importantly, the final policy is more robust. Monte Carlo tests in three unseen environments demonstrate its superior generalization:

Table 2: Success Rate (%) in Unseen Environments (Monte Carlo)
Environment Scenario	Greedy-DDPG	Standard DDPG
Random Obstacles (Trained)	97.5%	85.2%
Large, Sparse Obstacles	95.5%	72.7%
Small, Dense Obstacles	96.4%	86.8%

In a specific test scenario, while a pure APF-based leader exhibited oscillatory heading changes when encountering multiple obstacles, the Greedy-DDPG leader produced a smoother, more decisive path, maintaining a safer clearance from obstacles—a crucial trait for a formation drone light show where the leader’s path defines the swarm’s airspace.

5.2 Full Formation Drone Light Show Demonstration

We integrate the trained Greedy-DDPG leader policy with the follower consensus-APF controller for a 5-drone swarm (1 leader, 4 followers in a diamond formation). The swarm must navigate from a start point to a target through a field of five circular obstacles.

The results are compelling. The leader charts a smooth course. The followers successfully maintain the formation in open space, deform it elastically to squeeze between obstacles or pass around them, and promptly re-establish the formation afterwards. Critically, all inter-drone distances remain above the safety threshold $r_{\text{safe}}$, and all drones avoid obstacles. The formation error, defined as the mean squared error of followers from their desired positions, remains low throughout the flight, confirming stability. In contrast, a baseline where both leader and followers use only APF methods results in significant formation distortion, occasional “jittering” before obstacles, and longer transit times, which would be unacceptable for a synchronized formation drone light show.
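The two quantities checked above, formation error and minimum inter-drone distance, could be computed from logged positions roughly as follows. The function names and array shapes are assumptions for this sketch; taking the square root of the mean squared error gives a value in metres.

```python
import numpy as np

def formation_mse(positions, desired):
    """Mean squared distance of followers from their desired slots.

    positions, desired: (n, 2) arrays for a single time step.
    """
    return float(np.mean(np.sum((positions - desired) ** 2, axis=1)))

def min_pairwise_distance(positions):
    """Smallest distance between any two distinct drones at one time step."""
    diff = positions[:, None, :] - positions[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    np.fill_diagonal(dist, np.inf)  # ignore each drone's distance to itself
    return float(dist.min())
```

A safety check at each time step then reduces to comparing `min_pairwise_distance(positions)` against $r_{\text{safe}}$.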

Table 3: Performance Comparison for Formation Flight
Metric	Our Method (Greedy-DDPG + Consensus-APF)	Pure APF Method (All Drones)
Average Formation Error (m)	< 5.0	> 15.0
Min. Drone-Obstacle Distance (m)	17.8	9.1
Mission Completion Time (s)	42.3	58.7
Collision-Free Episodes	100%	85%

6. Conclusion and Future Work

In this work, we presented an intelligent hierarchical control system designed to enable safe and reliable formation drone light shows in completely unknown, obstacle-populated environments. Our core contributions are twofold: 1) The Greedy-DDPG algorithm, which enhances the training efficiency and final policy robustness of the leader’s neural navigation controller through an intelligent action selection mechanism. 2) A distributed follower control law that seamlessly integrates consensus-based formation keeping with artificial potential fields for reactive obstacle and collision avoidance. Simulation experiments confirm that our framework allows a drone swarm to navigate complex unknown spaces while preserving formation integrity far better than baseline methods, paving the way for more ambitious and dynamic outdoor formation drone light show performances.

Future work will focus on extending the framework to fully distributed communication architectures, integrating more realistic drone aerodynamic models, and testing the system in 3D environments with dynamic obstacles, further pushing the boundaries of what is possible in autonomous aerial entertainment and other multi-drone applications.