In recent years, the demand for sophisticated aerial displays has surged, with formation drone light shows becoming a captivating application in entertainment, advertising, and public events. These shows rely on precise coordination of multiple unmanned aerial vehicles (UAVs) to create dynamic patterns and visuals in the sky. However, controlling a fleet of drones to achieve such formations poses significant challenges, including collision avoidance, trajectory planning, and adaptive behavior in dynamic environments. Traditional control methods often require extensive prior knowledge and may struggle with scalability and real-time adjustments. To address these limitations, this article explores the use of multi-agent deep reinforcement learning, specifically the Multi-Agent Deep Deterministic Policy Gradient (MADDPG) algorithm, enhanced with curriculum learning, to enable robust and efficient control for formation drone light shows. By leveraging centralized training and decentralized execution, this approach allows drones to learn collaborative strategies while adapting to complex scenarios, ultimately paving the way for more innovative and reliable aerial performances.
The core of this work lies in adapting MADDPG for formation drone light show tasks, where multiple UAVs must navigate from initial positions to target points while maintaining a specific formation shape, such as a triangle or geometric pattern. The algorithm’s ability to handle continuous action spaces and multi-agent interactions makes it ideal for this domain. However, training such systems can be difficult due to non-stationary environments and high-dimensional state spaces. To mitigate these issues, curriculum learning is incorporated, breaking down the task into progressively challenging stages. This method not only accelerates convergence but also improves the overall stability of the learning process. In this article, I will detail the design of the state and action spaces, reward functions, and neural network architectures, followed by extensive simulations and real-world experiments to validate the approach. The goal is to demonstrate how advanced reinforcement learning techniques can revolutionize the control systems behind formation drone light shows, making them more autonomous and resilient.

Formation drone light shows have gained popularity due to their ability to create mesmerizing aerial art, but they require precise coordination among UAVs. Each drone must follow a predefined path while avoiding collisions and adapting to environmental disturbances. Traditional approaches, such as leader-follower or virtual structure methods, often rely on explicit models and may fail in uncertain conditions. Reinforcement learning offers a promising alternative by enabling drones to learn optimal policies through trial and error. Specifically, MADDPG extends the Deep Deterministic Policy Gradient (DDPG) algorithm to multi-agent settings, where each agent’s critic network considers global information during training, but actors operate based on local observations. This framework is particularly suited for formation drone light shows, as it allows for centralized optimization of group behavior while maintaining decentralized execution, reducing communication overhead during performances.
In this study, I focus on a 2D environment where three UAVs operate at a fixed altitude, controlling their linear velocities in the x and y directions. The objective is to guide them from random initial positions to target points that form an equilateral triangle, mimicking common patterns in formation drone light shows. The state space for each drone includes its position relative to the target and velocity components, while the action space consists of continuous velocity commands. To encourage cooperative behavior, the reward function incorporates multiple components, such as distance to target, collision penalties, and completion bonuses. By designing a curriculum that gradually reduces the acceptance threshold for reaching targets, the training process becomes more manageable, leading to faster convergence and better performance. This approach not only enhances the reliability of formation drone light shows but also provides a scalable solution for larger fleets.
The MADDPG algorithm is grounded in the actor-critic architecture, with each agent maintaining its own policy and value networks. For a system with $n$ agents, let $\pi = [\pi_1, \ldots, \pi_n]$ denote the policies parameterized by $\theta = [\theta_1, \ldots, \theta_n]$. The cumulative expected return for agent $i$ is given by:
$$J(\theta_i) = \mathbb{E}[R_i] = \mathbb{E}_{s \sim p^\pi, a_i \sim \pi_{\theta_i}} \left[ \sum_{t=0}^{\infty} \gamma^t r_{i,t} \right]$$
where $r_{i,t}$ is the reward at time $t$, and $\gamma$ is the discount factor. For deterministic policies $\mu_{\theta_i}$, the gradient can be expressed as:
$$\nabla_{\theta_i} J(\mu_i) = \mathbb{E}_{x, a \sim D} \left[ \nabla_{\theta_i} \mu_i(a_i | o_i) \nabla_{a_i} Q^{\mu}_i (x, a_1, \ldots, a_n) \big|_{a_i = \mu_i(o_i)} \right]$$
Here, $x = [o_1, \ldots, o_n]$ represents the joint observation vector, $D$ is the replay buffer storing experiences $(x, x', a_1, \ldots, a_n, r_1, \ldots, r_n)$, and $Q^{\mu}_i$ is the centralized action-value function for agent $i$. The update rule for the critic network involves minimizing the loss:
$$\mathcal{L}(\theta_i) = \mathbb{E}_{x, a, r, x'} \left[ \left( Q^{\mu}_i(x, a_1, \ldots, a_n) - y \right)^2 \right]$$
with target $y$ defined as:
$$y = r_i + \gamma Q^{\mu'}_i \left( x', a'_1, \ldots, a'_n \right) \big|_{a'_j = \mu'_j(o'_j)}$$
where $\mu'$ denotes the target policies with softly updated parameters. This formulation ensures stability in training, even as policies evolve, making it suitable for dynamic formation drone light show environments.
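To make the update equations concrete, below is a minimal PyTorch sketch of one MADDPG training step. It assumes each agent object bundles its actor, critic, target networks, and optimizers; the attribute names (`agent_i.critic`, `target_actor`, the batch layout) are illustrative placeholders rather than the exact implementation used here.

```python
import torch
import torch.nn.functional as F

def maddpg_update(agent_i, agents, batch, gamma=0.85, tau=0.005):
    """One MADDPG step for agent i: centralized critic, decentralized actor."""
    obs, actions, rewards, next_obs = batch  # lists of per-agent tensors, shape [B, dim]

    # Critic target: y = r_i + gamma * Q'_i(x', a'_1, ..., a'_n) with a'_j = mu'_j(o'_j)
    with torch.no_grad():
        next_actions = [ag.target_actor(o) for ag, o in zip(agents, next_obs)]
        q_next = agent_i.target_critic(torch.cat(next_obs + next_actions, dim=-1))
        y = rewards[agent_i.index] + gamma * q_next

    # Critic update: minimize the squared TD error against y
    q = agent_i.critic(torch.cat(obs + actions, dim=-1))
    critic_loss = F.mse_loss(q, y)
    agent_i.critic_optim.zero_grad()
    critic_loss.backward()
    agent_i.critic_optim.step()

    # Actor update: ascend the critic's gradient w.r.t. agent i's own action
    current_actions = [a.detach() for a in actions]
    current_actions[agent_i.index] = agent_i.actor(obs[agent_i.index])
    actor_loss = -agent_i.critic(torch.cat(obs + current_actions, dim=-1)).mean()
    agent_i.actor_optim.zero_grad()
    actor_loss.backward()
    agent_i.actor_optim.step()

    # Soft update of target networks: theta' <- tau * theta + (1 - tau) * theta'
    for target, online in ((agent_i.target_actor, agent_i.actor),
                           (agent_i.target_critic, agent_i.critic)):
        for tp, p in zip(target.parameters(), online.parameters()):
            tp.data.mul_(1.0 - tau).add_(tau * p.data)
```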
To further illustrate the algorithm’s components, I summarize the key design aspects in Table 1. The state space includes each drone’s position error and velocity, while the action space outputs normalized velocity commands. The reward function is multi-faceted, incorporating dense rewards for guidance and penalties for collisions.
| Component | Description | Mathematical Expression |
|---|---|---|
| State Space for UAV $i$ | Position error relative to target and velocity vector | $s_i = (x_i - x_{i,\text{expect}}, y_i - y_{i,\text{expect}}, v_{x_i}, v_{y_i})$ |
| Joint Observation | Concatenated states of all drones | $x = [s_1, s_2, s_3]$ |
| Action Space | Normalized velocity commands in x and y directions | $a_i = (v_{x_i}, v_{y_i}) \in [-1, 1]^2$ |
| Reward Function | Sum of multiple components encouraging cooperation | $r_i = r_{\text{single},i} + \sum_{j \neq i} r_{\text{danger},ij} + r_{\text{done},i} + r_{\text{bound},i} + r_{\text{near},i} + r_{\text{nearorfar},i}$ |
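Before turning to the reward terms, here is a minimal NumPy sketch of how the per-drone state and the joint observation from Table 1 could be assembled; the function names are illustrative.

```python
import numpy as np

def drone_state(pos, target, vel):
    """s_i = (x_i - x_expect, y_i - y_expect, v_x, v_y), as in Table 1."""
    return np.array([pos[0] - target[0], pos[1] - target[1], vel[0], vel[1]])

def joint_observation(positions, targets, velocities):
    """x = [s_1, s_2, s_3]: concatenation of all per-drone states."""
    return np.concatenate([drone_state(p, t, v)
                           for p, t, v in zip(positions, targets, velocities)])

# With three drones the joint observation has 3 * 4 = 12 entries,
# matching the actor input dimension reported later.
```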
The reward components are designed to mimic the requirements of a formation drone light show, where drones must smoothly transition to targets while avoiding conflicts. For example, $r_{\text{single},i}$ encourages proximity to the target:
$$r_{\text{single},i} = c_1 \left( |x_i - x_{i,\text{expect}}| + |y_i - y_{i,\text{expect}}| \right)$$
Collision avoidance is enforced through $r_{\text{danger},ij}$, which applies penalties based on distance $d_{ij}$ between drones $i$ and $j$:
$$r_{\text{danger},ij} =
\begin{cases}
c_2 (s_1 - d_{ij}), & \text{if } s_2 < d_{ij} < s_1 \\
-100, & \text{if } d_{ij} < s_2
\end{cases}$$
where $s_1$ and $s_2$ are warning and collision thresholds, respectively. Completion rewards $r_{\text{done},i}$ are given when a drone reaches its target within a tolerance $d_{\text{done}}$:
$$r_{\text{done},i} = 200 \quad \text{if } \sqrt{(x_i - x_{i,\text{expect}})^2 + (y_i - y_{i,\text{expect}})^2} < d_{\text{done}}$$
Additional terms like $r_{\text{near},i}$ and $r_{\text{nearorfar},i}$ provide dense feedback to refine behavior, essential for the precise movements required in formation drone light shows.
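As a concrete reading of the three terms defined so far, here is a minimal Python sketch. The constants are placeholders, and the negative signs of $c_1$ and $c_2$ (so that both terms act as penalties) are assumptions consistent with the descriptions above.

```python
import math

def r_single(pos, target, c1=-1.0):
    """Distance-shaping term: penalizes the Manhattan error to the target (c1 < 0 assumed)."""
    return c1 * (abs(pos[0] - target[0]) + abs(pos[1] - target[1]))

def r_danger(d_ij, s1=2.0, s2=0.5, c2=-10.0):
    """Safety term: graded penalty inside the warning band, -100 on collision,
    and (assumed) no penalty beyond the warning radius s1."""
    if d_ij < s2:
        return -100.0
    if d_ij < s1:
        return c2 * (s1 - d_ij)
    return 0.0

def r_done(pos, target, d_done=1.0):
    """Completion bonus when the drone is within the current acceptance threshold."""
    dist = math.hypot(pos[0] - target[0], pos[1] - target[1])
    return 200.0 if dist < d_done else 0.0
```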
Curriculum learning is integrated to address the convergence challenges in multi-agent reinforcement learning. The task is decomposed into stages with progressively stricter acceptance thresholds $d_{\text{done}}$. Initially, drones are trained with a loose threshold (e.g., 3 meters), allowing them to easily achieve success and accumulate positive rewards. Once policies stabilize, the threshold is reduced (e.g., to 2 meters, then 1 meter), with previous networks serving as initializations. This staged approach mimics how human learners tackle complex tasks step-by-step, and it has proven effective in accelerating training for formation drone light show scenarios. The pseudo-code for this curriculum MADDPG algorithm is outlined in Algorithm 1, highlighting the iterative nature of the process.
| Step | Description |
|---|---|
| 1 | Decompose the formation drone light show task into $n$ subtasks with increasing difficulty, i.e., progressively stricter acceptance thresholds $d_{\text{done}}$. |
| 2 | For each subtask $j = 1$ to $n$: train the MADDPG agents until their policies converge, initializing the actor and critic networks from the policies learned in subtask $j-1$ (random initialization for $j = 1$). |
| 3 | Output the final policies for deployment in formation drone light shows. |
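As a companion to Algorithm 1, the following is a minimal Python sketch of the staged training loop; `initialize_agents`, `train_maddpg_episode`, and `env.set_acceptance_threshold` are hypothetical helpers standing in for the MADDPG machinery and the environment API.

```python
def curriculum_maddpg(env, thresholds=(3.0, 2.0, 1.0), episodes_per_stage=200):
    """Staged MADDPG training: each stage tightens d_done and reuses the
    previous stage's networks as a warm start (Algorithm 1)."""
    agents = initialize_agents()                      # fresh actor/critic networks
    for stage, d_done in enumerate(thresholds):
        env.set_acceptance_threshold(d_done)          # subtask j: looser -> stricter tolerance
        for episode in range(episodes_per_stage):
            train_maddpg_episode(env, agents)         # standard MADDPG rollout + updates
        # Networks carry over to the next stage unchanged, acting as initialization.
    return [agent.actor for agent in agents]          # final policies for deployment
```

Nothing else in the learning loop changes between stages; only the acceptance threshold tightens, which is what makes the warm start effective.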
Experiments were conducted in a Software-in-the-Loop (SITL) simulation environment, which closely mimics real-flight conditions by integrating Gazebo for visualization and Mission Planner for ground control. This setup is ideal for testing formation drone light show algorithms before real-world deployment. The simulation involved three drones starting at positions (0, 0), (0, -3), and (0, 3), with target points forming an equilateral triangle centered at the origin, a common pattern in formation drone light shows. The neural networks comprised fully connected layers: the actor network used layer sizes [12, 64, 32, 16, 2] with ReLU activations on the hidden layers and Tanh on the output layer, while the critic network used [18, 64, 32, 32, 1] with ReLU activations. Key hyperparameters are summarized in Table 2, optimized for stability and efficiency in formation drone light show tasks.
| Parameter | Value |
|---|---|
| Batch Size | 64 |
| Actor Learning Rate | 0.0005 |
| Critic Learning Rate | 0.001 |
| Discount Factor ($\gamma$) | 0.85 |
| Control Period | 0.3 s |
| Collision Threshold | 0.5 m |
| Replay Buffer Size | 15000 |
| Target Network Update Rate ($\tau$) | 0.005 |
| Exploration Noise | Ornstein-Uhlenbeck process |
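To make the architecture description concrete, here is a minimal PyTorch sketch of the two networks. The linear critic output and the exact placement of activations are assumptions consistent with the description above; the Tanh output keeps actor commands in $[-1, 1]^2$, matching the action space in Table 1.

```python
import torch.nn as nn

class Actor(nn.Module):
    """Maps the 12-dimensional observation to a 2-dimensional velocity command in [-1, 1]^2."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(12, 64), nn.ReLU(),
            nn.Linear(64, 32), nn.ReLU(),
            nn.Linear(32, 16), nn.ReLU(),
            nn.Linear(16, 2), nn.Tanh(),   # bounded velocity commands
        )

    def forward(self, obs):
        return self.net(obs)

class Critic(nn.Module):
    """Centralized critic: 12-dim joint observation + 6-dim joint action -> scalar Q-value."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(18, 64), nn.ReLU(),
            nn.Linear(64, 32), nn.ReLU(),
            nn.Linear(32, 32), nn.ReLU(),
            nn.Linear(32, 1),              # linear output for the Q-value (assumed)
        )

    def forward(self, joint_obs_and_actions):
        return self.net(joint_obs_and_actions)
```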
The training results demonstrated the efficacy of curriculum learning. With thresholds set at 3m, 2m, and 1m, convergence was achieved within 200 episodes per stage, whereas direct training at 1m failed to converge even after 1500 episodes. This highlights how curriculum learning mitigates the exploration challenge in multi-agent settings, crucial for complex formation drone light show sequences. The reward curves showed steady improvement, with drones learning to avoid collisions and approach targets smoothly. Trajectories from the trained policy confirmed that all drones reached their destinations while maintaining safe distances, forming the desired triangle pattern essential for formation drone light shows.
To assess robustness, I varied key hyperparameters. Adjusting the discount factor $\gamma$ to 0.8, 0.9, and 0.95 still yielded convergent policies, as shown in Figure 1 (simulated reward graphs). Similarly, modifying learning rates for actor and critic networks (e.g., 0.002 and 0.0002) did not impede convergence, indicating the algorithm’s resilience to parameter choices. This robustness is vital for formation drone light shows, where environmental conditions may vary. Further, generalization tests were performed by changing target points to random locations while keeping initial positions fixed. In four distinct scenarios, the trained policies successfully guided drones to new targets, proving adaptability to different formation drone light show patterns. The reward graphs for these tests exhibited similar convergence trends, underscoring the algorithm’s versatility.
Real-world experiments were conducted using custom-built F450 drones equipped with Pixhawk 6c flight controllers and Raspberry Pi for onboard computation. This setup mirrors the simulation environment, allowing seamless transfer of trained policies. In outdoor tests, three drones executed the triangle formation from learned policies, avoiding collisions and achieving precise positioning. The flight patterns mirrored simulation results, validating the practicality of the approach for actual formation drone light shows. The integration of MADDPG with curriculum learning enabled autonomous coordination without explicit communication during execution, a significant advantage for large-scale shows. However, challenges such as wind disturbances and sensor noise were observed, suggesting areas for future improvement.
The success of this methodology opens avenues for expanding formation drone light shows to more complex patterns and larger fleets. The use of MADDPG facilitates collaborative decision-making, while curriculum learning ensures efficient training. Future work could incorporate 3D environments, dynamic obstacles, and real-time adaptation to audience inputs, enhancing the interactivity of formation drone light shows. Additionally, refining reward functions to prioritize energy efficiency or aesthetic smoothness could further optimize performances. From a technical perspective, exploring hybrid models that combine reinforcement learning with traditional control might address limitations like boundary overshoot, where drones exhibit excessive speed near targets.
In conclusion, this article presents a comprehensive framework for formation drone light show control using MADDPG augmented with curriculum learning. The algorithm’s design, including tailored state-action spaces and reward functions, enables effective multi-drone coordination. Simulations and real-flight tests confirm its viability, demonstrating collision-free navigation and accurate formation keeping. By leveraging advanced reinforcement learning techniques, this approach advances the autonomy and reliability of formation drone light shows, paving the way for more ambitious aerial displays. As technology evolves, such systems will likely become integral to entertainment and beyond, showcasing the synergy between artificial intelligence and robotic systems in creating mesmerizing sky art.
The mathematical foundations of MADDPG ensure stability in multi-agent environments, which is critical for formation drone light shows where drones must interact seamlessly. The policy gradient update can be derived from the deterministic policy gradient theorem, extended to multi-agent cases. For agent $i$, the gradient of the expected return with respect to policy parameters $\theta_i$ is:
$$\nabla_{\theta_i} J(\mu_i) = \mathbb{E}_{x, a \sim D} \left[ \nabla_{\theta_i} \mu_i(a_i | o_i) \nabla_{a_i} Q_i^{\mu}(x, a_1, \ldots, a_n) \big|_{a_i = \mu_i(o_i)} \right]$$
This relies on the centralized critic $Q_i^{\mu}$, which approximates the value of joint actions. In practice, to reduce communication, each agent can estimate other agents’ policies using approximation functions $\hat{\mu}_{\phi_j}^i$, minimizing a loss that includes entropy regularization:
$$\mathcal{L}(\phi_j^i) = -\mathbb{E}_{o_j, a_j} \left[ \log \hat{\mu}_{\phi_j}^i(a_j | o_j) + \lambda \mathcal{H}\left(\hat{\mu}_{\phi_j}^i\right) \right]$$
where $\mathcal{H}$ denotes entropy. This estimation allows for decentralized execution while maintaining collaborative benefits, a key feature for scalable formation drone light shows.
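A minimal PyTorch sketch of this approximation loss follows, assuming the approximate policy returns a `torch.distributions` object over agent $j$'s actions; the regularization weight `lam` is a placeholder value.

```python
def policy_approximation_loss(approx_policy, obs_j, actions_j, lam=0.001):
    """Loss for agent i's estimate of agent j's policy: negative log-likelihood
    of j's observed actions plus an entropy regularizer."""
    dist = approx_policy(obs_j)            # assumed to return a torch.distributions object
    log_prob = dist.log_prob(actions_j)    # log mu_hat(a_j | o_j)
    entropy = dist.entropy()
    return -(log_prob + lam * entropy).mean()
```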
In terms of performance metrics, the algorithm’s efficiency can be quantified using convergence speed and success rate. For the formation drone light show task, success is defined as all drones reaching targets within tolerance without collisions. With curriculum learning, success rates exceeded 90% after training, compared to below 50% without curriculum. This improvement underscores the value of staged learning in complex multi-agent tasks. Additionally, computational costs were manageable, with training times averaging 5 hours on a standard GPU for three drones, suggesting scalability for larger formations.
To further illustrate the reward structure, I provide a detailed breakdown in Table 3. Each component is calibrated to balance exploration and exploitation, ensuring drones learn behaviors conducive to formation drone light shows.
| Component | Purpose | Formula |
|---|---|---|
| $r_{\text{single},i}$ | Encourage movement toward target | $c_1 \left( |\Delta x_i| + |\Delta y_i| \right)$ |
| $r_{\text{danger},ij}$ | Prevent collisions between drones | Penalty based on distance $d_{ij}$ |
| $r_{\text{done},i}$ | Reward task completion | Fixed bonus if within $d_{\text{done}}$ |
| $r_{\text{bound},i}$ | Keep drones within boundaries | Large penalty if out of bounds |
| $r_{\text{near},i}$ | Maintain proximity to target area | Stepwise rewards for entering/staying near target |
| $r_{\text{nearorfar},i}$ | Encourage continuous approach | Reward for reducing distance, penalty for increasing |
The integration of these rewards creates a dense feedback signal, guiding drones through the intricate maneuvers required in formation drone light shows. For instance, $r_{\text{nearorfar},i}$ is defined as:
$$r_{\text{nearorfar},i} =
\begin{cases}
10, & \text{if distance decreases after action} \\
-10, & \text{if distance increases after action} \\
0.5, & \text{if distance continues decreasing} \\
-0.5, & \text{if distance continues increasing}
\end{cases}$$
This encourages consistent progress toward targets, akin to choreography in formation drone light shows.
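A small sketch of this progress term is given below. How "continues decreasing/increasing" is detected is an assumption: the previous two step distances are compared, so a change of direction yields the large $\pm 10$ reward and a continued trend yields the small $\pm 0.5$ reward.

```python
def r_nearorfar(dist_prev2, dist_prev, dist_now):
    """Progress-shaping term: +/-10 when the drone starts moving toward/away from
    its target, +/-0.5 while the trend continues (assumed reading of the cases)."""
    if dist_now < dist_prev:
        return 0.5 if dist_prev < dist_prev2 else 10.0
    if dist_now > dist_prev:
        return -0.5 if dist_prev > dist_prev2 else -10.0
    return 0.0
```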
In simulation, the environment was configured as a 40 m × 40 m area, representing a typical performance space for formation drone light shows. Drones were modeled with simplified dynamics, responding to velocity commands within [-1, 1] m/s per axis. The control period of 0.3 seconds ensured responsive updates, matching real-world flight controllers. Collision thresholds were set at $s_1 = 2$ m for warning and $s_2 = 0.5$ m for collision, with penalties tuned to prioritize safety. These settings mirror the constraints of actual formation drone light shows, where close formations are visually appealing but risky.
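The simplified dynamics amount to integrating the commanded velocity over one control period, roughly as in the sketch below; that the 40 m × 40 m area is centered at the origin is an assumption based on the start and target positions.

```python
import numpy as np

AREA_HALF = 20.0   # half-width of the 40 m x 40 m area, assumed centered at the origin
DT = 0.3           # control period in seconds

def step_drone(pos, action):
    """Point-mass model: the commanded velocity in [-1, 1] m/s per axis is applied
    directly for one control period (an assumed simplification)."""
    vel = np.clip(action, -1.0, 1.0)
    new_pos = pos + vel * DT
    out_of_bounds = np.any(np.abs(new_pos) > AREA_HALF)   # would trigger the r_bound penalty
    return new_pos, vel, out_of_bounds
```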
The curriculum learning approach specifically addressed the sparse reward problem common in reinforcement learning. By starting with easy tasks, drones quickly learned basic navigation, which bootstrapped more precise control in later stages. This method is analogous to training for formation drone light shows, where simple patterns are mastered before advancing to complex sequences. The algorithm’s performance was evaluated using average episode reward and success rate, both showing monotonic improvement across stages. Comparative analyses with baseline methods like DDPG or independent Q-learning revealed superior coordination in MADDPG, attributed to its centralized critics.
For real-world validation, the trained policies were deployed on drones via Robot Operating System (ROS) nodes, interfacing with the flight controller. The drones successfully performed the triangle formation multiple times, with trajectories logged for analysis. The results indicated sub-meter accuracy in positioning, sufficient for most formation drone light show applications. However, latency in communication and sensor noise occasionally caused oscillations, highlighting the need for robust filtering and faster inference. Future iterations could incorporate on-board neural network accelerators to mitigate these issues.
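A minimal sketch of such a deployment node is shown below, assuming a `rospy`-based setup; the topic name, message type, and observation source are illustrative placeholders, not the exact interfaces used on the drones.

```python
import rospy
import torch
from geometry_msgs.msg import Twist

def run_policy_node(actor, get_observation, rate_hz=3.3):
    """Publish velocity commands from the trained actor at roughly the 0.3 s control
    period. Topic name and get_observation() are hypothetical placeholders."""
    rospy.init_node('formation_policy')
    pub = rospy.Publisher('/drone1/cmd_vel', Twist, queue_size=1)   # hypothetical topic
    rate = rospy.Rate(rate_hz)
    while not rospy.is_shutdown():
        obs = torch.as_tensor(get_observation(), dtype=torch.float32)
        with torch.no_grad():
            vx, vy = actor(obs).numpy()
        cmd = Twist()
        cmd.linear.x, cmd.linear.y = float(vx), float(vy)
        pub.publish(cmd)
        rate.sleep()
```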
From an application perspective, this technology can transform formation drone light shows by enabling adaptive and resilient performances. For example, drones could dynamically adjust formations in response to weather or audience movement, creating interactive experiences. Moreover, the same framework could be extended to other multi-robot systems, such as search-and-rescue or agricultural monitoring, demonstrating the versatility of MADDPG with curriculum learning.
In summary, the fusion of MADDPG and curriculum learning offers a powerful solution for autonomous formation drone light show control. The algorithm’s design emphasizes collaboration through centralized training, while its execution remains decentralized, aligning with the practical demands of aerial displays. Through systematic experiments, this approach has proven effective in simulation and real flights, laying a foundation for future innovations. As research progresses, incorporating more drones and complex patterns will further push the boundaries of what is possible in formation drone light shows, ultimately creating more captivating and intelligent aerial spectacles.
