The coordination of multiple Unmanned Aerial Vehicles (UAVs) into a cohesive drone formation represents a significant frontier in autonomous systems, promising transformative impacts across diverse sectors. In disaster response, a drone formation can provide rapid aerial assessment, map extensive damage, and deliver critical supplies. For military operations, a coordinated drone formation enhances surveillance coverage, enables complex collaborative maneuvers, and improves overall mission resilience. The core challenge lies in developing control strategies that enable a group of UAVs to autonomously achieve and maintain a desired spatial configuration from arbitrary initial positions, while avoiding collisions and adapting to dynamic conditions.
Traditional methods for drone formation control, such as leader-follower, virtual structure, or behavior-based approaches, often rely heavily on accurate system models and environmental priors. Their performance can degrade under uncertainty or when the formation task is complex. Deep Reinforcement Learning (DRL) offers a paradigm shift. By learning optimal control policies directly through interaction with an environment, DRL agents can develop sophisticated, adaptive behaviors without explicit programming of low-level rules. This makes DRL exceptionally suitable for the nonlinear and high-dimensional problems inherent in multi-robot systems like drone formation.
While single-agent DRL has been applied to UAV navigation, it is fundamentally limited in multi-agent scenarios like drone formation. Treating each drone as an independent learner leads to a non-stationary environment; from any single drone’s perspective, the environment changes unpredictably as the other drones (which are also learning) update their policies. This instability often causes training to fail or converge to poor performance. The Multi-Agent Deep Deterministic Policy Gradient (MADDPG) algorithm was specifically conceived to address this challenge. It employs a centralized training with decentralized execution framework. During training, a centralized critic network for each agent has access to the observations and actions of all agents, stabilizing the learning process. During execution, each agent uses only its own local observations to make decisions, ensuring practicality in real-world applications with communication constraints. This paper explores the application of the MADDPG framework to the problem of cooperative drone formation control.
However, training a multi-agent system like a drone formation with MADDPG remains a formidable task. The joint action and state spaces grow exponentially with the number of agents, making the search for an effective policy akin to finding a needle in a haystack. To overcome this, we integrate the concept of Curriculum Learning (CL) into our reinforcement learning pipeline. The core idea is to break down the complex final task—precisely forming a specific shape—into a sequence of progressively more difficult sub-tasks. The agent first learns an easier version of the task (e.g., reaching a rough vicinity of the target), and this learned policy serves as an informed starting point for the next, harder stage (e.g., reaching a tighter vicinity). This staged learning approach dramatically reduces sample complexity and guides the agents toward successful policies for the ultimate drone formation goal.
Algorithmic Foundation: From DDPG to MADDPG
The Deep Deterministic Policy Gradient (DDPG) algorithm provides the foundation for MADDPG. Designed for continuous action spaces, DDPG is an actor-critic method that concurrently learns a deterministic policy (the actor, $\mu(s|\theta^\mu)$) and a state-action value function (the critic, $Q(s,a|\theta^Q)$). It utilizes four neural networks: the online actor and critic, and their corresponding target networks ($\mu'$ and $Q'$) for stable training. The critic is updated by minimizing the temporal-difference error, and the actor is updated by applying the chain rule to the expected return with respect to the actor parameters, using the gradient from the critic.
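As a minimal illustration of this structure, the PyTorch-style sketch below (ours, not the authors' implementation; the network modules, optimizers, and hyperparameter values are assumed) shows the critic's temporal-difference update, the actor update via the critic's gradient, and the soft update of the target networks.

```python
import torch
import torch.nn.functional as F

def ddpg_update(actor, critic, target_actor, target_critic,
                actor_opt, critic_opt, batch, gamma=0.99, tau=0.005):
    """One DDPG update step; `batch` holds tensors (s, a, r, s_next).
    Values such as gamma and tau are illustrative defaults, not the paper's."""
    s, a, r, s_next = batch

    # Critic: minimize the TD error against the target networks' bootstrap value.
    with torch.no_grad():
        y = r + gamma * target_critic(s_next, target_actor(s_next))
    critic_loss = F.mse_loss(critic(s, a), y)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor: ascend the critic's value of the actor's own action (chain rule).
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # Soft (Polyak) update of the target networks for stable training.
    for net, target in ((actor, target_actor), (critic, target_critic)):
        for p, p_t in zip(net.parameters(), target.parameters()):
            p_t.data.mul_(1 - tau).add_(tau * p.data)
```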
The extension to the multi-agent domain leads to the Multi-Agent Deep Deterministic Policy Gradient (MADDPG). Consider an environment with $N$ agents. Each agent $i$ has its own policy $\pi_i$, parameterized by $\theta_i$, which is a deterministic function $\mu_i(o_i)$ mapping its local observation $o_i$ to an action $a_i$. The key innovation is the structure of the critic. For agent $i$, the critic $Q_i^\mu(\mathbf{x}, a_1, \ldots, a_N)$ is now a centralized function that takes as input the joint observation $\mathbf{x} = (o_1, \ldots, o_N)$ and the joint action $(a_1, \ldots, a_N)$ of all agents. This allows the critic to learn the true value of an action within the context of what all other agents are doing, resolving the non-stationarity issue during training.
The gradient for the actor of agent $i$ can be derived as:
$$\nabla_{\theta_i} J(\mu_i) = \mathbb{E}_{\mathbf{x}, a \sim \mathcal{D}} \left[ \nabla_{\theta_i} \mu_i(a_i|o_i) \, \nabla_{a_i} Q_i^{\mu}(\mathbf{x}, a_1, \ldots, a_N) \big|_{a_i = \mu_i(o_i)} \right]$$
where $\mathcal{D}$ is a replay buffer containing joint experiences $(\mathbf{x}, \mathbf{x}', a_1, \ldots, a_N, r_1, \ldots, r_N)$. The centralized critic $Q_i^{\mu}$ is updated by minimizing the following loss:
$$\mathcal{L}(\theta_i) = \mathbb{E}_{\mathbf{x}, a, r, \mathbf{x}'} \left[ \left( Q_i^{\mu}(\mathbf{x}, a_1, \ldots, a_N) - y_i \right)^2 \right]$$
with the target $y_i$ given by:
$$y_i = r_i + \gamma \, Q_i^{\mu'}(\mathbf{x}', a_1', \ldots, a_N') \big|_{a_j' = \mu_j'(o_j')}$$
Here, $\mu' = \{\mu_1', \ldots, \mu_N'\}$ are the target policies with softly updated parameters. This formulation ensures that, given knowledge of all agents' actions, the environment dynamics are stationary from the perspective of the critic, enabling stable training even as policies evolve.
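The sketch below (again an illustrative PyTorch-style fragment, not the authors' code) translates these equations into an update step for agent $i$: the centralized critic receives the joint observation and joint action, the bootstrap target is built from the target policies evaluated on the next observations, and the actor gradient flows only through agent $i$'s own action.

```python
import torch
import torch.nn.functional as F

def maddpg_update_agent(i, actors, critics, target_actors, target_critics,
                        actor_opts, critic_opts, batch, gamma=0.85):
    """One MADDPG update for agent i. `batch` holds per-agent lists of tensors:
    (obs, act, rew, obs_next). Names and shapes are illustrative."""
    obs, act, rew, obs_next = batch
    x      = torch.cat(obs, dim=-1)       # joint observation x
    x_next = torch.cat(obs_next, dim=-1)  # joint next observation x'
    a_all  = torch.cat(act, dim=-1)       # joint action (a_1, ..., a_N)

    # Centralized critic: TD target uses the target policies on next observations.
    with torch.no_grad():
        a_next = torch.cat([target_actors[j](obs_next[j])
                            for j in range(len(actors))], dim=-1)
        y_i = rew[i] + gamma * target_critics[i](x_next, a_next)
    critic_loss = F.mse_loss(critics[i](x, a_all), y_i)
    critic_opts[i].zero_grad(); critic_loss.backward(); critic_opts[i].step()

    # Actor: substitute only agent i's action with its current policy output,
    # so the gradient flows through mu_i while other agents' actions stay fixed.
    act_i = [a.detach() for a in act]
    act_i[i] = actors[i](obs[i])
    actor_loss = -critics[i](x, torch.cat(act_i, dim=-1)).mean()
    actor_opts[i].zero_grad(); actor_loss.backward(); actor_opts[i].step()
```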
Curriculum Learning for Structured Formation Training
Training a drone formation policy from scratch with sparse rewards (e.g., a reward only upon perfect formation) is highly inefficient and often unsuccessful. We therefore employ a curriculum learning strategy to decompose the formation task. The final task requires three drones to form a precise equilateral triangle whose vertices lie 5 meters from the formation center. Success is defined by a strict threshold $d_{final}$ (e.g., each drone within 1 meter of its individual target point). The curriculum consists of three stages:
- Stage 1 – Loose Assembly: The success threshold is relaxed to $d_1 = 3m$. Agents receive reward for moving roughly towards their target positions. The primary goal is to learn basic collision avoidance and gross navigation without the pressure of precision.
- Stage 2 – Intermediate Formation: The threshold is tightened to $d_2 = 2m$. The policy trained in Stage 1 is used as the initialization. The agents now must refine their coordination to achieve a closer approximation of the final drone formation.
- Stage 3 – Precise Formation: The threshold is set to the final goal of $d_3 = 1m$. The policy from Stage 2 is fine-tuned to achieve the required precision, leveraging the foundational navigation and collision avoidance skills already acquired.
This hierarchical training is formalized in Algorithm 1. At each stage $j > 1$, the network parameters are initialized with the saved parameters from the previous stage $j-1$. For the first few episodes of the new stage, exploration noise is applied but the networks are not updated, allowing the replay buffer to be populated with successful experiences relevant to the new, slightly harder task. This provides a strong, informative starting point for the subsequent policy gradient updates, making the learning process for the complex drone formation task tractable.
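A minimal sketch of this staged procedure is given below; `make_env`, `make_agents`, and `train_stage` are hypothetical callables standing in for the simulation environment, the MADDPG agents, and the per-stage training loop of Algorithm 1, since the full algorithm is not reproduced here.

```python
def run_curriculum(make_env, make_agents, train_stage,
                   thresholds=(3.0, 2.0, 1.0), warmup_episodes=20):
    """Curriculum driver: each stage is warm-started from the previous stage's
    parameters and first fills the replay buffer without gradient updates.
    The number of warm-up episodes is an assumed placeholder."""
    params = None
    for stage, d_success in enumerate(thresholds, start=1):
        env = make_env(success_threshold=d_success)    # stage-specific task difficulty
        agents = make_agents(init_params=params)       # initialize from previous stage
        # Populate the buffer with noisy rollouts before any network updates.
        train_stage(env, agents, episodes=warmup_episodes, update=False)
        # Then run full training for this stage and keep the resulting parameters.
        params = train_stage(env, agents, update=True)
    return params
```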
System Design for Drone Formation Control
State and Action Representation
We consider the problem in a 2D plane at a fixed altitude. For a drone formation of $N=3$ agents, the local observation for drone $i$ is a vector containing its own positional error and velocity:
$$o_i = [\Delta x_i, \Delta y_i, v_{x_i}, v_{y_i}]$$
where $(\Delta x_i, \Delta y_i) = (x_i - x_i^{target}, y_i - y_i^{target})$. For the centralized critic during training, the joint observation is the concatenation of all local observations:
$$\mathbf{x} = [o_1, o_2, o_3]$$
The action for each drone is a 2D velocity command normalized to the range $[-1, 1]$, which is then scaled to the drone’s operational limits: $a_i = [\tilde{v}_{x_i}, \tilde{v}_{y_i}]$.
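To make the representation concrete, the short sketch below (illustrative NumPy code with our own variable names; the maximum speed `V_MAX` is an assumed placeholder) constructs the local observation, the joint observation used by the critics, and the scaled velocity command.

```python
import numpy as np

V_MAX = 1.0  # assumed maximum speed (m/s); the real operational limit is platform-specific

def local_observation(pos, vel, target):
    """o_i = [dx, dy, vx, vy]: positional error relative to the target plus own velocity."""
    return np.array([pos[0] - target[0], pos[1] - target[1], vel[0], vel[1]])

def joint_observation(observations):
    """x = [o_1, o_2, o_3]: concatenation used by the centralized critics during training."""
    return np.concatenate(observations)

def scale_action(a):
    """Map the normalized policy output in [-1, 1] to a velocity command."""
    return V_MAX * np.clip(a, -1.0, 1.0)
```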
Shaped Reward Function for Guided Learning
A dense, shaped reward function is critical for guiding the drone formation. The total reward $r_i$ for agent $i$ is a sum of several components, each designed to instill a specific behavior. The reward components are summarized in the table below and detailed thereafter.
| Component | Symbol | Purpose | Mathematical Form |
|---|---|---|---|
| Target Distance Reward | $r_{single}^i$ | Encourage movement towards target | $-c_1 \cdot (|\Delta x_i| + |\Delta y_i|)$ |
| Collision Avoidance Reward | $r_{danger}^{ij}$ | Penalize proximity to other drones | See Eq. (1) below |
| Formation Success Reward | $r_{done}^i$ | Large bonus for reaching target | $+R_{done}$ if $||(\Delta x_i, \Delta y_i)|| < d_{done}$ |
| Boundary Penalty | $r_{bound}^i$ | Penalize leaving allowed area | $-R_{bound}$ if out-of-bounds |
| Proximity Reward | $r_{near}^i$ | Manage behavior near target | Piecewise constant (enter/stay/leave) |
| Progress Reward | $r_{nearorfar}^i$ | Encourage consistent approach | Piecewise constant (improving/worsening) |
The collision reward between drone $i$ and $j$ uses an artificial potential field concept, providing a repulsive force that increases sharply at close range:
$$
r_{danger}^{ij} =
\begin{cases}
-c_2 \cdot (s_1 - d_{ij}), & \text{if } s_2 < d_{ij} < s_1 \\
-R_{crash}, & \text{if } d_{ij} \le s_2 \\
0, & \text{otherwise}
\end{cases} \tag{1}
$$
where $d_{ij}$ is the Euclidean distance between drones $i$ and $j$, $s_1$ is the safety threshold at which repulsion begins, $s_2$ is the collision distance, and $R_{crash}$ is a large positive constant whose negation imposes a severe collision penalty. The total reward for drone 1, for instance, is:
$$r_1 = r_{single}^1 + r_{danger}^{12} + r_{danger}^{13} + r_{done}^1 + r_{bound}^1 + r_{near}^1 + r_{nearorfar}^1$$
This composite reward function provides the dense, incremental feedback necessary for the MADDPG algorithm to learn an effective drone formation policy.
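The shaping terms can be assembled as in the sketch below (illustrative Python; the constants $c_1$, $c_2$, $s_1$, $s_2$ and the bonus/penalty magnitudes are placeholders, and the proximity and progress terms are passed in as precomputed piecewise constants for brevity).

```python
# Placeholder constants; the paper's actual values are not reproduced here.
C1, C2 = 0.1, 1.0
S1, S2 = 2.0, 0.5                      # repulsion threshold and collision distance (m)
R_DONE, R_BOUND, R_CRASH = 10.0, 5.0, 20.0

def r_single(err):
    """Dense attraction toward the target: -c1 * (|dx| + |dy|)."""
    return -C1 * (abs(err[0]) + abs(err[1]))

def r_danger(d_ij):
    """Potential-field style repulsion between a pair of drones (Eq. (1))."""
    if d_ij <= S2:
        return -R_CRASH
    if d_ij < S1:
        return -C2 * (S1 - d_ij)
    return 0.0

def total_reward(err, dists_to_others, done, out_of_bounds,
                 r_near=0.0, r_progress=0.0):
    """Composite reward for one drone, mirroring the sum defining r_1 above."""
    r = r_single(err) + sum(r_danger(d) for d in dists_to_others)
    r += R_DONE if done else 0.0
    r += -R_BOUND if out_of_bounds else 0.0
    return r + r_near + r_progress
```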
Network Architecture and Parameters
The actor and critic networks are fully connected neural networks. The actor network for each agent has an architecture of [12, 64, 32, 16, 2], where the input layer size of 12 corresponds to the dimension of the joint observation $\mathbf{x}$. The output layer uses a $\tanh$ activation to produce actions in $[-1, 1]$. The centralized critic network for each agent has an architecture of [18, 64, 32, 32, 1], where the input layer size of 18 corresponds to the joint observation (12) plus the joint action (6). Key hyperparameters for training are listed below, followed by a code sketch of these networks.
| Hyperparameter | Value |
|---|---|
| Batch Size | 64 |
| Actor Learning Rate | 0.0005 |
| Critic Learning Rate | 0.001 |
| Discount Factor ($\gamma$) | 0.85 |
| Replay Buffer Size | 15000 |
| Target Network Update Rate ($\tau$) | 0.005 |
| Control Period | 0.3 s |
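Under these settings, the actor and critic multilayer perceptrons can be instantiated as in the sketch below (a PyTorch sketch following the stated layer sizes; ReLU hidden activations are an assumption, since only the actor's tanh output is specified).

```python
import torch.nn as nn

def mlp(sizes, output_activation=None):
    """Fully connected network; hidden layers use ReLU (assumed)."""
    layers = []
    for i in range(len(sizes) - 1):
        layers.append(nn.Linear(sizes[i], sizes[i + 1]))
        if i < len(sizes) - 2:
            layers.append(nn.ReLU())
    if output_activation is not None:
        layers.append(output_activation)
    return nn.Sequential(*layers)

# Actor: [12, 64, 32, 16, 2] with a tanh output so actions lie in [-1, 1].
actor = mlp([12, 64, 32, 16, 2], output_activation=nn.Tanh())

# Centralized critic: [18, 64, 32, 32, 1]; the input concatenates the joint
# observation (12 dims) with the joint action (6 dims).
critic = mlp([18, 64, 32, 32, 1])
```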
Experimental Results and Analysis
We developed a Software-in-the-Loop (SITL) simulation environment integrating Gazebo and the ArduPilot firmware to validate our approach. Three drones start at positions (0,0), (0,-3), and (0,3) with the goal of forming an equilateral triangle centered at the origin with targets at (0,-5), (-4.33,2.5), and (4.33,2.5).
The efficacy of the curriculum learning strategy is clearly demonstrated in the training curves. When the final precise formation task (1 m threshold) was trained directly from random initialization, the algorithm failed to converge even after 1500 episodes, as evidenced by the absence of high-reward episodes. In stark contrast, the curriculum-trained approach showed rapid and stable convergence at each stage: the Stage 1 (3 m) policy converged in about 180 episodes; initialized from it, the Stage 2 (2 m) policy converged in roughly another 180 episodes; and the Stage 3 (1 m) policy converged smoothly after around 200 further episodes. This ablation shows that curriculum learning is essential for solving the challenging multi-agent credit assignment and exploration problem in drone formation.
The learned policy successfully coordinates the drone formation. Trajectory plots show all three drones navigating from their start points to their respective targets, maintaining safe separation throughout the maneuver, and achieving the final triangular configuration. To test robustness, we varied key hyperparameters: the algorithm maintained stable convergence when the discount factor $\gamma$ was set to 0.8, 0.9, or 0.95, and when the learning rates were halved or doubled, indicating that the approach is not overly sensitive to these hyperparameters. Furthermore, we tested generalization by assigning four different sets of random target locations. Starting from the same initial positions, the algorithm successfully learned to achieve the new formations in all cases, demonstrating that the policy captures generalizable coordination and navigation principles rather than memorizing a single path.

The ultimate validation was conducted in real-world flight tests using a fleet of three F450 drones equipped with Pixhawk flight controllers. The trained MADDPG policy was deployed on an onboard Raspberry Pi. The physical drones successfully executed the formation flight, replicating the simulated behavior by taking off, maneuvering to avoid each other, and settling into the prescribed triangular formation. This successful transfer from simulation to reality, known as Sim2Real, validates the practicality and robustness of the proposed curriculum-based MADDPG framework for real-world drone formation control. The potential applications of such reliable, autonomous drone formation technology are vast, ranging from search and rescue and precision agriculture to creating complex aerial light displays.
Conclusion and Future Work
This work presents a successful framework for autonomous drone formation control using the Multi-Agent Deep Deterministic Policy Gradient algorithm augmented with a Curriculum Learning strategy. By decomposing the complex formation task into progressively difficult stages and employing a carefully shaped reward function, we overcome the significant training challenges associated with multi-agent reinforcement learning. Extensive experiments in SITL simulation demonstrated the algorithm’s effectiveness, robustness to hyperparameter changes, and ability to generalize to new formation waypoints. The final real-world flight tests confirmed the policy’s viability outside of simulation, marking a significant step towards deployable multi-UAV systems.
Future work will focus on addressing current limitations and expanding capabilities. The policy can exhibit overshoot near targets due to the continuous action output; integrating a terminal phase with finer, lower-speed control could enhance precision. Scaling the system to larger formations with more than three drones is a critical next step, which may require investigating more scalable multi-agent architectures or hierarchical controllers. Finally, extending the framework to fully 3D formations and dynamic environments with moving obstacles or changing formation patterns will further increase its applicability to real-world scenarios, solidifying the role of deep reinforcement learning in the future of autonomous drone formation technology.
