Robust Cooperative Control for Quadrotor Drones Using Reinforcement Learning

As a researcher in the field of autonomous systems, I have long been fascinated by the challenge of coordinating multiple quadrotor drones. These agile, versatile platforms hold immense promise for applications ranging from cooperative surveillance and search-and-rescue to automated logistics and communication relay networks. However, the practical realization of robust, optimal formation control for a cluster of quadrotor drones remains a formidable problem. Each quadrotor drone is inherently a nonlinear, underactuated system. Its precise dynamical model is often difficult to obtain due to complex aerodynamic effects, manufacturing variances, and unmodeled payload dynamics. Furthermore, during cooperative flight, the cluster is subject to external disturbances such as wind gusts. Traditional model-based control strategies often falter under these conditions, as they rely heavily on accurate system identification. This has motivated my exploration into data-driven methods, particularly reinforcement learning (RL), which promises to learn optimal control policies directly from interaction data, without requiring an explicit model. In this article, I present a comprehensive framework for the model-free, robust optimal formation control of a heterogeneous cluster of quadrotor drones. The core of our approach lies in synthesizing a fully distributed observer with reinforcement learning-based controllers for both position and attitude dynamics, enabling the swarm to achieve and maintain a desired formation despite unknown dynamics and disturbances.

The fundamental control problem for a single quadrotor drone is complex. The vehicle has six degrees of freedom (position and orientation in 3D space) but only four independent control inputs (the thrusts of the four rotors). This underactuation necessitates a hierarchical control structure: a high-level controller generates a desired acceleration vector, which is then translated into a desired attitude and collective thrust; a low-level attitude controller then tracks this desired orientation. When scaling to multiple drones, the communication network and the need for distributed coordination add another layer of complexity. We consider a heterogeneous cluster of \(N\) quadrotor drones. The dynamics of the \(i\)-th quadrotor drone are described in the inertial frame. Let \( \mathbf{p}_i = [x_i, y_i, z_i]^T \in \mathbb{R}^3 \) denote the position vector, and \( \boldsymbol{\Theta}_i = [\phi_i, \theta_i, \psi_i]^T \) denote the attitude vector (roll, pitch, yaw). The nonlinear equations of motion can be summarized as follows:

$$
\begin{aligned}
\dot{\mathbf{p}}_i &= \mathbf{v}_i, \\
m_i \dot{\mathbf{v}}_i &= \mathbf{R}_i \mathbf{F}_i - m_i g \mathbf{e}_3 + \mathbf{d}_{pi}(t), \\
\dot{\boldsymbol{\Theta}}_i &= \mathbf{T}_i \boldsymbol{\omega}_i, \\
\mathbf{J}_i \dot{\boldsymbol{\omega}}_i &= \boldsymbol{\tau}_i - \boldsymbol{\omega}_i \times \mathbf{J}_i \boldsymbol{\omega}_i + \mathbf{d}_{\Theta i}(t).
\end{aligned}
$$

Here, \( m_i \) is the mass, \( \mathbf{J}_i \) is the inertia matrix, \( \mathbf{R}_i \) is the rotation matrix from body to inertial frame, \( \mathbf{F}_i = [0, 0, f_i]^T \) is the thrust vector in the body frame, and \( \boldsymbol{\tau}_i = [\tau_{\phi i}, \tau_{\theta i}, \tau_{\psi i}]^T \) is the torque vector. The terms \( \mathbf{d}_{pi}(t) \) and \( \mathbf{d}_{\Theta i}(t) \) represent bounded external disturbances acting on the translational and rotational dynamics, respectively. The control inputs for each quadrotor drone are typically the total thrust \( f_i \) and the three torques. These are related to the squared rotor speeds \( \Omega_{i,m}^2 \) (\( m=1,\dots,4 \)) through an allocation matrix:

$$
\begin{bmatrix} f_i \\ \tau_{\phi i} \\ \tau_{\theta i} \\ \tau_{\psi i} \end{bmatrix} =
\begin{bmatrix}
k_f & k_f & k_f & k_f \\
0 & -l k_f & 0 & l k_f \\
l k_f & 0 & -l k_f & 0 \\
-k_\tau & k_\tau & -k_\tau & k_\tau
\end{bmatrix}
\begin{bmatrix} \Omega_{i,1}^2 \\ \Omega_{i,2}^2 \\ \Omega_{i,3}^2 \\ \Omega_{i,4}^2 \end{bmatrix},
$$

where \( k_f \), \( k_\tau \), and \( l \) are positive constants related to thrust coefficient, torque coefficient, and arm length, respectively. The underactuated nature is evident from the fact that the desired translational acceleration is achieved by orienting the thrust vector through attitude angles \( \phi_i \) and \( \theta_i \). Therefore, the overall control problem decouples into designing a position controller that outputs a desired attitude and collective thrust, and an attitude controller that tracks this desired attitude.
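To make the allocation concrete, here is a minimal NumPy sketch that inverts this mapping to recover squared rotor speeds from a commanded thrust and torques. The numerical values of \( k_f \), \( k_\tau \), and \( l \) are illustrative placeholders, not parameters of any specific airframe.

```python
import numpy as np

# Illustrative constants (placeholders, not from a specific airframe):
k_f, k_tau, l = 8.5e-6, 1.4e-7, 0.17  # thrust coeff, torque coeff, arm length

# Allocation matrix mapping squared rotor speeds to [f, tau_phi, tau_theta, tau_psi]
M = np.array([
    [k_f,      k_f,      k_f,      k_f],
    [0.0,     -l * k_f,  0.0,      l * k_f],
    [l * k_f,  0.0,     -l * k_f,  0.0],
    [-k_tau,   k_tau,   -k_tau,    k_tau],
])

def mix(f: float, tau: np.ndarray) -> np.ndarray:
    """Invert the allocation to get squared rotor speeds (clipped at zero)."""
    cmd = np.concatenate(([f], tau))
    omega_sq = np.linalg.solve(M, cmd)
    return np.clip(omega_sq, 0.0, None)

# Hover check: pure gravity-compensating thrust splits evenly over four rotors.
omega_sq = mix(9.81 * 1.0, np.zeros(3))
```

Since the allocation matrix is square and invertible for this plus-configuration, a direct solve suffices; real autopilots additionally saturate the commands to the motors' feasible range, which the clipping only hints at.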

For cooperative control, the communication topology among the \(N\) quadrotor drones is modeled as a directed graph \( \mathcal{G} = (\mathcal{V}, \mathcal{E}) \), where \( \mathcal{V} = \{1, 2, \dots, N\} \) is the node set and \( \mathcal{E} \subseteq \mathcal{V} \times \mathcal{V} \) is the edge set. The adjacency matrix \( \mathbf{W} = [w_{ij}] \in \mathbb{R}^{N \times N} \) is defined such that \( w_{ij} > 0 \) if information can flow from drone \( j \) to drone \( i \), and \( w_{ij} = 0 \) otherwise. The Laplacian matrix is \( \mathbf{L} = \mathbf{D} - \mathbf{W} \), where \( \mathbf{D} = \text{diag}(d_1, \dots, d_N) \) with \( d_i = \sum_{j=1}^N w_{ij} \). We assume the graph contains a directed spanning tree, ensuring that information from at least one drone (a leader or root) can propagate to all others.
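These graph quantities are straightforward to compute. A short NumPy sketch, using the four-drone directed cycle that appears later in the simulation study (row \( i \) lists the in-neighbors of drone \( i \)):

```python
import numpy as np

# Adjacency for the 4-drone directed cycle: w_12 = w_23 = w_34 = w_41 = 1.
W = np.zeros((4, 4))
W[0, 1] = W[1, 2] = W[2, 3] = W[3, 0] = 1.0

D = np.diag(W.sum(axis=1))   # in-degree matrix
L = D - W                    # graph Laplacian

# A directed graph contains a directed spanning tree iff its Laplacian has a
# simple zero eigenvalue (all other eigenvalues lie in the open right half plane).
eigvals = np.sort(np.abs(np.linalg.eigvals(L)))
has_spanning_tree = np.isclose(eigvals[0], 0.0) and eigvals[1] > 1e-9
```

For this cycle the graph is in fact strongly connected, which is stronger than containing a spanning tree; the eigenvalue test above covers the general directed case.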

The formation control objective is for the cluster of quadrotor drones to track a desired geometric pattern specified by relative offsets \( \boldsymbol{\delta}_i \in \mathbb{R}^3 \) while following a reference trajectory generated by a virtual leader. The virtual leader’s dynamics are described by a linear system:

$$
\dot{\mathbf{x}}_0 = \mathbf{A}_0 \mathbf{x}_0, \quad \mathbf{y}_0 = \mathbf{C}_0 \mathbf{x}_0,
$$

where \( \mathbf{y}_0(t) \in \mathbb{R}^3 \) is the leader’s position output. The desired reference position for the \(i\)-th quadrotor drone is \( \mathbf{p}_{ri}(t) = \mathbf{y}_0(t) + \boldsymbol{\delta}_i \). However, in a fully distributed setting, the global signal \( \mathbf{y}_0(t) \) is not available to all drones. Each drone must estimate its own reference signal using only local information exchanged with its neighbors. This leads to the first major component of our framework: the design of a fully distributed observer.

The distributed observer for the \(i\)-th quadrotor drone is designed as follows. Let \( \hat{\mathbf{x}}_{0i} \) be the estimate of the leader’s state \( \mathbf{x}_0 \), and \( \hat{\mathbf{p}}_{ri} \) be the estimate of the reference position \( \mathbf{p}_{ri} \). We propose the following dynamics:

$$
\begin{aligned}
\dot{\hat{\mathbf{x}}}_{0i} &= \mathbf{A}_0 \hat{\mathbf{x}}_{0i} + \eta \mathbf{K}_0 \sum_{j \in \mathcal{N}_i} w_{ij} (\hat{\mathbf{x}}_{0j} - \hat{\mathbf{x}}_{0i}), \\
\hat{\mathbf{p}}_{ri} &= \mathbf{C}_0 \hat{\mathbf{x}}_{0i} + \boldsymbol{\delta}_i,
\end{aligned}
$$

where \( \eta > 0 \) is a coupling gain, \( \mathbf{K}_0 \) is a design matrix chosen such that \( \mathbf{A}_0 - \lambda_i \mathbf{K}_0 \) is Hurwitz for every nonzero eigenvalue \( \lambda_i \) of the Laplacian matrix associated with the communication graph (the zero eigenvalue is excluded, since \( \mathbf{A}_0 \) itself need not be Hurwitz), and \( \mathcal{N}_i \) is the set of neighbors of drone \( i \). Under the assumption that the communication graph contains a spanning tree and the gain \( \eta \) is sufficiently large, it can be proven that the estimation errors \( \hat{\mathbf{x}}_{0i}(t) - \mathbf{x}_0(t) \) and \( \hat{\mathbf{p}}_{ri}(t) - \mathbf{p}_{ri}(t) \) converge to zero exponentially for all \( i \). This observer enables each quadrotor drone to reconstruct its required reference signal using only local neighbor exchanges, a critical step for scalable, distributed formation control.
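As a sanity check, the observer dynamics can be simulated directly. The sketch below additionally assumes that one drone (the root of the spanning tree) receives the leader state through a pinning term, a common ingredient in such observers that makes the estimates converge to \( \mathbf{x}_0 \) itself rather than merely to consensus; the gains, graph, and durations are illustrative.

```python
import numpy as np

# Leader dynamics: a harmonic oscillator whose first state is the sinusoidal
# position reference (one axis shown; the 3-D case stacks copies).
A0 = np.array([[0.0, 1.0], [-1.0, 0.0]])

# Directed cycle w_12 = w_23 = w_34 = w_41 = 1 from the simulation study.
W = np.zeros((4, 4))
W[0, 1] = W[1, 2] = W[2, 3] = W[3, 0] = 1.0
deg = W.sum(axis=1)

eta = 2.0                  # coupling gain
K0 = np.eye(2)             # K0 = I suffices for this graph (illustrative choice)
# Assumption: drone 1 additionally receives the leader state (pinning term).
pin = np.array([1.0, 0.0, 0.0, 0.0])

dt, steps = 0.002, 15_000  # 30 s of forward-Euler integration
x0 = np.array([1.0, 0.0])  # true leader state
xh = np.random.default_rng(0).normal(size=(4, 2))  # initial estimates

for _ in range(steps):
    consensus = W @ xh - deg[:, None] * xh   # sum_j w_ij (xh_j - xh_i) per drone
    pinning = pin[:, None] * (x0 - xh)       # leader injection at the root only
    xh = xh + dt * (xh @ A0.T + eta * (consensus + pinning) @ K0.T)
    x0 = x0 + dt * (A0 @ x0)

final_error = np.abs(xh - x0).max()          # estimation error after 30 s
```

All four estimates lock onto the oscillating leader state even though three of the drones never see it directly, which is the essential point of the distributed design.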

With a reliable estimate \( \hat{\mathbf{p}}_{ri} \) available, the next step is to design the flight controllers. The overarching goal is to synthesize a control policy \( \mathbf{u}_i = [f_i, \boldsymbol{\tau}_i^T]^T \) that minimizes a certain performance index while ensuring robustness against disturbances and model uncertainties. We break this down into two sequential sub-problems: robust optimal position control and robust optimal attitude control, both solved via reinforcement learning.

For the position controller, we define the position tracking error for the \(i\)-th quadrotor drone as \( \mathbf{e}_{pi} = \mathbf{p}_i - \hat{\mathbf{p}}_{ri} \). Considering the translational dynamics and treating the attitude subsystem as a virtual actuator, we can define a virtual control input \( \mathbf{u}_{pi} = \mathbf{R}_i \mathbf{F}_i / m_i \). The translational error dynamics become:

$$
\dot{\mathbf{e}}_{pi} = \dot{\mathbf{p}}_i - \dot{\hat{\mathbf{p}}}_{ri} = \mathbf{v}_i - \dot{\hat{\mathbf{p}}}_{ri}, \quad \dot{\mathbf{v}}_i = \mathbf{u}_{pi} - g\mathbf{e}_3 + \frac{1}{m_i}\mathbf{d}_{pi}(t).
$$

We can combine these into an extended state system. Let \( \boldsymbol{\xi}_{pi} = [\mathbf{e}_{pi}^T, \mathbf{v}_i^T]^T \). The dynamics are:

$$
\dot{\boldsymbol{\xi}}_{pi} = \mathbf{A}_{pi} \boldsymbol{\xi}_{pi} + \mathbf{B}_{pi} \mathbf{u}_{pi} + \mathbf{B}_{di} \mathbf{w}_{pi},
$$

where \( \mathbf{w}_{pi} \) encapsulates the disturbance \( \mathbf{d}_{pi}/m_i \) and terms involving \( \dot{\hat{\mathbf{p}}}_{ri} \). The matrices \( \mathbf{A}_{pi}, \mathbf{B}_{pi}, \mathbf{B}_{di} \) are known from the system structure. The objective is to find a control policy \( \mathbf{u}_{pi} = \boldsymbol{\pi}_p(\boldsymbol{\xi}_{pi}) \) that minimizes a cost function while attenuating the effect of \( \mathbf{w}_{pi} \). This is an \( \mathcal{H}_\infty \) optimal control problem. We define the cost function as:
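Under this formulation the translational error system is a disturbed double integrator. A brief sketch of the matrices, taking the gravity offset as absorbed into the virtual input, the reference-velocity terms folded into \( \mathbf{w}_{pi} \), and the disturbance acting through the velocity channel for simplicity (all of which are simplifying assumptions for illustration):

```python
import numpy as np

I3, Z3 = np.eye(3), np.zeros((3, 3))

# Double-integrator structure of the translational error dynamics:
# xi = [e_p; v], with d/dt e_p = v and d/dt v = u_p + w
# (gravity compensated inside u_p, reference terms folded into w).
A_p = np.block([[Z3, I3], [Z3, Z3]])
B_p = np.vstack([Z3, I3])
B_d = np.vstack([Z3, I3])

# Controllability check: for the double integrator, [B, AB] already spans R^6.
ctrb = np.hstack([B_p, A_p @ B_p])
rank = np.linalg.matrix_rank(ctrb)
```

Full controllability of this pair is what guarantees that the \( \mathcal{H}_\infty \) game below admits a stabilizing solution for sufficiently large \( \gamma_p \).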

$$
J_p(\boldsymbol{\xi}_{pi}(0), \mathbf{u}_{pi}, \mathbf{w}_{pi}) = \int_0^\infty \left( \boldsymbol{\xi}_{pi}^T \mathbf{Q}_p \boldsymbol{\xi}_{pi} + \mathbf{u}_{pi}^T \mathbf{R}_p \mathbf{u}_{pi} - \gamma_p^2 \mathbf{w}_{pi}^T \mathbf{w}_{pi} \right) dt,
$$

where \( \mathbf{Q}_p \succeq 0 \), \( \mathbf{R}_p \succ 0 \), and \( \gamma_p > 0 \) is a disturbance attenuation level. The optimal policy \( \mathbf{u}_{pi}^* \) and the worst-case disturbance \( \mathbf{w}_{pi}^* \) are given by the solution of the Hamilton-Jacobi-Isaacs (HJI) equation:

$$
0 = \min_{\mathbf{u}_{pi}} \max_{\mathbf{w}_{pi}} \left[ \boldsymbol{\xi}_{pi}^T \mathbf{Q}_p \boldsymbol{\xi}_{pi} + \mathbf{u}_{pi}^T \mathbf{R}_p \mathbf{u}_{pi} - \gamma_p^2 \mathbf{w}_{pi}^T \mathbf{w}_{pi} + \frac{\partial V_p^*}{\partial \boldsymbol{\xi}_{pi}} \left( \mathbf{A}_{pi} \boldsymbol{\xi}_{pi} + \mathbf{B}_{pi} \mathbf{u}_{pi} + \mathbf{B}_{di} \mathbf{w}_{pi} \right) \right],
$$

where \( V_p^*(\boldsymbol{\xi}_{pi}) \) is the optimal value function. Solving this nonlinear partial differential equation analytically is intractable, especially without an accurate model. This is where reinforcement learning comes in. We employ an off-policy integral reinforcement learning (IRL) algorithm to learn the solution online using only data. The key is to use a Bellman equation formulation that does not require knowledge of the system drift dynamics \( \mathbf{A}_{pi} \boldsymbol{\xi}_{pi} \).

We collect data tuples \( (\boldsymbol{\xi}_{pi}, \mathbf{u}_{pi}, \dot{\boldsymbol{\xi}}_{pi}) \) or \( (\boldsymbol{\xi}_{pi}, \mathbf{u}_{pi}, \boldsymbol{\xi}_{pi}') \) where \( \boldsymbol{\xi}_{pi}' \) is the state at the next time sample. Using a parameterized approximator (e.g., a neural network) for the value function \( V_p(\boldsymbol{\xi}_{pi}; \mathbf{W}_p) \) and the policies \( \boldsymbol{\pi}_p(\boldsymbol{\xi}_{pi}; \boldsymbol{\Theta}_p) \), \( \boldsymbol{\pi}_{wp}(\boldsymbol{\xi}_{pi}; \boldsymbol{\Theta}_{wp}) \) for control and disturbance, we can solve the following Bellman error minimization problem:

$$
\delta_{pi} = V_p(\boldsymbol{\xi}_{pi}(t)) - V_p(\boldsymbol{\xi}_{pi}(t-T)) + \int_{t-T}^t \left( \boldsymbol{\xi}_{pi}^T \mathbf{Q}_p \boldsymbol{\xi}_{pi} + \mathbf{u}_{pi}^T \mathbf{R}_p \mathbf{u}_{pi} - \gamma_p^2 \mathbf{w}_{pi}^T \mathbf{w}_{pi} \right) d\tau,
$$

where \( T \) is a fixed time interval. By minimizing the squared Bellman error \( \delta_{pi}^2 \) over the parameters using data, the RL algorithm converges to approximations of the optimal value function and policies. The learned virtual control \( \mathbf{u}_{pi}^* \) is then decomposed to obtain the desired total thrust \( f_{i,des} \) and the desired attitude angles \( \phi_{i,des}, \theta_{i,des} \). Specifically, from \( \mathbf{u}_{pi}^* = [u_{x}^*, u_{y}^*, u_{z}^*]^T \), we have:

$$
f_{i,des} = m_i \| \mathbf{u}_{pi}^* + g\mathbf{e}_3 \|, \quad \phi_{i,des} = \arcsin\left( \frac{m_i (u_{x}^* \sin\psi_{i,des} - u_{y}^* \cos\psi_{i,des})}{f_{i,des}} \right), \quad \theta_{i,des} = \arctan\left( \frac{u_{x}^* \cos\psi_{i,des} + u_{y}^* \sin\psi_{i,des}}{u_{z}^* + g} \right),
$$

where \( \psi_{i,des} \) is a separately specified desired yaw angle, often set to a constant or a slowly varying command.
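This decomposition is a small, self-contained computation. A sketch in NumPy, using `arctan2` in place of `arctan` for quadrant robustness (a standard implementation choice, not part of the formulas above):

```python
import numpy as np

def decompose(u_p: np.ndarray, m: float, psi_des: float, g: float = 9.81):
    """Map the learned virtual acceleration u_p* = [ux, uy, uz] to the total
    thrust and the desired roll/pitch angles for a given desired yaw."""
    ux, uy, uz = u_p
    f_des = m * np.linalg.norm(u_p + np.array([0.0, 0.0, g]))
    phi_des = np.arcsin(m * (ux * np.sin(psi_des) - uy * np.cos(psi_des)) / f_des)
    theta_des = np.arctan2(ux * np.cos(psi_des) + uy * np.sin(psi_des), uz + g)
    return f_des, phi_des, theta_des

# Hover request (zero commanded acceleration): gravity-compensating thrust
# and a level attitude.
f, phi, th = decompose(np.zeros(3), m=1.0, psi_des=0.0)
```

Note that the mapping is singular when \( f_{i,des} = 0 \), so implementations typically bound the commanded acceleration away from free fall.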

The second major component is the robust optimal attitude controller. Given the desired attitude \( \boldsymbol{\Theta}_{i,des} = [\phi_{i,des}, \theta_{i,des}, \psi_{i,des}]^T \) and its derivatives from the position controller, the attitude control objective is to track this reference robustly. Define the attitude tracking error \( \mathbf{e}_{\Theta i} = \boldsymbol{\Theta}_i - \boldsymbol{\Theta}_{i,des} \). The rotational dynamics are highly nonlinear. We consider an error dynamics system similar to the position case. Let \( \boldsymbol{\xi}_{\Theta i} = [\mathbf{e}_{\Theta i}^T, \boldsymbol{\omega}_i^T]^T \). The dynamics can be written as:

$$
\dot{\boldsymbol{\xi}}_{\Theta i} = \mathbf{f}_{\Theta i}(\boldsymbol{\xi}_{\Theta i}) + \mathbf{B}_{\Theta i} \boldsymbol{\tau}_i + \mathbf{B}_{d\Theta i} \mathbf{w}_{\Theta i},
$$

where \( \mathbf{f}_{\Theta i}(\cdot) \) contains the nonlinear Coriolis terms and coupling with the desired trajectory, and \( \mathbf{w}_{\Theta i} \) represents disturbances and model uncertainties. We define a similar \( \mathcal{H}_\infty \) optimal control problem with cost function:

$$
J_\Theta(\boldsymbol{\xi}_{\Theta i}(0), \boldsymbol{\tau}_i, \mathbf{w}_{\Theta i}) = \int_0^\infty \left( \boldsymbol{\xi}_{\Theta i}^T \mathbf{Q}_\Theta \boldsymbol{\xi}_{\Theta i} + \boldsymbol{\tau}_i^T \mathbf{R}_\Theta \boldsymbol{\tau}_i - \gamma_\Theta^2 \mathbf{w}_{\Theta i}^T \mathbf{w}_{\Theta i} \right) dt.
$$

Again, the associated HJI equation is solved online using an off-policy reinforcement learning algorithm. The algorithm learns an approximation of the optimal value function \( V_\Theta^*(\boldsymbol{\xi}_{\Theta i}) \) and the optimal torque policy \( \boldsymbol{\tau}_i^* = \boldsymbol{\pi}_\Theta(\boldsymbol{\xi}_{\Theta i}) \). This policy directly outputs the three control torques required to track the desired attitude robustly.

The synergy between the distributed observer, the RL-based position controller, and the RL-based attitude controller forms our complete solution for the robust cooperative control of quadrotor drones. A key advantage is that the reinforcement learning components do not require prior knowledge of the drone’s mass \( m_i \), inertia \( \mathbf{J}_i \), or the exact form of nonlinearities and disturbances. They learn from operational data in real-time or during a training phase. The distributed observer ensures that the formation pattern is maintained using only local communication, enhancing scalability and robustness to network changes.

To summarize the overall architecture, the following table outlines the main components and their functions for a single quadrotor drone in the cluster:

| Component | Inputs | Outputs | Function | Key Feature |
| --- | --- | --- | --- | --- |
| Distributed Observer | Neighbors’ state estimates \( \hat{\mathbf{x}}_{0j} \) | Local reference position estimate \( \hat{\mathbf{p}}_{ri} \) | Reconstructs global reference signal from local data | Fully distributed, requires only neighbor communication |
| RL Position Controller | Position \( \mathbf{p}_i \), velocity \( \mathbf{v}_i \), reference \( \hat{\mathbf{p}}_{ri} \) | Desired thrust \( f_{i,des} \), desired attitude \( \phi_{i,des}, \theta_{i,des} \) | Minimizes position tracking cost with disturbance attenuation | Model-free, learns optimal policy online |
| RL Attitude Controller | Attitude \( \boldsymbol{\Theta}_i \), angular rate \( \boldsymbol{\omega}_i \), desired attitude \( \boldsymbol{\Theta}_{i,des} \) | Control torques \( \tau_{\phi i}, \tau_{\theta i}, \tau_{\psi i} \) | Minimizes attitude tracking cost with disturbance attenuation | Model-free, handles nonlinear dynamics |
| Actuator Allocation | Thrust \( f_i \), torques \( \boldsymbol{\tau}_i \) | Squared rotor speeds \( \Omega_{i,1}^2, \dots, \Omega_{i,4}^2 \) | Converts high-level commands to individual motor commands | Standard quadratic mapping |

The reinforcement learning algorithms used for both position and attitude control follow a similar iterative structure. We can formalize the general steps for the attitude controller as an example:

Off-policy IRL Algorithm for Attitude Control

  1. Initialization: Start with an initial admissible control policy \( \boldsymbol{\pi}_\Theta^{(0)} \) and disturbance policy \( \boldsymbol{\pi}_{w\Theta}^{(0)} \). Set exploration noise to ensure persistence of excitation.
  2. Data Collection: Apply a behavior policy (e.g., \( \boldsymbol{\pi}_\Theta^{(k)} + \text{noise} \)) to the quadrotor drone and collect state-input trajectories \( \{ \boldsymbol{\xi}_{\Theta i}(t), \boldsymbol{\tau}_i(t), \boldsymbol{\xi}_{\Theta i}(t+T) \} \) over time intervals of length \( T \).
  3. Policy Evaluation (Value Function Approximation): For the current policies \( \boldsymbol{\pi}_\Theta^{(k)}, \boldsymbol{\pi}_{w\Theta}^{(k)} \), solve for the parameters \( \mathbf{W}_\Theta^{(k)} \) of the value function approximator \( V_\Theta(\boldsymbol{\xi}_{\Theta i}; \mathbf{W}_\Theta) \) by minimizing the Bellman error:
    $$ \sum_{s} \left[ V_\Theta(\boldsymbol{\xi}_{\Theta i}(t_s); \mathbf{W}_\Theta) - V_\Theta(\boldsymbol{\xi}_{\Theta i}(t_s+T); \mathbf{W}_\Theta) + \int_{t_s}^{t_s+T} l(\boldsymbol{\xi}_{\Theta i}, \boldsymbol{\pi}_\Theta^{(k)}, \boldsymbol{\pi}_{w\Theta}^{(k)}) d\tau \right]^2, $$
    where \( l(\cdot) = \boldsymbol{\xi}_{\Theta i}^T \mathbf{Q}_\Theta \boldsymbol{\xi}_{\Theta i} + (\boldsymbol{\pi}_\Theta^{(k)})^T \mathbf{R}_\Theta \boldsymbol{\pi}_\Theta^{(k)} - \gamma_\Theta^2 (\boldsymbol{\pi}_{w\Theta}^{(k)})^T \boldsymbol{\pi}_{w\Theta}^{(k)} \).
  4. Policy Improvement: Update the control and disturbance policies using the gradient of the approximated value function:
    $$ \boldsymbol{\pi}_\Theta^{(k+1)} = -\frac{1}{2} \mathbf{R}_\Theta^{-1} \mathbf{B}_{\Theta i}^T \frac{\partial V_\Theta(\boldsymbol{\xi}_{\Theta i}; \mathbf{W}_\Theta^{(k)})}{\partial \boldsymbol{\xi}_{\Theta i}}, \quad \boldsymbol{\pi}_{w\Theta}^{(k+1)} = \frac{1}{2 \gamma_\Theta^2} \mathbf{B}_{d\Theta i}^T \frac{\partial V_\Theta(\boldsymbol{\xi}_{\Theta i}; \mathbf{W}_\Theta^{(k)})}{\partial \boldsymbol{\xi}_{\Theta i}}. $$
    In practice, these policies are also approximated by parameterized networks whose parameters are updated.
  5. Iteration: Set \( k \leftarrow k+1 \) and repeat from step 2 until convergence (i.e., when the policy update becomes negligible).

An analogous procedure is used for the position controller. The convergence of such off-policy IRL algorithms to the solution of the HJI equation has been established in the literature under conditions of persistence of excitation and proper choice of function approximators.
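The mechanics of this evaluate-then-improve loop can be seen on a deliberately tiny testbed. The sketch below runs integral policy iteration on a scalar linear system with a quadratic value \( V(x) = p x^2 \); the disturbance (maximizing) player is omitted for brevity, which reduces the HJI to an HJB, and all numerical values are illustrative.

```python
import numpy as np

# Scalar testbed dx/dt = a*x + b*u with running cost q*x^2 + r*u^2.
a, b, q, r = 1.0, 1.0, 1.0, 1.0
dt, T, horizon = 1e-4, 0.1, 3.0
k = 2.0                      # initial admissible gain (k > a/b stabilizes)

for _ in range(8):           # policy iteration: evaluate, then improve
    D, C = [], []
    for x0 in (1.0, -0.5, 2.0):
        x, cost = x0, 0.0
        xs, cs = [x0], [0.0]
        for step in range(int(horizon / dt)):
            u = -k * x
            cost += (q * x * x + r * u * u) * dt
            x += (a * x + b * u) * dt
            if (step + 1) % int(T / dt) == 0:   # segment boundary every T
                xs.append(x); cs.append(cost)
        for s in range(len(xs) - 1):
            # Integral Bellman identity, linear in the unknown p:
            # p * (x_s^2 - x_{s+T}^2) = cost accrued over [t_s, t_s + T]
            D.append(xs[s] ** 2 - xs[s + 1] ** 2)
            C.append(cs[s + 1] - cs[s])
    p = float(np.dot(D, C) / np.dot(D, D))   # scalar least-squares fit of p
    k = p * b / r                            # policy improvement u = -(b p / r) x

# For these numbers the Riccati equation 2*a*p + q - (b*p)^2 / r = 0 gives
# p = 1 + sqrt(2), so the gain k should settle near 2.4142.
```

Only trajectory data enter the fit; the drift coefficient \( a \) never appears in the learning update, which is the model-free property the article relies on.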

The performance of the proposed framework was validated through extensive numerical simulations. We considered a heterogeneous cluster of four quadrotor drones. The key system parameters for each quadrotor drone were different, emphasizing the heterogeneity:

| Drone Index | Mass \( m_i \) (kg) | Inertia Matrix \( \mathbf{J}_i \) (10⁻³ kg·m²) |
| --- | --- | --- |
| 1 | 1.0 | diag(4.8, 4.8, 8.4) |
| 2 | 1.5 | diag(4.9, 4.9, 10.0) |
| 3 | 1.7 | diag(5.0, 5.0, 10.1) |
| 4 | 1.8 | diag(5.1, 5.1, 10.2) |

The communication graph was directed with a spanning tree, defined by adjacency weights \( w_{12}=1, w_{23}=1, w_{34}=1, w_{41}=1 \). The virtual leader generated a sinusoidal trajectory. The desired formation was a square pattern with specific offsets. External disturbances were injected into both translational and rotational dynamics of each quadrotor drone, modeled as time-varying signals: \( \mathbf{d}_{pi}(t) = 0.5[\sin(0.5t), \cos(0.3t), 0.2\sin(t)]^T \) N and \( \mathbf{d}_{\Theta i}(t) = 0.01[\sin(0.4t), \cos(0.5t), \sin(0.3t)]^T \) N·m.

For the RL controllers, the cost function parameters were chosen as \( \mathbf{Q}_p = \text{diag}(10,10,10,1,1,1) \), \( \mathbf{R}_p = 0.1\mathbf{I}_3 \), \( \gamma_p = 5 \) for position, and \( \mathbf{Q}_\Theta = \text{diag}(50,50,50,5,5,5) \), \( \mathbf{R}_\Theta = 0.01\mathbf{I}_3 \), \( \gamma_\Theta = 2 \) for attitude. The learning algorithms used neural networks with two hidden layers as function approximators. Data was collected in an initial training phase where the drones executed random maneuvers with exploration noise. After the learning phase, the trained policies were deployed for formation flight.
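To show what a "two hidden layer" value approximator and a Bellman-error update look like concretely, here is a minimal NumPy sketch. The layer widths, the mocked transition batch, and the finite-difference gradient step are all illustrative stand-ins (real implementations use backpropagation and actual rollout data).

```python
import numpy as np

rng = np.random.default_rng(1)

# Two-hidden-layer value network V(xi; W) with tanh activations.
sizes = [6, 16, 16, 1]
Ws = [rng.normal(scale=0.3, size=(m, n)) for m, n in zip(sizes[:-1], sizes[1:])]

def value(x, weights):
    h = x
    for W in weights[:-1]:
        h = np.tanh(h @ W)
    return (h @ weights[-1]).ravel()

# Mocked batch of (state, next-state, integrated running cost) tuples.
xi_t = rng.normal(size=(64, 6))
xi_next = 0.9 * xi_t
cost = 0.05 * (xi_t ** 2).sum(axis=1)

def bellman_loss(weights):
    delta = value(xi_t, weights) - value(xi_next, weights) + cost
    return 0.5 * float((delta ** 2).mean())

def flatten(ws):
    return np.concatenate([W.ravel() for W in ws])

def unflatten(vec, like):
    out, i = [], 0
    for W in like:
        out.append(vec[i:i + W.size].reshape(W.shape)); i += W.size
    return out

# One finite-difference gradient descent step on the squared Bellman residual.
theta, eps, lr = flatten(Ws), 1e-6, 0.1
base = bellman_loss(Ws)
grad = np.empty_like(theta)
for j in range(theta.size):
    tp = theta.copy(); tp[j] += eps
    grad[j] = (bellman_loss(unflatten(tp, Ws)) - base) / eps
Ws_new = unflatten(theta - lr * grad, Ws)
```

A single descent step already reduces the residual on the batch; the training phase in the simulations repeats such updates over freshly collected exploration data until the policies converge.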

The simulation results demonstrated the effectiveness of the proposed approach. All four quadrotor drones successfully achieved and maintained the desired square formation while accurately tracking the leader’s trajectory. The position tracking errors for all drones converged to small neighborhoods around zero despite the heterogeneous parameters and external disturbances. The attitude tracking errors also remained bounded and small, confirming that the RL-based attitude controller effectively managed the nonlinear rotational dynamics. The distributed observer ensured that each quadrotor drone correctly estimated its reference position using only information from its designated neighbor, validating the fully distributed nature of the solution.

The robustness of the control system is a critical aspect. By formulating the problem as a zero-sum game and solving the \( \mathcal{H}_\infty \) optimal control problem via reinforcement learning, the learned policies inherently provide a level of disturbance attenuation. This means that the performance degradation due to external disturbances like wind is minimized within the specified \( \gamma \) bound. For a quadrotor drone operating in real-world conditions, this is a vital feature. Furthermore, the model-free nature of the learning algorithm means that the same controller structure can be applied to different quadrotor drones with varying mass and inertia properties without manual retuning; the RL algorithm automatically adapts its policy to the specific dynamics of each platform during learning.

In conclusion, the integration of distributed estimation with model-free reinforcement learning offers a powerful and flexible framework for the cooperative control of quadrotor drone swarms. The methodology addresses several key challenges simultaneously: uncertainty in system dynamics, external disturbances, underactuation, and the need for distributed coordination. The core innovations include the fully distributed observer that eliminates the need for global reference information, and the twin RL-based optimal controllers that learn robust control policies directly from data, without requiring an explicit mathematical model of the complex quadrotor drone dynamics. This approach paves the way for more autonomous, scalable, and resilient multi-drone systems capable of performing sophisticated cooperative tasks in unpredictable environments. Future work will focus on extending the framework to handle dynamic communication topologies, incorporating safety constraints directly into the learning process, and conducting real-world flight experiments with a physical cluster of quadrotor drones.

The mathematical formulations and algorithms presented here provide a solid foundation. The key equations governing the dynamics of a single quadrotor drone are essential for understanding the control challenge. For clarity, the translational dynamics in terms of the virtual control input are:

$$
m_i \dot{\mathbf{v}}_i = m_i (\mathbf{u}_{pi} - g\mathbf{e}_3) + \mathbf{d}_{pi}.
$$

The rotation matrix \( \mathbf{R}_i \) as a function of Euler angles is:

$$
\mathbf{R}_i =
\begin{bmatrix}
c\theta c\psi & s\phi s\theta c\psi - c\phi s\psi & c\phi s\theta c\psi + s\phi s\psi \\
c\theta s\psi & s\phi s\theta s\psi + c\phi c\psi & c\phi s\theta s\psi - s\phi c\psi \\
-s\theta & s\phi c\theta & c\phi c\theta
\end{bmatrix},
$$

where \( s\cdot \) and \( c\cdot \) denote sine and cosine. The transformation matrix \( \mathbf{T}_i \) relating angular rates to Euler angle derivatives is:

$$
\mathbf{T}_i =
\begin{bmatrix}
1 & s\phi t\theta & c\phi t\theta \\
0 & c\phi & -s\phi \\
0 & s\phi/c\theta & c\phi/c\theta
\end{bmatrix},
$$

where \( t\cdot \) denotes tangent. These nonlinear relationships are handled implicitly by the reinforcement learning controllers.
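Both matrices translate directly into code. A small NumPy sketch of \( \mathbf{R}_i \) and \( \mathbf{T}_i \) as written above:

```python
import numpy as np

def rotation(phi: float, theta: float, psi: float) -> np.ndarray:
    """Body-to-inertial rotation R_i (ZYX Euler convention, as above)."""
    sf, cf = np.sin(phi), np.cos(phi)
    st, ct = np.sin(theta), np.cos(theta)
    sp, cp = np.sin(psi), np.cos(psi)
    return np.array([
        [ct * cp, sf * st * cp - cf * sp, cf * st * cp + sf * sp],
        [ct * sp, sf * st * sp + cf * cp, cf * st * sp - sf * cp],
        [-st,     sf * ct,                cf * ct],
    ])

def euler_rate_map(phi: float, theta: float) -> np.ndarray:
    """T_i mapping body rates omega to Euler-angle rates.

    Singular at theta = +/- pi/2 (the familiar gimbal-lock configuration),
    which is one reason aggressive flight often uses quaternions instead.
    """
    sf, cf = np.sin(phi), np.cos(phi)
    ct, tt = np.cos(theta), np.tan(theta)
    return np.array([
        [1.0, sf * tt, cf * tt],
        [0.0, cf,      -sf],
        [0.0, sf / ct, cf / ct],
    ])

R = rotation(0.1, -0.2, 0.3)   # example attitude: small roll/pitch/yaw
```

The orthonormality of \( \mathbf{R}_i \) (with unit determinant) is a useful invariant to assert in any simulation loop that integrates attitude.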

Finally, the Bellman equation central to the off-policy IRL algorithm, written in a more general form for a system \( \dot{\mathbf{x}} = \mathbf{f}(\mathbf{x}) + \mathbf{B}_u \mathbf{u} + \mathbf{B}_w \mathbf{w} \), is:

$$
V(\mathbf{x}(t)) - V(\mathbf{x}(t-T)) = -\int_{t-T}^t \left( \mathbf{x}^T \mathbf{Q} \mathbf{x} + \mathbf{u}^T \mathbf{R} \mathbf{u} - \gamma^2 \mathbf{w}^T \mathbf{w} \right) d\tau + \int_{t-T}^t \frac{\partial V}{\partial \mathbf{x}} (\mathbf{B}_u \mathbf{u} + \mathbf{B}_w \mathbf{w}) d\tau.
$$

By parameterizing \( V \), \( \mathbf{u} \), and \( \mathbf{w} \), and using data for \( \mathbf{x} \) and the integrals, this equation becomes linear in the parameters, enabling efficient least-squares-based learning. This elegant formulation is what allows each quadrotor drone in the cluster to learn its own optimal, robust controller in a truly model-free fashion, bringing us closer to the vision of intelligent, autonomous swarms of quadrotor drones.
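The linear-in-parameters structure is easy to verify numerically. The sketch below drops the control and disturbance channels for brevity and evaluates \( V(\mathbf{x}) = \mathbf{x}^T \mathbf{P} \mathbf{x} \) for a known stable two-state drift by least squares over the integral Bellman identity, comparing against the Lyapunov solution of \( \mathbf{A}^T \mathbf{P} + \mathbf{P} \mathbf{A} + \mathbf{Q} = 0 \); all numbers are illustrative.

```python
import numpy as np

A = np.array([[0.0, 1.0], [-2.0, -3.0]])   # stable drift (eigenvalues -1, -2)
Q = np.eye(2)
dt, T, horizon = 1e-4, 0.2, 4.0

# Quadratic monomial basis: V(x) = theta . [x1^2, x1*x2, x2^2].
basis = lambda x: np.array([x[0] ** 2, x[0] * x[1], x[1] ** 2])

rows, targets = [], []
for x0 in ([1.0, 0.0], [0.0, 1.0], [1.0, -1.0]):
    x, cost = np.array(x0), 0.0
    prev_phi, prev_cost = basis(x), 0.0
    for step in range(int(horizon / dt)):
        cost += float(x @ Q @ x) * dt       # accumulate the running cost
        x = x + dt * (A @ x)                # forward-Euler state update
        if (step + 1) % int(T / dt) == 0:   # close a Bellman segment
            # theta . (phi(x_s) - phi(x_{s+T})) = cost accrued on the segment
            rows.append(prev_phi - basis(x))
            targets.append(cost - prev_cost)
            prev_phi, prev_cost = basis(x), cost

theta = np.linalg.lstsq(np.array(rows), np.array(targets), rcond=None)[0]
# Here theta = [P11, 2*P12, P22]; the Lyapunov solution for this A and Q is
# P = [[1.25, 0.25], [0.25, 0.25]], i.e. theta near [1.25, 0.5, 0.25].
```

Every regressor row is built purely from sampled states and integrated costs, so the same least-squares machinery extends to the full equation with \( \mathbf{u} \) and \( \mathbf{w} \) by appending the known-input correction integrals as additional linear terms.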
