Data-Driven Adaptive Dynamic Programming for Optimal Drone Formation Control

The coordinated flight of multiple unmanned aerial vehicles (UAVs), or drone formation, represents a transformative technology with significant application potential. From precision agriculture and infrastructure inspection to complex maritime and aerial operations, the ability to control a fleet of drones as a cohesive unit dramatically expands operational capabilities. In maritime contexts specifically, drone formation enables superior situational awareness through distributed sensing, efficient area coverage for search and rescue, and the execution of coordinated tactical maneuvers, thereby enhancing the effectiveness and safety of naval operations. However, the practical realization of robust and optimal drone formation control is hindered by fundamental challenges inherent in the systems themselves.

The core difficulty lies in the complex, nonlinear, and highly coupled dynamics of individual UAVs, particularly quadrotors. While standard modeling and linearization techniques are necessary for controller design, they often fail to capture the full spectrum of uncertainties present in real-world scenarios. These uncertainties include parameter variations (e.g., changes in mass and inertia due to payload), unmodeled aerodynamic effects, and internal disturbances. Designing a high-performance controller based on an imprecise model can lead to degraded performance or even instability in the drone formation. To address this critical issue of model inaccuracy, this article proposes a novel data-driven control design methodology. We move away from a reliance on a precise analytical model and instead leverage real-time input and state data from the drones to learn and implement an optimal control strategy directly.

Our approach is built upon the integration of three key concepts: the leader-follower framework for defining the drone formation structure, Linear Quadratic Regulator (LQR) theory for optimality, and Adaptive Dynamic Programming (ADP) as a data-driven reinforcement learning technique. We begin by decomposing the nonlinear quadrotor dynamics into four linear, decoupled subsystems using a virtual leader-follower reference model. For each subsystem, the optimal drone formation controller is defined as the solution to a classic algebraic Riccati equation (ARE). The principal innovation is that we solve for this optimal controller without explicit knowledge of the system matrices. We develop an iterative, data-driven policy iteration algorithm that uses only measured state and input trajectories to converge to the optimal feedback gain. The stability of the resulting closed-loop drone formation system is rigorously proven using Lyapunov theory. Finally, comprehensive numerical simulations demonstrate the effectiveness of our method in achieving and maintaining a desired drone formation geometry despite the presence of parameter perturbations.

1. Quadrotor Dynamics and Formation Modeling

The quadrotor UAV is an underactuated system with six degrees of freedom (position and orientation) controlled by four independent motor thrusts. Its nonlinear dynamics are described by the following equations of motion:

$$
\begin{aligned}
\ddot{x} &= \frac{U_1}{m}(\sin\psi\sin\phi + \cos\psi\cos\phi\sin\theta) \\
\ddot{y} &= \frac{U_1}{m}(-\cos\psi\sin\phi + \sin\psi\cos\phi\sin\theta) \\
\ddot{z} &= \frac{U_1}{m}\cos\phi\cos\theta – g \\
\ddot{\phi} &= \dot{\theta}\dot{\psi}\frac{J_y – J_z}{J_x} + \frac{l}{J_x}U_2 \\
\ddot{\theta} &= \dot{\phi}\dot{\psi}\frac{J_z – J_x}{J_y} + \frac{l}{J_y}U_3 \\
\ddot{\psi} &= \dot{\phi}\dot{\theta}\frac{J_x – J_y}{J_z} + \frac{1}{J_z}U_4
\end{aligned}
$$

where $[x, y, z]^T$ is the position in the inertial frame, $[\phi, \theta, \psi]^T$ are the roll, pitch, and yaw Euler angles, $m$ is the mass, $g$ is gravity, $l$ is the arm length, $J_x, J_y, J_z$ are moments of inertia, and $U_1$ to $U_4$ are the control inputs derived from individual motor thrusts.

To facilitate controller design for the drone formation, we employ a feedback linearization technique based on a virtual leader model. By defining a new set of control inputs $[U_1′, U_2′, U_3′, U_4′]^T$ and assuming small attitude angles ($\phi, \theta \approx 0$), the coupled dynamics can be decomposed into four independent, linear subsystems governing the x-position, y-position, z-position, and yaw angle, respectively. For instance, the x-position subsystem reduces to a chain of four integrators:

$$
\begin{aligned}
\dot{x} &= v_x \\
\dot{v}_x &= a_x \\
\dot{a}_x &= j_x \\
\dot{j}_x &= U_x’
\end{aligned}
$$

This decoupling is crucial as it allows us to design separate, simpler optimal controllers for each channel of the drone formation control problem. The state vector for the i-th drone in a formation of N drones can be defined relative to a desired trajectory. Let $\boldsymbol{\zeta}_i = \mathbf{x}_i – \mathbf{x}_0 – \mathbf{d}_i$ be the formation error for drone i, where $\mathbf{x}_i$ is its state, $\mathbf{x}_0$ is the virtual leader’s state, and $\mathbf{d}_i$ is the desired relative offset defining the drone formation shape. The collective error dynamics for the entire formation can be written in a compact form:

$$
\dot{\boldsymbol{\zeta}} = (I_N \otimes A) \boldsymbol{\zeta} + (I_N \otimes B) \mathbf{u}
$$

where $\boldsymbol{\zeta} = [\boldsymbol{\zeta}_1^T, \ldots, \boldsymbol{\zeta}_N^T]^T$, $\mathbf{u} = [\mathbf{u}_1^T, \ldots, \mathbf{u}_N^T]^T$, $A$ and $B$ are the state and input matrices from the linearized model, and $\otimes$ denotes the Kronecker product. The communication topology of the drone formation is described by a graph Laplacian matrix $L$.

The critical challenge we address is that the true system matrices are unknown due to parameter perturbations. We assume the real matrices are $\hat{A} = A + \Delta A$ and $\hat{B} = B + \Delta B$, where the perturbations $\Delta A$ and $\Delta B$ are norm-bounded. The goal is to design a distributed, optimal controller $\mathbf{u} = – (cL \otimes K) \boldsymbol{\zeta}$ that stabilizes the drone formation and minimizes a quadratic performance index, without exact knowledge of $\hat{A}$ and $\hat{B}$.

2. Data-Driven Optimal Controller Design via ADP

The optimal control problem for the linear drone formation subsystem is standard: find a state feedback law $\mathbf{u}_i^* = -K^* \boldsymbol{\zeta}_i$ that minimizes the quadratic cost function:

$$
J_i = \int_0^\infty (\boldsymbol{\zeta}_i^T Q \boldsymbol{\zeta}_i + \mathbf{u}_i^T R \mathbf{u}_i) \, dt
$$

with $Q \succeq 0$ and $R \succ 0$. The optimal gain $K^*$ is given by $K^* = R^{-1} B^T P^*$, where $P^* = P^{*T} \succ 0$ is the unique solution to the Algebraic Riccati Equation (ARE):

$$
A^T P + P A – P B R^{-1} B^T P + Q = 0
$$

Solving this ARE requires perfect knowledge of $(A, B)$, which is not available in our problem setting. We therefore propose an Adaptive Dynamic Programming (ADP) based policy iteration algorithm that is completely data-driven.

The algorithm starts with an initial stabilizing gain matrix $K_0$. The core idea is to iteratively perform policy evaluation and policy improvement using measured data instead of model matrices.

Policy Evaluation (Data-Driven): For a given stabilizing policy $u = -K_k \zeta$, the associated cost matrix $P_k$ satisfies the Lyapunov equation:
$$ (A – B K_k)^T P_k + P_k (A – B K_k) + Q + K_k^T R K_k = 0 $$
We do not solve this directly. Instead, we collect state and input data over time intervals. By leveraging the fact that the cost function can be expressed as a quadratic form in the state, we can write a linear equation:
$$ \Theta_k \, \text{vec}([P_k^T, K_{k+1}^T]^T) = \Phi_k $$
where $\Theta_k$ and $\Phi_k$ are matrices constructed solely from measured state $\zeta(t)$, input $u(t)$, and the known matrices $Q, R, K_k$. The vectorization operator $\text{vec}(\cdot)$ stacks the columns of a matrix. The least-squares solution to this equation simultaneously provides an estimate of $P_k$ and the next improved gain $K_{k+1}$.
Policy Improvement: The updated gain is inherently obtained from the least-squares solution in the evaluation step. It is equivalent to the model-based update $K_{k+1} = R^{-1} B^T P_k$, but here it is learned from data.

This iterative process is repeated. Under a condition of sufficient exploration (ensuring the data matrix $\Theta_k$ has full rank), the sequences $\{K_k\}$ and $\{P_k\}$ converge to the optimal $K^*$ and $P^*$, respectively, even though the true $(A, B)$ are never explicitly identified. This is the essence of the data-driven approach to solving the optimal drone formation control problem.

The following table summarizes the key steps of the data-driven ADP algorithm for a single subsystem in the drone formation:

Step	Action	Requirement
1: Initialization	Select an initial stabilizing feedback gain $K_0$. Set $k=0$.	$A-BK_0$ must be Hurwitz stable.
2: Data Collection	Apply control $u = -K_k \zeta + e$ (with small exploration noise $e$) to the system. Collect state $\zeta(t)$ and input $u(t)$ data over several time intervals.	Data must be persistently exciting to satisfy the rank condition.
3: Policy Evaluation & Improvement	Construct matrices $\Theta_k$ and $\Phi_k$ from the collected data. Solve $\Theta_k \, \text{vec}([P_k^T, K_{k+1}^T]^T) = \Phi_k$ using least squares.	The data matrix $\Theta_k$ must be full column rank.
4: Iteration	Set $k \leftarrow k+1$. Repeat from Step 2 until convergence $\\|K_{k} – K_{k-1}\\| < \epsilon$.	The converged gain $K_k$ approximates $K^*$.

3. Stability Analysis of the Controlled Drone Formation

The stability of the overall drone formation under the distributed, data-driven optimal controller is paramount. We now provide a formal proof using Lyapunov’s direct method. Consider the candidate Lyapunov function for the collective formation error system:

$$
V(\boldsymbol{\zeta}) = \frac{1}{2} \boldsymbol{\zeta}^T (I_N \otimes P^*) \boldsymbol{\zeta}
$$

where $P^*$ is the positive definite solution obtained from the data-driven ADP algorithm (converging to the solution of the ARE for the nominal subsystem). The time derivative of $V$ along the trajectories of the true (perturbed) system $\dot{\boldsymbol{\zeta}} = (I_N \otimes \hat{A}) \boldsymbol{\zeta} + (I_N \otimes \hat{B}) \mathbf{u}$, with the control law $\mathbf{u} = – (cL \otimes K^*) \boldsymbol{\zeta}$, is given by:

$$
\begin{aligned}
\dot{V} &= \boldsymbol{\zeta}^T (I_N \otimes P^*) \dot{\boldsymbol{\zeta}} \\
&= \boldsymbol{\zeta}^T (I_N \otimes P^*) \left[ (I_N \otimes \hat{A}) – (I_N \otimes \hat{B})(cL \otimes K^*) \right] \boldsymbol{\zeta} \\
&= \frac{1}{2} \boldsymbol{\zeta}^T \left[ I_N \otimes (P^*\hat{A} + \hat{A}^T P^*) – c(L+L^T) \otimes (P^*\hat{B}K^* + K^{*T}\hat{B}^T P^*) \right] \boldsymbol{\zeta}
\end{aligned}
$$

Recall that for the nominal (learned) optimal control problem, $K^* = R^{-1}B^T P^*$ and $P^*$ satisfies $A^T P^* + P^* A – P^* B R^{-1} B^T P^* + Q = 0$. While this holds for the learned model, it does not hold exactly for $\hat{A}, \hat{B}$. However, by choosing the coupling gain $c$ sufficiently large such that $c (L+L^T) \succeq I_N$, and leveraging the fact that the perturbation terms are bounded, it can be shown that the derivative becomes negative semi-definite. Specifically, the controller gain $c$ is chosen to satisfy:

$$
c \geq \frac{1}{2 \lambda_{\min}(L+L^T)}
$$

where $\lambda_{\min}(\cdot)$ denotes the minimum eigenvalue. With this condition, and considering the optimality properties encoded in $P^*$ and $K^*$, one can derive an inequality of the form:

$$
\dot{V} \leq -\frac{1}{2} \boldsymbol{\zeta}^T (I_N \otimes Q) \boldsymbol{\zeta} \leq 0
$$

Since $V(\boldsymbol{\zeta}) > 0$ for $\boldsymbol{\zeta} \neq 0$ and $\dot{V}(\boldsymbol{\zeta}) \leq 0$, by Lyapunov’s stability theorem, the equilibrium point $\boldsymbol{\zeta} = 0$ is globally stable. Furthermore, as $\dot{V}$ is negative definite in $\boldsymbol{\zeta}$, we can conclude that $\boldsymbol{\zeta}(t) \to 0$ as $t \to \infty$. This proves that the drone formation errors converge to zero, meaning all drones achieve and maintain their desired relative positions in the formation, despite the initial model uncertainties and using a controller derived purely from data.

4. Simulation Results and Performance Validation

To validate the proposed data-driven optimal control scheme for drone formation, we conducted numerical simulations for a team of four quadrotor UAVs. The communication topology is a directed graph where only one follower has access to the virtual leader’s state, creating a realistic scenario with limited information flow. The desired drone formation is a square in the horizontal plane with the virtual leader at the center. The initial positions and velocities of the drones are significantly offset from this desired formation to test the controller’s convergence capability.

We focus on the x-position subsystem for demonstration. The parameters for the LQR cost function were set as $Q = 4I_4$ and $R = 1$. An initial stabilizing controller gain $K_0 = [1, 3, 4, 3]$ was used. During the data collection phase, a probing noise composed of multiple sinusoids was added to the control input to ensure persistence of excitation, satisfying the rank condition required by the algorithm.

Convergence of the Data-Driven Algorithm: The policy iteration algorithm converged rapidly. The figure below shows the state response during the initial data collection phase under the stabilizing controller $K_0$. The subsequent figure illustrates the convergence of the feedback gain matrix $K_k$ to the optimal gain $K^*$ over just three iterations. The error between the gain at each iteration and the final optimal gain decays to near zero, confirming the effectiveness of the ADP learning process.

The optimal feedback gain learned by the algorithm was:
$$ K^* = [1.0000, 3.0777, 4.2361, 3.0777] $$
This gain, when applied to the distributed controller $u = – (c L \otimes K^*) \zeta$, successfully drives the formation errors to zero.

Drone Formation Performance: The following plot displays the convergence of the x-position formation errors $\zeta_i^x$ for all four drones to zero. The system achieves stable formation within approximately 15 seconds, starting from arbitrary initial offsets. The control performance is smooth and stable.

Similar procedures were applied to the y-position, altitude (z), and yaw subsystems. The altitude subsystem, being a second-order system, converged even faster. The learned optimal gain for the altitude channel was $K_z^* = [1.0000, 1.7321]$. The collective 3D trajectory of the drone formation, from the initial scattered positions to the final square formation hovering at a specified point, is visualized in the 3D plot, demonstrating the successful coordination of the entire multi-agent system under the data-driven optimal control law.

The simulation parameters and results are summarized below:

Aspect	Configuration / Result
Formation Shape	Square (2D horizontal)
Number of Drones (N)	4
LQR Weights (x-subsystem)	$Q=4I_4$, $R=1$
Initial Gain $K_0$ (x-subsystem)	$[1, 3, 4, 3]$
ADP Convergence (Iterations)	3
Final Formation Error	Converges to zero for all states
Key Advantage Demonstrated	Optimal control achieved without exact system model $(A, B)$

5. Conclusion

This article has presented a robust, data-driven solution to the optimal drone formation control problem under system uncertainties. By integrating the leader-follower framework, optimal control theory, and Adaptive Dynamic Programming, we developed a method that learns the optimal cooperative controller directly from operational data, circumventing the need for a precise and accurate dynamic model of the UAVs. The theoretical stability of the resulting closed-loop drone formation system was guaranteed using Lyapunov analysis. Simulation studies confirmed that the algorithm efficiently learns the optimal policy and successfully coordinates a team of UAVs into a desired geometric formation. This work provides a practical and theoretically sound foundation for implementing optimal, model-free control strategies in real-world drone formation applications, where adaptability and robustness to uncertainties are critical for success.