In modern military and civilian applications, drone technology has revolutionized operations due to its agility, cost-effectiveness, and versatility. Unmanned Aerial Vehicle (UAV) systems are increasingly deployed for tasks such as surveillance, reconnaissance, and coordinated missions. However, single UAVs face limitations in sensor range, operational efficiency, and resilience to failures. To address these challenges, multi-UAV formations have emerged, enabling collaborative task execution through shared information and autonomous decision-making. This paper focuses on formation control for UAV swarms, where a leader-follower framework is adopted to maintain specific configurations during dynamic operations. By integrating Stackelberg game theory with hierarchical reinforcement learning, we propose a novel control strategy that optimizes performance metrics such as energy consumption and formation accuracy. The approach transforms the formation control problem into a differential game, solved via coupled Hamilton-Jacobi-Bellman (HJB) equations and value iteration algorithms. Simulation results validate the method’s effectiveness in achieving stable formations under realistic constraints.
The leader-follower architecture is central to our formulation, where one UAV acts as the leader, guiding the formation along a predefined trajectory, while follower UAVs adjust their positions relative to the leader. This structure reduces communication overhead and enhances scalability. Consider a system with \( N+1 \) UAVs, indexed as \( V = \{0, 1, \dots, N\} \), where UAV 0 is the leader and the set \( F = \{1, 2, \dots, N\} \) represents the followers. The relative dynamics between follower \( i \) (\( i \in F \)) and the leader are modeled in a 2D plane. Let \( x_{x,i} \) and \( x_{y,i} \) denote the positions of UAV \( i \) in the x and y directions, respectively, with velocities \( v_{x,i} \) and \( v_{y,i} \), and control inputs \( a_{x,i} \) and \( a_{y,i} \). The relative dynamics are given by:
$$
\begin{align*}
\dot{x}_{Rx} &= v_{x,0} - v_{x,i}, \\
\dot{x}_{Ry} &= v_{y,0} - v_{y,i}, \\
\dot{v}_{Rx} &= a_{x,0} - a_{x,i}, \\
\dot{v}_{Ry} &= a_{y,0} - a_{y,i},
\end{align*}
$$
where \( x_{Rx} \) and \( x_{Ry} \) are the relative positions, and \( v_{Rx} \) and \( v_{Ry} \) are the relative velocities. This can be expressed in state-space form as:
$$
\dot{\mathbf{x}} = \mathbf{A} \mathbf{x} + \mathbf{B}_0 \mathbf{u}_0 + \sum_{i=1}^N \mathbf{B}_i \mathbf{u}_i,
$$
with state vector \( \mathbf{x} = [x_{Rx}, x_{Ry}, v_{Rx}, v_{Ry}]^T \), leader control \( \mathbf{u}_0 = [a_{x,0}, a_{y,0}]^T \), and follower controls \( \mathbf{u}_i = [a_{x,i}, a_{y,i}]^T \). The matrices \( \mathbf{A} \), \( \mathbf{B}_0 \), and \( \mathbf{B}_i \) follow directly from the relative dynamics above. This model ensures that each follower maintains its desired offset from the leader, which is crucial for formation integrity.
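For a single leader-follower pair, the matrices implied by the relative dynamics can be written out and checked numerically. The sketch below is an assumption-laden illustration (the text does not state \( \mathbf{A} \), \( \mathbf{B}_0 \), \( \mathbf{B}_i \) explicitly); it verifies that the state-space form reproduces the componentwise equations.

```python
# Single-pair state-space model: x = [x_Rx, x_Ry, v_Rx, v_Ry],
# u0 = [a_x0, a_y0], ui = [a_xi, a_yi]. The matrices below are the
# ones implied by the relative dynamics (assumed, not quoted from the text).
A = [[0, 0, 1, 0],
     [0, 0, 0, 1],
     [0, 0, 0, 0],
     [0, 0, 0, 0]]
B0 = [[0, 0], [0, 0], [1, 0], [0, 1]]    # leader input enters with + sign
Bi = [[0, 0], [0, 0], [-1, 0], [0, -1]]  # follower input enters with - sign

def matvec(M, v):
    return [sum(m * c for m, c in zip(row, v)) for row in M]

def xdot(x, u0, ui):
    """State-space form: x_dot = A x + B0 u0 + Bi ui."""
    terms = [matvec(A, x), matvec(B0, u0), matvec(Bi, ui)]
    return [sum(t[k] for t in terms) for k in range(4)]

# Cross-check against the componentwise relative dynamics.
x, u0, ui = [1.0, -2.0, 0.5, 0.25], [0.5, -1.0], [0.25, 0.5]
direct = [x[2], x[3], u0[0] - ui[0], u0[1] - ui[1]]
assert xdot(x, u0, ui) == direct
```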
The Stackelberg game framework models the hierarchical decision-making process: the leader commits to its strategy first, anticipating the followers' responses, and the followers then optimize their actions simultaneously, playing a Nash game among themselves given the leader's strategy. The resulting Stackelberg-Nash equilibrium is a strategy profile in which the leader's strategy is optimal against the followers' best responses and no follower can reduce its cost by deviating unilaterally. The system dynamics are described by:
$$
\dot{\mathbf{x}}(t) = \mathbf{f}(\mathbf{x}(t)) + \mathbf{g}_0(\mathbf{x}(t)) \mathbf{u}_0(t) + \sum_{i=1}^N \mathbf{g}_i(\mathbf{x}(t)) \mathbf{u}_i(t),
$$
where \( \mathbf{f}(\mathbf{x}) \) represents the drift dynamics, and \( \mathbf{g}_0(\mathbf{x}) \), \( \mathbf{g}_i(\mathbf{x}) \) are the input dynamics for the leader and followers, respectively. Each UAV aims to minimize its cost function. For the leader (\( i = 0 \)):
$$
J_0(\mathbf{x}, \mathbf{u}_0, \mathbf{u}_1, \dots, \mathbf{u}_N) = \int_0^\infty \left( \mathbf{x}^T \mathbf{Q}_0 \mathbf{x} + \mathbf{u}_0^T \mathbf{R}_0 \mathbf{u}_0 + \sum_{j=1}^N \mathbf{u}_j^T \mathbf{C}_j \mathbf{u}_j \right) dt,
$$
and for follower \( i \) (\( i \in F \)):
$$
J_i(\mathbf{x}, \mathbf{u}_0, \mathbf{u}_1, \dots, \mathbf{u}_N) = \int_0^\infty \left( \mathbf{x}^T \mathbf{Q}_i \mathbf{x} + \mathbf{u}_i^T \mathbf{R}_i \mathbf{u}_i + \mathbf{u}_0^T \mathbf{D}_i \mathbf{u}_0 \right) dt,
$$
where \( \mathbf{Q}_0 \), \( \mathbf{Q}_i \) are state weighting matrices, \( \mathbf{R}_0 \), \( \mathbf{R}_i \) are control weighting matrices, and \( \mathbf{C}_j \), \( \mathbf{D}_i \) are coupling parameters. The Stackelberg-Nash equilibrium is defined such that the leader’s strategy \( \mathbf{u}_0^* \) and followers’ strategies \( \mathbf{u}_i^* \) satisfy optimality conditions derived from the HJB equations.
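The two cost integrands can be made concrete in a small sketch. Here the weighting matrices are taken as scalars times the identity (the values \( 4.2 \) and \( 1.5 \) come from the simulation section; the coupling weights \( C \) and \( D \) of \( 0.1 \) are purely hypothetical, since the text only says they are "chosen appropriately"):

```python
# Running-cost integrands of the leader and follower cost functionals,
# with scalar weights standing in for the weighting matrices.
def quad(w, v):
    """w * ||v||^2 for a scalar weight w and a vector v."""
    return w * sum(c * c for c in v)

def r_leader(x, u0, followers_u, Q0=4.2, R0=1.5, C=0.1):
    # x^T Q0 x + u0^T R0 u0 + sum_j u_j^T C_j u_j
    return quad(Q0, x) + quad(R0, u0) + sum(quad(C, uj) for uj in followers_u)

def r_follower(x, u0, ui, Qi=1.5, Ri=1.5, D=0.1):
    # x^T Qi x + ui^T Ri ui + u0^T Di u0
    return quad(Qi, x) + quad(Ri, ui) + quad(D, u0)
```

Integrating these rates along a trajectory yields \( J_0 \) and \( J_i \); the leader's integrand penalizes follower effort through \( \mathbf{C}_j \), while each follower's integrand is coupled to the leader only through \( \mathbf{D}_i \).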
To solve the coupled HJB equations, we propose a two-stage value iteration-based integral reinforcement learning (VI-IRL) algorithm. This hierarchical approach iteratively updates the value functions and control policies for the leader and followers. The value function for UAV \( i \) is defined as:
$$
V_i(\mathbf{x}(t)) = \int_t^\infty r_i(\mathbf{x}(\tau), \mathbf{u}_0(\tau), \mathbf{u}_1(\tau), \dots, \mathbf{u}_N(\tau)) d\tau,
$$
where \( r_i \) is the cost rate. The optimal value function \( V_i^* \) satisfies the HJB equation:
$$
H_i(\mathbf{x}, \mathbf{u}_0^*, \mathbf{u}_1^*, \dots, \mathbf{u}_N^*, \nabla V_i^*) = r_i + (\nabla V_i^*)^T \left( \mathbf{f} + \mathbf{g}_0 \mathbf{u}_0^* + \sum_{j=1}^N \mathbf{g}_j \mathbf{u}_j^* \right) = 0.
$$
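As a structural sanity check on this HJB condition, consider a scalar single-agent special case with dynamics \( \dot{x} = u \) and cost rate \( q x^2 + r u^2 \) (illustrative values, not the coupled multi-UAV game): the quadratic value function \( V(x) = p x^2 \) with \( p = \sqrt{qr} \) makes the Hamiltonian vanish identically along the optimal policy.

```python
# Scalar sanity check of the HJB structure: f = 0, g = 1,
# cost rate q x^2 + r u^2 with q = r = 1 (illustrative values only).
q, r = 1.0, 1.0
p = (q * r) ** 0.5            # V(x) = p x^2 solves the scalar HJB

def u_star(x):
    # u* = -(1/2) r^{-1} g^T grad V = -(p / r) x
    return -(p / r) * x

def hamiltonian(x):
    u = u_star(x)
    grad_V = 2.0 * p * x
    return q * x * x + r * u * u + grad_V * u   # r_i + (grad V)^T (f + g u)

for x in (-2.0, 0.5, 3.0):
    assert abs(hamiltonian(x)) < 1e-12
```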
The optimal control policies are derived as:
$$
\mathbf{u}_0^* = -\frac{1}{2} \mathbf{K}^{-1} \left( \mathbf{R}_0 \mathbf{K}^{-1} \sum_{j=1}^N \mathbf{g}_j^T \nabla V_0^* + \sum_{j=1}^N \mathbf{C}_j \mathbf{R}_j^{-1} \mathbf{g}_j^T \nabla V_j^* \right),
$$
and for each follower \( i \in F \), stationarity of \( H_i \) with respect to \( \mathbf{u}_i \) (the \( \mathbf{u}_0^T \mathbf{D}_i \mathbf{u}_0 \) term in \( J_i \) does not depend on \( \mathbf{u}_i \)) gives:
$$
\mathbf{u}_i^* = -\frac{1}{2} \mathbf{R}_i^{-1} \mathbf{g}_i^T \nabla V_i^*,
$$
where \( \mathbf{K} = \mathbf{I} - \sum_{j=1}^N \mathbf{C}_j \mathbf{D}_j \). The VI-IRL algorithm involves the following steps:
| Step | Description |
|---|---|
| 1 | Initialize value functions \( V_i^0 \) and control policies \( \mathbf{u}_i^0 \) for all UAVs. |
| 2 | Update the leader’s value function \( V_0^s \) using the integral Bellman equation. |
| 3 | Improve the leader’s policy \( \mathbf{u}_0^{s+1} \) based on \( \nabla V_0^s \). |
| 4 | Update followers’ value functions \( V_i^s \) simultaneously. |
| 5 | Improve followers’ policies \( \mathbf{u}_i^{s+1} \) using \( \nabla V_i^s \). |
| 6 | Repeat until convergence criteria are met. |
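The evaluate / improve / repeat pattern of the table can be illustrated on a deliberately simplified problem: a scalar discrete-time LQR with value iteration. This single-agent sketch mirrors steps 2-6 but is an analogue, not the coupled two-stage continuous-time VI-IRL algorithm itself; all numeric values are illustrative.

```python
# Scalar discrete-time value iteration (single-agent analogue of the
# evaluate/improve loop; not the paper's coupled leader-follower VI-IRL).
a, b, q, r = 1.0, 1.0, 1.0, 1.0   # x_{k+1} = a x_k + b u_k, illustrative

p = 0.0   # value function V(x) = p x^2, initialised to zero (step 1)
for _ in range(100):
    # Policy improvement given the current value function (steps 3/5):
    k = (b * p * a) / (r + b * b * p)
    # Value update via the Bellman equation under u = -k x (steps 2/4):
    p_next = q + r * k * k + (a - b * k) ** 2 * p
    if abs(p_next - p) < 1e-12:   # convergence check (step 6)
        break
    p = p_next

# For these weights the fixed point is the golden ratio, (1 + sqrt(5)) / 2.
```

The monotone growth of \( p \) from its zero initialisation toward the Riccati fixed point is the same qualitative behaviour the NN weights exhibit in the simulations below.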
To enhance computational efficiency, neural networks (NNs) are employed to approximate the value functions. The value function for UAV \( i \) is represented as:
$$
\hat{V}_i(\mathbf{x}) = \mathbf{W}_i^T \boldsymbol{\phi}(\mathbf{x}),
$$
where \( \boldsymbol{\phi}(\mathbf{x}) \) is a vector of basis functions, and \( \mathbf{W}_i \) is the weight vector. The gradient \( \nabla \hat{V}_i \) is computed accordingly. The NN-based VI-IRL algorithm iteratively updates the weights to minimize the residual in the integral Bellman equation. The policy improvements become:
$$
\mathbf{u}_0^{s+1} = -\frac{1}{2} \mathbf{K}^{-1} \left( \mathbf{R}_0 \mathbf{K}^{-1} \sum_{j=1}^N \mathbf{g}_j^T \nabla \boldsymbol{\phi} \mathbf{W}_0^s + \sum_{j=1}^N \mathbf{C}_j \mathbf{R}_j^{-1} \mathbf{g}_j^T \nabla \boldsymbol{\phi} \mathbf{W}_j^s \right),
$$
and for followers:
$$
\mathbf{u}_i^{s+1} = -\frac{1}{2} \mathbf{R}_i^{-1} \mathbf{g}_i^T \nabla \boldsymbol{\phi} \mathbf{W}_i^s.
$$
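The approximation \( \hat{V}_i = \mathbf{W}_i^T \boldsymbol{\phi}(\mathbf{x}) \) can be sketched with one common concrete choice of basis, the quadratic monomials of the 4-dimensional state (an assumption; the text says only "polynomial basis vectors"):

```python
from itertools import combinations_with_replacement

# Quadratic monomial basis phi(x) for a 4-D state (an assumed concrete
# instance of the polynomial basis mentioned in the text).
def phi(x):
    return [x[i] * x[j]
            for i, j in combinations_with_replacement(range(len(x)), 2)]

def v_hat(w, x):
    """Value approximation V_hat(x) = w^T phi(x)."""
    return sum(wk * pk for wk, pk in zip(w, phi(x)))

x = [1.0, 2.0, 0.5, -1.0]
assert len(phi(x)) == 10   # C(5, 2) = 10 monomials x_i * x_j, i <= j
```

With weights initialised to zero, as in the simulation below, \( \hat{V}_i \equiv 0 \) before learning begins; the iteration then adjusts \( \mathbf{W}_i \) to drive the integral Bellman residual toward zero.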
The algorithm converges to the Stackelberg-Nash equilibrium under practical assumptions, such as Lipschitz continuity of the dynamics and stabilizability of the system, which allows the UAV formation to adapt online to dynamic environments.
Numerical simulations are conducted to validate the proposed method. Consider a formation with one leader and six followers. The leader starts at the origin \( (0, 0) \), and followers have initial positions as follows: Follower 1 at \( (-1.2635, 4.6889) \), Follower 2 at \( (6.2352, -3.2659) \), Follower 3 at \( (0.7359, -8.0990) \), Follower 4 at \( (-3.2754, -3.1455) \), Follower 5 at \( (-6.8743, -5.65096) \), and Follower 6 at \( (-9.4034, 1.7095) \). The state weighting matrices are set to \( \mathbf{Q}_0 = 4.2 \mathbf{I}_4 \) and \( \mathbf{Q}_i = 1.5 \mathbf{I}_4 \), control weights \( \mathbf{R}_0 = \mathbf{R}_i = 1.5 \), and coupling parameters are chosen appropriately. The NN activation functions are polynomial basis vectors, and weights are initialized to zero. The control inputs are parameterized as linear functions of the state, and 100 data samples are collected for training.
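The stated configuration can be collected into a short script; the positions and weights below are taken verbatim from the setup above, while the distance computation is merely an illustrative derived quantity.

```python
# Simulation configuration as stated in the text (coupling parameters
# and training details are omitted here).
leader_pos = (0.0, 0.0)
follower_pos = [
    (-1.2635, 4.6889),
    (6.2352, -3.2659),
    (0.7359, -8.0990),
    (-3.2754, -3.1455),
    (-6.8743, -5.65096),
    (-9.4034, 1.7095),
]
Q0_scale, Qi_scale = 4.2, 1.5   # state weights Q0 = 4.2 I4, Qi = 1.5 I4
R_scale = 1.5                   # control weights R0 = Ri = 1.5

def dist_to_leader(p):
    return ((p[0] - leader_pos[0]) ** 2 + (p[1] - leader_pos[1]) ** 2) ** 0.5

# Initial displacement of each follower from the leader at the origin.
initial_offsets = [dist_to_leader(p) for p in follower_pos]
assert len(follower_pos) == 6
```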

The simulation results demonstrate rapid convergence of the NN weights within four iterations, indicating the efficiency of the learning process. The relative positions and velocities of followers converge to desired values within 4 seconds, achieving stable formation. The following table summarizes the key parameters used in the simulation:
| Parameter | Value | Description |
|---|---|---|
| \( \mathbf{Q}_0 \) | \( 4.2 \mathbf{I}_4 \) | Leader state weight |
| \( \mathbf{Q}_i \) | \( 1.5 \mathbf{I}_4 \) | Follower state weight |
| \( \mathbf{R}_0 \) | 1.5 | Leader control weight |
| \( \mathbf{R}_i \) | 1.5 | Follower control weight |
| NN Activation | Polynomial | Basis functions for approximation |
The evolution of relative positions and velocities shows that all followers successfully maintain the formation despite their initial displacements, underscoring the robustness of the Stackelberg game approach in scenarios where UAVs must adapt to changing conditions. The method balances optimality against computational cost, making it suitable for real-time implementation.
In conclusion, this paper presents a novel formation control strategy for UAV swarms based on Stackelberg game theory and hierarchical reinforcement learning. The leader-follower framework, combined with VI-IRL and neural network approximation, enables efficient solution of the coupled HJB equations. Simulations confirm the method's ability to achieve rapid convergence and stable formations, highlighting its potential for advanced UAV operations. Future work will extend the approach to 3D environments and incorporate real-world uncertainties, further broadening the applicability of UAV systems in complex scenarios.
