1. Introduction
The evolution of modern warfare and combat systems has significantly elevated the role of unmanned aerial vehicles (UAVs). Their exceptional maneuverability, stealth capabilities, and cost-effectiveness make them indispensable assets in contemporary military operations. A single unmanned aerial vehicle, however, is limited in sensor range, functionality, and numbers, which restricts multi-angle reconnaissance, precise positioning, operational scope, and strike efficiency. Crucially, the loss or failure of a single unmanned aerial vehicle during a mission often leads to catastrophic mission failure. Consequently, the demand for coordinated operations using unmanned aerial vehicle swarms has intensified dramatically. An unmanned aerial vehicle swarm is a system composed of multiple autonomously operating unmanned aerial vehicles capable of independent decision-making, information exchange, environmental perception, communication, and collaborative task execution. Swarm systems overcome the limitations of individual unmanned aerial vehicles, enhance mission flexibility and success rates, and demonstrate immense potential in complex battlefield environments, making them a critical modern combat paradigm.
Successful execution of multi-unmanned aerial vehicle missions frequently requires the swarm to form specific formations, making high-quality formation control indispensable. While significant progress has been made in formation control strategies, many existing approaches, particularly within the popular leader-follower framework, neglect optimality considerations during the control process. In this framework, a designated leader unmanned aerial vehicle follows a predefined trajectory, while follower unmanned aerial vehicles acquire the leader's state in real time and maintain a relative distance according to predefined strategies, enabling coordinated flight and effective control. This method combines centralized planning with distributed execution. However, traditional cooperative game approaches, which optimize resources and task allocation from a global perspective, require extensive communication and negotiation between unmanned aerial vehicles; this becomes impractical in highly dynamic environments, and the computational complexity escalates rapidly with the number of unmanned aerial vehicles.

Optimal control theory addresses performance metric minimization (e.g., time, energy) using the Hamilton-Jacobi-Bellman (HJB) equation to derive optimal control policies. Traditional optimal control often employs open-loop frameworks, lacking adaptability to dynamic environments and exhibiting poor robustness. Reinforcement Learning (RL), specifically Approximate Dynamic Programming (ADP) or Adaptive Dynamic Programming, offers a solution by enabling online learning of optimal control policies through iterative approximation of the optimal performance function. While RL-based optimal control and differential games have been explored, applying them directly to complex, hierarchical interactions within unmanned aerial vehicle swarms remains challenging.
Stackelberg games provide a promising hierarchical, non-cooperative game-theoretic framework for these challenges. In this model, a leader moves first, anticipating the reactions of followers who then optimize their strategies based on the leader’s decision. When applied to unmanned aerial vehicle formation control, the leader can flexibly adjust its position based on environmental changes and mission requirements, while followers react optimally to the leader’s position, adapting to dynamic conditions. Crucially, both leader and followers only need to consider their best response given the other’s strategy, significantly reducing communication and coordination overhead compared to fully cooperative models. While Stackelberg games have shown promise in unmanned aerial vehicle communication resource allocation problems, their application to direct unmanned aerial vehicle formation control is relatively unexplored. Investigating formation control under dynamic adversarial scenarios using Stackelberg game theory aligns closely with practical operational requirements.
This work addresses the multi-unmanned aerial vehicle formation control problem by proposing a novel method integrating Stackelberg game theory and Hierarchical Reinforcement Learning (HRL). We establish a leader-follower relative dynamics model and formulate the interaction as a continuous-time Stackelberg-Nash differential game involving N+1 players (1 leader, N followers). We derive the coupled HJB equations governing the optimal strategies for all participants. To efficiently solve these complex coupled equations, we design a Two-Stage Value Iteration-based Integral Reinforcement Learning (VI-IRL) algorithm. Furthermore, we incorporate neural networks to approximate the value functions and their gradients, enhancing computational efficiency and accuracy. Numerical simulations validate the effectiveness of our approach in achieving rapid and stable formation reconfiguration for unmanned aerial vehicle swarms.
2. Relative Dynamics Modeling
2.1 Leader-Follower UAV Relative Dynamics
To derive control strategies and simulate the control process for a swarm of unmanned aerial vehicles, we first establish equations of motion describing the relative movement between leader and follower unmanned aerial vehicles. This requires defining common coordinate systems and their transformations:
- Body Coordinate System ($F_b$): Fixed to the unmanned aerial vehicle body, with origin at the center of gravity (CoG) $O_b$. $X_b$ points forward along the longitudinal axis, $Y_b$ is perpendicular to $X_b$ within the longitudinal plane of symmetry, and $Z_b$ completes the right-hand rule.
- Ground Coordinate System ($F_g$): Origin $O_g$ is the ground projection of a chosen unmanned aerial vehicle's CoG at takeoff. $X_g$ points East, $Y_g$ points North, and $Z_g$ completes the right-hand rule.
- Inertial Coordinate System ($F_i$): Origin at the unmanned aerial vehicle's CoG $O_i$. $X_i$, $Y_i$, and $Z_i$ are parallel to $X_g$, $Y_g$, and $Z_g$, respectively.
Relative motion modeling is conducted within the ground coordinate system ($F_g$), describing the relative position and velocity of follower unmanned aerial vehicles with respect to the leader.
2.1.2 Dynamics of Relative Motion
Within the leader-follower unmanned aerial vehicle formation framework, consider a continuous-time Stackelberg game with $N+1$ unmanned aerial vehicles. The player set is denoted $V = \{0, 1, \ldots, N\}$. Unmanned aerial vehicle 0 is the leader, occupying the dominant position in the hierarchy. The remaining unmanned aerial vehicles are followers, denoted by the set $F = \{1, 2, \ldots, N\}$, all of equal status.
In the 2D plane $xOy$, the dynamics of the $i$-th unmanned aerial vehicle ($i \in V$) can be expressed as:

$$\begin{cases} \dot{x}_{i,x} = v_{i,x} \\ \dot{x}_{i,y} = v_{i,y} \\ \dot{v}_{i,x} = a_{i,x} \\ \dot{v}_{i,y} = a_{i,y} \end{cases}$$

(1)
where $x_{i,x}$ and $x_{i,y}$ represent the position of the $i$-th unmanned aerial vehicle in the x and y directions, $v_{i,x}$ and $v_{i,y}$ represent its velocity components, and $a_{i,x}$ and $a_{i,y}$ represent its control input signals (accelerations) in the x and y directions.
We assume the leader can communicate with all followers, but no direct communication exists between any two follower unmanned aerial vehicles. Therefore, relative dynamics are established solely between follower $i$ and the leader (unmanned aerial vehicle 0). The relative dynamics model for any follower $i$ ($i \in F$) with respect to the leader is:

$$\begin{cases} \dot{x}_{R,i,x} = v_{L,x} - v_{i,x} \\ \dot{x}_{R,i,y} = v_{L,y} - v_{i,y} \\ \dot{v}_{R,i,x} = a_{L,x} - a_{i,x} \\ \dot{v}_{R,i,y} = a_{L,y} - a_{i,y} \end{cases}$$

(2)
where $x_{R,i,x}$ and $x_{R,i,y}$ represent the relative position of the follower with respect to the leader in the $xOy$ plane, i.e., $x_{R,i,x} = x_{L,x} - x_{i,x}$, $x_{R,i,y} = x_{L,y} - x_{i,y}$; $v_{R,i,x}$ and $v_{R,i,y}$ represent the relative velocity components, i.e., $v_{R,i,x} = v_{L,x} - v_{i,x}$, $v_{R,i,y} = v_{L,y} - v_{i,y}$.
Equation (2) can be expressed in state-space form:

$$\begin{pmatrix} \dot{x}_{R,i,x} \\ \dot{x}_{R,i,y} \\ \dot{v}_{R,i,x} \\ \dot{v}_{R,i,y} \end{pmatrix} = \begin{pmatrix} 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{pmatrix} \begin{pmatrix} x_{R,i,x} \\ x_{R,i,y} \\ v_{R,i,x} \\ v_{R,i,y} \end{pmatrix} + \begin{pmatrix} 0 & 0 \\ 0 & 0 \\ 1 & 0 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} a_{L,x} \\ a_{L,y} \end{pmatrix} + \begin{pmatrix} 0 & 0 \\ 0 & 0 \\ -1 & 0 \\ 0 & -1 \end{pmatrix} \begin{pmatrix} a_{i,x} \\ a_{i,y} \end{pmatrix}$$

(3)
The state vector for follower $i$ relative to the leader is $x_{R,i} = [x_{R,i,x}, x_{R,i,y}, v_{R,i,x}, v_{R,i,y}]^T$. The control input vector of the leader is $u_L = [a_{L,x}, a_{L,y}]^T$, and the control input vector of follower $i$ is $u_i = [a_{i,x}, a_{i,y}]^T$. The system matrices are:

$$A = \begin{pmatrix} 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{pmatrix}, \quad B_L = \begin{pmatrix} 0 & 0 \\ 0 & 0 \\ 1 & 0 \\ 0 & 1 \end{pmatrix}, \quad B_{F,i} = \begin{pmatrix} 0 & 0 \\ 0 & 0 \\ -1 & 0 \\ 0 & -1 \end{pmatrix}$$
Thus, $\dot{x}_{R,i} = A x_{R,i} + B_L u_L + B_{F,i} u_i$.
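As an illustration of the model in (3), the following minimal sketch (Python with NumPy; the time step and the simple forward-Euler integrator are illustrative assumptions, not part of the original formulation) propagates the relative state of one follower under given leader and follower accelerations.

```python
import numpy as np

# Relative dynamics matrices from Eq. (3): x_dot = A x + B_L u_L + B_F u_i
A = np.array([[0., 0., 1., 0.],
              [0., 0., 0., 1.],
              [0., 0., 0., 0.],
              [0., 0., 0., 0.]])
B_L = np.array([[0., 0.],
                [0., 0.],
                [1., 0.],
                [0., 1.]])
B_F = np.array([[0., 0.],
                [0., 0.],
                [-1., 0.],
                [0., -1.]])

def step(x_rel, u_leader, u_follower, dt=0.01):
    """One forward-Euler step of the relative dynamics (illustrative integrator)."""
    x_dot = A @ x_rel + B_L @ u_leader + B_F @ u_follower
    return x_rel + dt * x_dot

# Example: follower starts 5 m behind the leader in x with zero relative velocity.
x_rel = np.array([5.0, 0.0, 0.0, 0.0])
x_rel = step(x_rel, u_leader=np.array([0.1, 0.0]), u_follower=np.array([0.2, 0.0]))
print(x_rel)
```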
3. Problem Formulation: Stackelberg-Nash Differential Game
3.1 Stackelberg Game Framework
The Stackelberg game model originates in economics and describes hierarchical decision-making between two players. The leader aims to optimize its performance metric by anticipating the follower's best response; the follower observes the leader's decision and then chooses its own optimal strategy. When one leader and multiple followers exist, and all followers make decisions simultaneously, the hierarchical problem is modeled as a Stackelberg-Nash game.
For the continuous-time Stackelberg game with $N+1$ unmanned aerial vehicles, the overall system dynamics incorporating relative states can be represented as:

$$\dot{x} = f(x(t)) + g_0(x(t)) u_0(t) + \sum_{i=1}^{N} g_i(x(t)) u_i(t)$$

(4)
where $x(t) \in \mathbb{R}^n$ is the composite state vector (e.g., encompassing all relative states $x_{R,i}$), $u_0(t) \in U_0 \subseteq \mathbb{R}^{m_0}$ is the leader's control strategy, and $u_i(t) \in U_i \subseteq \mathbb{R}^{m_i}$ is the control strategy of the $i$-th follower ($i \in F$). $f(x) \in \mathbb{R}^n$ represents the system drift dynamics, and $g_i(x) \in \mathbb{R}^{n \times m_i}$ represent the input dynamics matrices. We make the following assumptions:
- Assumption 1: $f(x(t))$ and $g_i(x(t))$, $i \in V$, are Lipschitz continuous on a set $\Omega \subseteq \mathbb{R}^n$ containing the origin, and system (4) is stabilizable on $\Omega$.
- Assumption 2: $f(0) = 0$, and $u_i(0) = 0$ for $i \in V$.
3.2 Cost Functions and Equilibrium
In this Stackelberg game, each player seeks to minimize its cost function:
For $i \in V$,

$$J_i(x_0, u_0, u) = \int_0^\infty r_i(x(t), u_0(t), u(t)) \, dt$$

(5)
The instantaneous cost functions $r_i$ are defined as quadratic forms reflecting state error and control effort, incorporating coupling between leader and followers:
- For the leader ($i = 0$):

  $$r_0(x, u_0, u) = x^T Q_0 x + \Big(u_0 + \sum_{j=1}^{N} C_j u_j\Big)^T R_0 \Big(u_0 + \sum_{j=1}^{N} C_j u_j\Big)$$

  (6)

  where $Q_0 = Q_0^T > 0$ is the state weighting matrix, $R_0 = R_0^T > 0$ is the control weighting matrix, and $C_j \in \mathbb{R}^{m_0 \times m_j}$ are coupling matrices between the leader and the $j$-th follower.

- For a follower $i \in F$:

  $$r_i(x, u_0, u) = x^T Q_i x + (u_i + D_i u_0)^T R_i (u_i + D_i u_0)$$

  (7)

  where $Q_i = Q_i^T > 0$, $R_i = R_i^T > 0$, and $D_i \in \mathbb{R}^{m_i \times m_0}$ are coupling matrices between follower $i$ and the leader. A short numerical sketch of these running costs appears after this list.
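To make the coupled quadratic structure of (6)–(7) concrete, the sketch below evaluates $r_0$ and $r_i$ for given state and inputs. It is a hypothetical example: the dimensions, weight values, and coupling matrices are illustrative choices, not taken from the paper's simulation setup.

```python
import numpy as np

def leader_cost(x, u0, u_followers, Q0, R0, C):
    """Instantaneous leader cost r_0 from Eq. (6)."""
    coupled = u0 + sum(C[j] @ u_followers[j] for j in range(len(u_followers)))
    return x @ Q0 @ x + coupled @ R0 @ coupled

def follower_cost(x, u0, ui, Qi, Ri, Di):
    """Instantaneous follower cost r_i from Eq. (7)."""
    coupled = ui + Di @ u0
    return x @ Qi @ x + coupled @ Ri @ coupled

# Illustrative dimensions: 4-dim relative state, 2-dim inputs, two followers.
n, m = 4, 2
Q0, R0 = 2.0 * np.eye(n), 1.5 * np.eye(m)
Qi, Ri = 1.5 * np.eye(n), 1.5 * np.eye(m)
C = [0.1 * np.eye(m), 0.1 * np.eye(m)]   # leader-follower coupling (assumed values)
Di = 0.1 * np.eye(m)                      # follower-leader coupling (assumed value)

x = np.array([5.0, -2.0, 0.0, 0.0])
u0 = np.array([0.1, 0.0])
u = [np.array([0.2, 0.0]), np.array([0.0, -0.1])]
print(leader_cost(x, u0, u, Q0, R0, C), follower_cost(x, u0, u[0], Qi, Ri, Di))
```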
3.3 Stackelberg-Nash Equilibrium (SNE)
The solution concept for this hierarchical game is the Stackelberg-Nash Equilibrium (SNE):
- Definition (Stackelberg-Nash Equilibrium): A strategy profile $(u_0^*, u_1^*, \cdots, u_N^*) \in U_0 \times U_1 \times \cdots \times U_N$ constitutes a Stackelberg-Nash Equilibrium if:
  - For each follower $i \in F$, there exists a mapping $T_i: U_0 \to U_i$ such that for any fixed leader strategy $u_0 \in U_0$:

    $$J_i(x_0, u_0, T_i(u_0), T_{-i}(u_0)) \le J_i(x_0, u_0, u_i, T_{-i}(u_0))$$

    holds for all $u_i \in U_i$, where $T_{-i}(u_0) \triangleq \{T_j(u_0) \mid j \in F, j \ne i\}$.
  - There exists a leader strategy $u_0^*$ such that:

    $$J_0(x_0, u_0^*, T_1(u_0^*), \cdots, T_N(u_0^*)) \le J_0(x_0, u_0, T_1(u_0), \cdots, T_N(u_0))$$

    holds for all $u_0 \in U_0$, where $u_i^* = T_i(u_0^*)$ for $i \in F$.
Essentially, at the SNE, the leader commits to its optimal strategy $u_0^*$ anticipating the followers' best responses $T_i(u_0^*)$, and each follower $i$, knowing the leader's strategy $u_0^*$ and the strategies of the other followers $T_{-i}(u_0^*)$, has no incentive to deviate from $u_i^* = T_i(u_0^*)$.
4. Theoretical Foundation and Algorithm Design
4.1 Optimal Control and HJB Equations
The optimal strategies for the leader and followers, constituting the SNE, are governed by a set of coupled Hamilton-Jacobi-Bellman (HJB) equations derived from dynamic programming principles applied to the hierarchical structure.
The value function $V_i(x(t))$ for player $i \in V$ represents the cost-to-go from state $x(t)$ under the current policies:

$$V_i(x(t)) = \int_t^\infty r_i(x(\tau), u_0(\tau), u(\tau)) \, d\tau$$

(8)
The optimal value function $V_i^*(x(t))$ and the optimal policy $u_i^*(x(t))$ satisfy:

$$u_i^*(x(t)) = \arg\min_{u_i} V_i^*(x(t)), \quad i \in V$$

$$V_i^*(x(t)) = \begin{cases} \min_{u_0} \int_t^\infty r_0(x, u_0, u^*) \, d\tau, & i = 0 \\ \min_{u_i} \int_t^\infty r_i(x, u_0, u_{-i}^*, u_i) \, d\tau, & i \in F \end{cases}$$

(9)
The Hamiltonian function for player $i$ is constructed as:

$$H_i(x, \nabla V_i, u_0, u) = r_i(x, u_0, u) + (\nabla V_i)^T \Big[f(x) + \sum_{j=0}^{N} g_j(x) u_j\Big]$$

(10)
where $\nabla V_i \triangleq \partial V_i(x(t)) / \partial x(t) \in \mathbb{R}^n$ is the gradient of the value function.
The optimality conditions ($\partial H_i / \partial u_i = 0$) yield the optimal feedback policies:
- Follower Optimal Policy (for given $u_0$): For any given leader strategy $u_0$, the optimal policy for follower $i \in F$ is:

  $$u_i^*(u_0) = -D_i u_0 - \tfrac{1}{2} R_i^{-1} g_i(x)^T \nabla V_i$$

  (11)

  where $\nabla V_i$ corresponds to the value function under strategies $u_0$ and $u_j^*(u_0)$ for $j \in F$.

- Leader Optimal Policy: The leader, anticipating the followers' best response mappings $u_i^*(u_0)$, determines its optimal strategy:

  $$u_0^* = \arg\min_{u_0} H_0(x, \nabla V_0^*, u_0, u^*(u_0))$$

  (12)

  Solving $\partial H_0 / \partial u_0 = 0$ gives:

  $$u_0^* = \tfrac{1}{2} (K^T R_0 K)^{-1} \Big(\sum_{j=1}^{N} g_j(x) D_j - g_0(x)\Big)^T \nabla V_0^* + \tfrac{1}{2} K^{-1} \sum_{j=1}^{N} C_j R_j^{-1} g_j(x)^T \nabla V_j^*$$

  (13)

  where $K = I_{R_0} - \sum_{j=1}^{N} C_j D_j$ (assuming $\sum_{j=1}^{N} C_j D_j \ne I_{R_0}$, with $I_{R_0}$ the identity matrix of the same dimension as $R_0$) is defined for iterative computation.
The optimal follower policy under the leader's optimal strategy is:

$$u_i^*(u_0^*) = -D_i u_0^* - \tfrac{1}{2} R_i^{-1} g_i(x)^T \nabla V_i^*, \quad i \in F$$

(14)
The optimal value functions $V_i^*$ for all players $i \in V$ must satisfy the following system of coupled HJB equations:

$$H_i(x, \nabla V_i^*, u_0^*, u^*) = r_i(x, u_0^*, u^*) + (\nabla V_i^*)^T \Big[f(x) + \sum_{j=0}^{N} g_j(x) u_j^*\Big] = 0$$

(15)
with boundary condition $V_i^*(0) = 0$.
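For readability, here is a short derivation sketch (using only the quantities already defined above) of how the follower policy (11) follows from the stationarity condition $\partial H_i / \partial u_i = 0$ applied to the Hamiltonian (10) with the follower cost (7):

$$\frac{\partial H_i}{\partial u_i} = 2 R_i \left(u_i + D_i u_0\right) + g_i(x)^T \nabla V_i = 0 \;\Longrightarrow\; u_i^*(u_0) = -D_i u_0 - \tfrac{1}{2} R_i^{-1} g_i(x)^T \nabla V_i,$$

which is exactly (11). The leader's condition $\partial H_0 / \partial u_0 = 0$ is handled analogously after substituting the followers' best responses into $r_0$ and the dynamics, which is what introduces the matrix $K$ in (13).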
Theorem 1 (Stability and Optimality): Suppose there exist smooth value functions $V_i^*(x) > 0$ for $i \in V$, satisfying $V_i^*(0) = 0$ and the coupled HJB equations (15). Let the control policies be given by (13) and (14). Then:
- The closed-loop system (4) is asymptotically stable.
- The policies $(u_0^*, u_1^*, \cdots, u_N^*)$ constitute a Stackelberg-Nash Equilibrium for the game.
4.2 Hierarchical VI-IRL Algorithm Design
Solving the coupled HJB equations (15) analytically is generally intractable for nonlinear systems. We propose a Two-Stage Value Iteration-based Integral Reinforcement Learning (VI-IRL) algorithm to learn approximate solutions online. This algorithm leverages the integral form of the Bellman equation and incorporates the hierarchical structure of the Stackelberg game.
4.2.1 Integral Reinforcement Learning (IRL)
For an arbitrary integral reinforcement interval length $T > 0$, the value function (8) for player $i \in V$ can be written in integral form over $[t, t+T]$:

$$V_i(x(t)) = \int_t^{t+T} r_i(x, u_0, u) \, d\tau + V_i(x(t+T))$$

(16)
This is the integral Bellman equation. The term $\int_t^{t+T} r_i(x, u_0, u) \, d\tau$ is the integral reinforcement over $[t, t+T]$. Under suitable conditions, (16) is equivalent to the differential Bellman equation (the HJB for non-optimal policies).
For the optimal value function under the optimal policies, we have:

$$V_i^*(x(t)) = \int_t^{t+T} r_i(x, u_0^*, u^*) \, d\tau + V_i^*(x(t+T))$$

(17)
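In practice, the integral reinforcement in (16) is evaluated from sampled trajectory data. The sketch below is illustrative only: the trapezoidal quadrature and the sampling interface are assumptions, not something prescribed by the algorithm.

```python
import numpy as np

def integral_reinforcement(times, costs):
    """Approximate the integral of r_i over [t, t+T] from sampled instantaneous
    costs using trapezoidal quadrature (an illustrative choice)."""
    return np.trapz(costs, times)

# Example: running cost sampled every 0.01 s over an interval of length T = 0.5 s.
T, dt = 0.5, 0.01
times = np.arange(0.0, T + dt, dt)
costs = np.exp(-2.0 * times)   # stand-in for r_i(x(tau), u_0(tau), u(tau))
rho = integral_reinforcement(times, costs)
print(rho)                     # integral reinforcement used in Eq. (16)
```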
4.2.2 Two-Stage VI-IRL Algorithm
The hierarchical nature of the Stackelberg game necessitates a structured iterative approach. We design a two-stage algorithm where the leader’s policy and value function are updated first, followed by the simultaneous update of all followers’ policies and value functions, leveraging the leader’s updated strategy.
*Table 1: Two-Stage Value Iteration-based Integral Reinforcement Learning (VI-IRL) Algorithm*
Step | Description | Player(s) |
---|---|---|
1 | Initialize iteration index $s = 0$. Initialize value function estimates $V_0^s = 0$, $V_i^s = 0$ for $i \in F$. Initialize admissible continuous control policies $u_0^s$, $u_i^s$. | All |
2 | Leader Value Update: Solve for $V_0^s$ from $r_0(x, u_0^s, u^s) + (\nabla V_0^s)^T \big[f(x) + \sum_{j=0}^{N} g_j(x) u_j^s\big] = 0$. | Leader (0) |
3 | Leader Policy Improvement: Update the leader policy $u_0^{s+1} = \frac{1}{2}(K^T R_0 K)^{-1} \big(\sum_{j=1}^{N} g_j(x) D_j - g_0(x)\big)^T \nabla V_0^s + \frac{1}{2} K^{-1} \sum_{j=1}^{N} C_j R_j^{-1} g_j(x)^T \nabla V_j^s$. | Leader (0) |
4 | Followers Value Update: For each follower $i \in F$, solve for $V_i^s$ from $r_i(x, u_0^{s+1}, u_i^s, u_{-i}^s) + (\nabla V_i^s)^T \big[f(x) + \sum_{j=0}^{N} g_j(x) u_j^s\big] = 0$. | Followers ($F$) |
5 | Followers Policy Improvement: For each follower $i \in F$, update the policy $u_i^{s+1} = -D_i u_0^{s+1} - \frac{1}{2} R_i^{-1} g_i(x)^T \nabla V_i^s$. | Followers ($F$) |
6 | Set $s = s + 1$. If the convergence criteria are met (e.g., $\lVert u_i^{s+1} - u_i^s \rVert < \epsilon$ and $\lVert V_i^{s+1} - V_i^s \rVert < \epsilon$ for all $i \in V$), stop. Otherwise, go to Step 2. | All |
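The iteration in Table 1 can be organized as the control-flow skeleton below (Python). The helper callables `evaluate_leader_value`, `evaluate_follower_value`, `leader_policy_from_values`, and `follower_policy` are hypothetical stand-ins for the value-update solves (Steps 2 and 4) and the policy-improvement formulas (13) and (11); policies and value functions are assumed to be represented by parameter vectors so that convergence can be checked with a norm.

```python
import numpy as np

def two_stage_vi_irl(evaluate_leader_value, evaluate_follower_value,
                     leader_policy_from_values, follower_policy,
                     u0, u_followers, V_followers, tol=1e-3, max_iters=50):
    """Skeleton of the two-stage VI-IRL loop in Table 1 (control flow only)."""
    for s in range(max_iters):
        # Step 2: leader value update under the current policies.
        V0 = evaluate_leader_value(u0, u_followers)
        # Step 3: leader policy improvement; uses the followers' most recent
        # value representations (here, those from the previous pass).
        u0_new = leader_policy_from_values(V0, V_followers)
        # Step 4: followers' value update, using the improved leader policy.
        V_followers = [evaluate_follower_value(i, u0_new, u_followers)
                       for i in range(len(u_followers))]
        # Step 5: followers' policy improvement (simultaneous best responses).
        u_followers_new = [follower_policy(i, u0_new, V_followers[i])
                           for i in range(len(u_followers))]
        # Step 6: stop when successive policy parameters stop changing.
        change = max(np.linalg.norm(u0_new - u0),
                     max(np.linalg.norm(un - uo)
                         for un, uo in zip(u_followers_new, u_followers)))
        u0, u_followers = u0_new, u_followers_new
        if change < tol:
            break
    return u0, u_followers
```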
4.2.3 Convergence Analysis
The convergence of the VI-IRL algorithm to the optimal SNE solution can be established under Assumptions 1 and 2, with additional mild technical conditions related to the coupling strength and the magnitude of input dynamics.
- Lemma 1 (Value Function Bound): Define $\{V_i^s(x(t)), s = 0, 1, \cdots\}$ and $\{u_i^s, s = 0, 1, \cdots\}$ as the sequences generated by the VI-IRL algorithm for player $i \in V$. Define an auxiliary sequence $\Gamma_i^{s+1}(x(t))$ based on the integral equation with improved policies. If initialized with $V_i^0(x) = \Gamma_i^0(x) = 0$, then $V_i^s(x) \le \Gamma_i^s(x)$ for all $s$ and all $x$.
- Lemma 2 (Boundedness): Under Assumptions 1 and 2, there exists a state-dependent upper bound $Z_i(x(t))$ such that $V_i^s(x(t)) \le Z_i(x(t))$ for each player $i \in V$ and all $s$.
- Theorem 2 (Convergence): Under Assumptions 1 and 2, if the norms of the input dynamics $\lVert g_i(x) \rVert$ are sufficiently small for $i \in V$, and the following inequalities hold:

  $$\sigma_{\min}(R_0) \Big\lVert \sum_{j=1}^{N} C_j \Delta u_j^s \Big\rVert > \lVert K^T R_0 K \rVert \, \lVert g_0 \rVert \, \lVert \nabla V_0^s \rVert$$

  (18)

  $$\sigma_{\min}(R_i) \lVert D_i \Delta u_0^{s+1} \rVert > \lVert R_i \rVert \, \lVert g_i \rVert \, \lVert \nabla V_i^s \rVert, \quad i \in F$$

  (19)

  where $\sigma_{\min}(\cdot)$ denotes the minimum singular value and the $\Delta u$ terms represent policy differences between iterations, then the sequence $\{V_i^s(x(t))\}$ converges for all $i \in V$, i.e., $\lim_{s \to \infty} V_i^s(x(t)) = V_i^\infty(x(t)) < \infty$.

- Theorem 3 (Optimality): Under Assumptions 1, 2, and an additional assumption on the smoothness of the value function gradients (Assumption 3: $\lVert \nabla V_i^s - \nabla V_i^{s-1} \rVert \le F \lVert V_i^s - V_i^{s-1} \rVert$ for some constant $F > 0$, $i \in V$), and assuming uniqueness of the solution to the HJB equations, the converged policies are optimal: $\lim_{s \to \infty} u_i^s = u_i^*$ and $\lim_{s \to \infty} V_i^s = V_i^*$ for all $i \in V$.
4.3 Neural Network Implementation
To implement the VI-IRL algorithm practically, especially for nonlinear systems, we employ neural networks (NNs) to approximate the value functions $V_i(x)$ and their gradients $\nabla V_i(x)$.
4.3.1 Value Function Approximation
According to the Weierstrass approximation theorem, a linearly independent set of basis functions $\{\varphi_{i,k}(x)\}_{k=1}^{L_i}$ can uniformly approximate $V_i(x)$ and its gradient. We use a single-layer NN:

$$\hat{V}_i(x) = (W_i^c)^T \varphi_i(x) = \sum_{k=1}^{L_i} w_{i,k}^c \varphi_{i,k}(x)$$

(20)
where $\varphi_i(x) = [\varphi_{i,1}(x), \varphi_{i,2}(x), \cdots, \varphi_{i,L_i}(x)]^T \in \mathbb{R}^{L_i}$ is the vector of activation functions (e.g., polynomials, radial basis functions), $W_i^c = [w_{i,1}^c, w_{i,2}^c, \cdots, w_{i,L_i}^c]^T \in \mathbb{R}^{L_i}$ is the vector of NN weights, and $L_i$ is the number of neurons. The approximation of the gradient is:

$$\nabla \hat{V}_i(x) = \nabla [\varphi_i(x)^T] W_i^c = \left(\frac{\partial \varphi_i(x)}{\partial x}\right)^T W_i^c$$

(21)
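For the quadratic-monomial basis later used in the simulations (Section 5.1), $\varphi(x)$ and its Jacobian $\partial \varphi / \partial x$ can be written out explicitly. The sketch below is illustrative and hard-codes the 10-term basis for a 4-dimensional state.

```python
import numpy as np

def phi(x):
    """10 quadratic monomials of a 4-dim state (the basis listed in Section 5.1)."""
    x1, x2, x3, x4 = x
    return np.array([x1*x1, x2*x2, x3*x3, x4*x4,
                     x1*x2, x1*x3, x1*x4, x2*x3, x2*x4, x3*x4])

def phi_jacobian(x):
    """Jacobian dphi/dx (10 x 4), so grad V_hat(x) = phi_jacobian(x).T @ W  (Eq. 21)."""
    x1, x2, x3, x4 = x
    return np.array([
        [2*x1, 0.,   0.,   0.  ],
        [0.,   2*x2, 0.,   0.  ],
        [0.,   0.,   2*x3, 0.  ],
        [0.,   0.,   0.,   2*x4],
        [x2,   x1,   0.,   0.  ],
        [x3,   0.,   x1,   0.  ],
        [x4,   0.,   0.,   x1  ],
        [0.,   x3,   x2,   0.  ],
        [0.,   x4,   0.,   x2  ],
        [0.,   0.,   x4,   x3  ],
    ])

# Example: value estimate and its gradient for an arbitrary weight vector W.
W = np.ones(10)
x = np.array([1.0, -0.5, 0.2, 0.0])
V_hat = W @ phi(x)                      # Eq. (20)
grad_V_hat = phi_jacobian(x).T @ W      # Eq. (21)
print(V_hat, grad_V_hat)
```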
4.3.2 Policy Approximation
Substituting the NN approximations (20)–(21) into the policy improvement equations (13) and (11) gives:
- Leader Policy:

  $$\hat{u}_0^{s+1} = \tfrac{1}{2} (K^T R_0 K)^{-1} \Big(\sum_{j=1}^{N} g_j(x) D_j - g_0(x)\Big)^T \nabla[\varphi_0^T W_0^{c,s}] + \tfrac{1}{2} K^{-1} \sum_{j=1}^{N} C_j R_j^{-1} g_j(x)^T \nabla[\varphi_j^T W_j^{c,s}]$$

  (22)

- Follower $i$ Policy:

  $$\hat{u}_i^{s+1} = -D_i \hat{u}_0^{s+1} - \tfrac{1}{2} R_i^{-1} g_i(x)^T \nabla[\varphi_i^T W_i^{c,s}]$$

  (23)
4.3.3 Weight Update via IRL Bellman Error
The NN weights are updated to minimize the error in satisfying the integral Bellman equation (16) under the current policy approximations. Substituting (20) and the policy approximations into (16) yields a residual (Bellman error) for each player:

$$\delta_i^{c,s}(t) = \hat{V}_i^s(x(t)) - \int_t^{t+T} r_i(x, \hat{u}_0^{s+1}, \hat{u}^s) \, d\tau - \hat{V}_i^s(x(t+T))$$

(24)
For the leader ($i = 0$), $\hat{u}^s$ includes $\hat{u}_0^{s+1}$ and $\hat{u}_j^s$ for $j \in F$. For followers ($i \in F$), $\hat{u}^s$ includes $\hat{u}_0^{s+1}$, $\hat{u}_i^{s+1}$, and $\hat{u}_j^s$ for $j \in F$, $j \ne i$.
The goal is to find weights $W_i^{c,s}$ that minimize $\delta_i^{c,s}(t)$ over sampled state trajectories. This is typically done using least-squares methods (e.g., batch least squares or recursive least squares) on data collected by applying the current policies to the system or by simulating it.
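Since $\hat{V}_i^s(x) = (W_i^{c,s})^T \varphi_i(x)$, the Bellman error (24) is linear in the weights: $\delta_i^{c,s}(t_k) = (W_i^{c,s})^T[\varphi_i(x(t_k)) - \varphi_i(x(t_k+T))] - \rho_{i,k}$, where $\rho_{i,k}$ is the integral reinforcement over $[t_k, t_k+T]$, so batch least squares applies directly. The sketch below is a minimal illustration of that step; the array shapes and the regularization safeguard are assumptions.

```python
import numpy as np

def fit_value_weights(phi_t, phi_tT, rho, reg=1e-8):
    """Batch least-squares fit of NN weights from the Bellman error (24).

    phi_t  : (K, L) basis features at the start of each interval, phi(x(t_k))
    phi_tT : (K, L) basis features at the end of each interval,   phi(x(t_k + T))
    rho    : (K,)   integral reinforcements over [t_k, t_k + T]
    Returns W minimizing sum_k ( W^T (phi_t_k - phi_tT_k) - rho_k )^2.
    """
    d_phi = phi_t - phi_tT                                  # regression matrix (K, L)
    # Small Tikhonov term guards against ill-conditioning (an assumed safeguard).
    A = d_phi.T @ d_phi + reg * np.eye(d_phi.shape[1])
    b = d_phi.T @ rho
    return np.linalg.solve(A, b)

# Example with K = 100 sampled intervals and L = 10 basis functions.
rng = np.random.default_rng(0)
phi_t, phi_tT = rng.normal(size=(100, 10)), rng.normal(size=(100, 10))
rho = rng.normal(size=100)
W = fit_value_weights(phi_t, phi_tT, rho)
print(W.shape)  # (10,)
```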
*Table 2: Neural Network-based VI-IRL Implementation*
Step | Description | Player(s) |
---|---|---|
1 | Initialize iteration index $s = 0$. Initialize NN weight vectors $W_0^{c,0} = 0$ and $W_i^{c,0} = 0$ for $i \in F$. Initialize admissible continuous control policies $\hat{u}_0^0$, $\hat{u}_i^0$. | All |
2 | Data Collection: Apply the current policies $\hat{u}_0^s$, $\hat{u}_i^s$ to the system or simulate the dynamics. Collect state-input trajectories and compute costs over intervals $[t_k, t_k + T]$. Store the sample sets. | System |
3 | Leader NN Weight Update: Using the collected data, solve for $W_0^{c,s}$ by minimizing the Bellman error $\delta_0^{c,s}(t_k)$ (Eq. (24) for $i = 0$) via least squares. | Leader (0) |
4 | Leader Policy Improvement: Compute the new leader policy $\hat{u}_0^{s+1}$ using Eq. (22) with the updated $W_0^{c,s}$. | Leader (0) |
5 | Followers NN Weight Update: For each follower $i \in F$, using the collected data and the new leader policy $\hat{u}_0^{s+1}$, solve for $W_i^{c,s}$ by minimizing $\delta_i^{c,s}(t_k)$ (Eq. (24) for $i \in F$) via least squares. | Followers ($F$) |
6 | Followers Policy Improvement: For each follower $i \in F$, compute the new policy $\hat{u}_i^{s+1}$ using Eq. (23) with $\hat{u}_0^{s+1}$ and the updated $W_i^{c,s}$. | Followers ($F$) |
7 | Set $s = s + 1$. If the convergence criteria are met (e.g., $\lVert W_i^{c,s} - W_i^{c,s-1} \rVert < \zeta$ for a small $\zeta > 0$ and all $i \in V$), stop. Otherwise, go to Step 2. | All |
5. Simulation Analysis
5.1 Simulation Setup
To validate the proposed Stackelberg game and HRL-based formation control method for unmanned aerial vehicles, we conducted numerical simulations using MATLAB/Simulink. We considered a formation comprising one leader unmanned aerial vehicle (UAV 0) and six follower unmanned aerial vehicles ($F = \{1, 2, 3, 4, 5, 6\}$).
- Initial Conditions:
- Leader (UAV 0): Initial Position $(x_L, y_L) = (0, 0)$
- Follower 1: Initial Position $(-1.2635, 4.6889)$
- Follower 2: Initial Position $(6.2352, -3.2659)$
- Follower 3: Initial Position $(0.7359, -8.0990)$
- Follower 4: Initial Position $(-3.2754, -3.1455)$
- Follower 5: Initial Position $(-6.8743, -5.65096)$
- Follower 6: Initial Position $(-9.4034, 1.7095)$
Initial velocities for all unmanned aerial vehicles were set to zero relative to the ground frame.
- Algorithm Parameters:
- Cost Weights: $Q_0 = 2 I_4$, $Q_i = 1.5 I_4$ for $i \in F$ ($I_4$ is the $4 \times 4$ identity matrix); $R_0 = 1.5$, $R_i = 1.5$ for $i \in F$.
- Coupling Matrices: $C_j$ and $D_i$ were set based on the relative dynamics structure.
- Neural Networks: Polynomial activation functions were used for both leader and followers:

  $$\varphi_0(x) = \varphi_i(x) = [x_1^2, x_2^2, x_3^2, x_4^2, x_1 x_2, x_1 x_3, x_1 x_4, x_2 x_3, x_2 x_4, x_3 x_4]^T$$

  Initial weights: $W_0^{c,0} = 0_{10}$, $W_i^{c,0} = 0_{10}$ for $i \in F$.
- Initial Policies: Linear state feedback controllers with pre-defined gains were used for initialization.
- State Constraints: Reflecting communication range and collision avoidance, $x_{R,x}, x_{R,y} \in [-10, 10]$ and $v_{R,x}, v_{R,y} \in [-0.1, 0.1]$.
- Data Collection: 100 data points were randomly sampled within the constrained state space for NN training.
- Iteration: The NN-based VI-IRL algorithm (Table 2) was run for 10 iterations.
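For reference, the settings above can be collected into a single configuration structure. The sketch below (Python) is illustrative: the dictionary layout is an assumption, and the coupling matrices are placeholders, since the paper only states that $C_j$ and $D_i$ follow from the relative dynamics structure without listing numeric values.

```python
import numpy as np

# Simulation configuration mirroring Section 5.1 (values taken from the text;
# coupling matrices are placeholders, as their numeric values are not listed).
N_FOLLOWERS = 6
config = {
    "leader_init_pos": np.array([0.0, 0.0]),
    "follower_init_pos": np.array([
        [-1.2635,  4.6889],
        [ 6.2352, -3.2659],
        [ 0.7359, -8.0990],
        [-3.2754, -3.1455],
        [-6.8743, -5.65096],
        [-9.4034,  1.7095],
    ]),
    "Q0": 2.0 * np.eye(4),
    "Qi": 1.5 * np.eye(4),
    "R0": 1.5,
    "Ri": 1.5,
    "Cj": [0.1 * np.eye(2) for _ in range(N_FOLLOWERS)],  # placeholder coupling
    "Di": [0.1 * np.eye(2) for _ in range(N_FOLLOWERS)],  # placeholder coupling
    "n_basis": 10,              # quadratic monomials of the 4-dim relative state
    "pos_bounds": (-10.0, 10.0),
    "vel_bounds": (-0.1, 0.1),
    "n_samples": 100,           # random state samples for NN training
    "n_iterations": 10,         # VI-IRL iterations (Table 2)
}
```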
5.2 Simulation Results and Analysis
- Neural Network Weight Convergence: The evolution of the NN weights for the leader ($W_0$) and a representative follower ($W_1$) is described conceptually below. The results demonstrated that all components of the weight vectors for both the leader and follower value function NNs stabilized and converged to near-optimal values within approximately 4 iterations. This indicates the effectiveness of the two-stage VI-IRL algorithm in learning the value functions associated with the Stackelberg-Nash equilibrium strategies.
- Conceptual Description: Plots depicting the 10 components of $W_0^s$ and $W_1^s$ versus the iteration number $s$ would show initial transients settling rapidly to constant values by iterations 4-5, confirming convergence.
- Formation Reconfiguration Performance: The converged control policies obtained after 10 iterations of the VI-IRL algorithm were applied to control the unmanned aerial vehicle formation. The relative position ($x_{R,x}, x_{R,y}$) and relative velocity ($v_{R,x}, v_{R,y}$) trajectories of all six follower unmanned aerial vehicles with respect to the leader were monitored.
- Conceptual Description:
- Figures 3 & 4 Analogy: Plots of relative position and velocity components vs. time for the six followers would be presented in two groups (e.g., Followers 1-3 and Followers 4-6). All trajectories would exhibit smooth convergence.
- Key Observation: All follower unmanned aerial vehicles successfully converged to their desired relative positions (specific setpoints determined by the cost function and equilibrium) and achieved zero relative velocity with respect to the leader within approximately 4 seconds. This demonstrates that the learned hierarchical RL control policy efficiently drives the unmanned aerial vehicle swarm to the desired formation configuration and maintains it stably.
- Overall Formation Visualization: A schematic diagram depicting the positions of all unmanned aerial vehicles relative to the leader at the initial time ($t_0$) and at the final time ($t_f$) after convergence (around 4 seconds) would visually confirm the successful reconfiguration from the scattered initial positions to the desired formation geometry.
*Table 3: Key Performance Observations from Simulation*
Aspect | Observation | Implication |
---|---|---|
NN Weight Convergence | Weights for leader and follower value functions stabilized within 4-5 iterations. | Algorithm efficiently learns optimal value functions. Robustness to initialization observed. |
Relative Position Convergence | All follower unmanned aerial vehicles reached desired relative positions w.r.t. leader. | Formation control objective successfully achieved. Precise geometric configuration attained. |
Relative Velocity Convergence | Relative velocities of all followers w.r.t. leader converged to zero. | Formation is stable and maintained. Unmanned aerial vehicles move cohesively with the leader. |
Convergence Time | Formation reconfiguration completed within ~4 seconds. | Proposed HRL control policy is efficient and enables rapid response suitable for dynamic environments. |
Stability | No oscillations or instabilities observed after convergence. | Closed-loop system under learned SNE policies is stable, as predicted by Theorem 1. |
6. Conclusion
This work presented a novel integrated approach for optimal unmanned aerial vehicle formation control, combining Stackelberg game theory and Hierarchical Reinforcement Learning (HRL). We addressed the challenge of coordinating a leader unmanned aerial vehicle and multiple follower unmanned aerial vehicles by formulating their interaction as a continuous-time Stackelberg-Nash differential game. A key contribution was the derivation of the coupled Hamilton-Jacobi-Bellman (HJB) equations governing the optimal strategies at the Stackelberg-Nash Equilibrium (SNE).
To overcome the computational intractability of solving these coupled HJB equations directly, we designed a computationally efficient Two-Stage Value Iteration-based Integral Reinforcement Learning (VI-IRL) algorithm. This algorithm explicitly respects the hierarchical leader-follower structure, updating the leader’s policy and value function first, followed by the simultaneous update of all followers’ policies and value functions using the leader’s updated strategy. Furthermore, we leveraged neural networks to approximate the value functions and their gradients, significantly enhancing the algorithm’s ability to handle nonlinearities and improving computational efficiency and accuracy.
Numerical simulations involving a swarm of seven unmanned aerial vehicles (one leader, six followers) demonstrated the effectiveness of the proposed method. The neural network weights converged rapidly, typically within 4-5 iterations. More importantly, the learned control policies enabled all follower unmanned aerial vehicles to converge smoothly and rapidly (within approximately 4 seconds) to their desired relative positions with zero relative velocity with respect to the leader, achieving stable formation reconfiguration. This confirms that the proposed Stackelberg game and HRL framework successfully learns optimal SNE strategies for efficient and stable unmanned aerial vehicle formation control.
The proposed method offers significant advantages for unmanned aerial vehicle swarm operations in complex and dynamic battlefield environments, enabling adaptable, communication-efficient, and optimal formation control. Future work will focus on extending the framework to 3D formation control, reducing model dependency through robust and adaptive learning techniques, incorporating dynamic obstacles and adversaries explicitly, investigating multi-objective optimization within the game, and conducting real-world flight tests for validation.