6-DOF UAV Missile Avoidance Decisions Based on Deep Reinforcement Learning Algorithm

In Beyond-Visual-Range (BVR) air combat, the attacking side holds a significant operational advantage, making missile evasion a critical capability for improving Unmanned Aerial Vehicle survivability. Traditional missile avoidance strategies often rely on mathematical models, which struggle to adapt to complex and dynamic environments. With advances in automation control and artificial intelligence, autonomous air combat has become a key research area. We propose a deep reinforcement learning approach that addresses slow convergence, poor agent generalization, and the difficulty of controlling six-degree-of-freedom (6-DOF) Unmanned Aerial Vehicles, which can lose control after high-maneuver actions.

Our method, termed GC-PPO, integrates a Gated Recurrent Unit (GRU) with Cosine Annealing-enhanced Proximal Policy Optimization (PPO-Clip) within a hierarchical framework. The GC-PPO algorithm serves as the policy generation layer to develop high-level strategies, while Proportional-Integral-Derivative (PID) controllers in the control layer translate these strategies into actual control commands. This structure effectively handles the high coupling and control challenges of 6-DOF Unmanned Aerial Vehicles. Additionally, we employ Cosine Annealing to dynamically adjust the learning rate and design a reward function system with Reward Shaping to guide the training process toward rapid convergence.

We conduct simulation experiments in randomly generated BVR air combat scenarios to compare various configurations: PPO with and without reward shaping, PPO in a hierarchical framework, and our GC-PPO in a hierarchical framework. The results demonstrate that our GC-PPO algorithm enables Unmanned Aerial Vehicles to effectively evade missiles while maintaining stable flight across diverse scenarios. It outperforms other algorithms in convergence speed and model generalization, significantly enhancing the survival rate of the Unmanned Aerial Vehicle. This work focuses on the JUYE UAV model as a case study to illustrate the practical applications.

Background Knowledge

Deep reinforcement learning combines the perceptual capabilities of deep learning with the decision-making of reinforcement learning, making it suitable for complex tasks like missile avoidance in air combat. Key algorithms include policy gradient methods, PPO-Clip, PID control, and GRU networks. Policy gradient algorithms optimize the policy directly by estimating the gradient of the expected reward. The gradient estimator is given by:

$$ \hat{g} = \mathbb{E}_t \left[ \nabla_\theta \log \pi_\theta(a_t | s_t) \hat{A}_t \right] $$

where $\pi_\theta$ is the stochastic policy, $\hat{A}_t$ is the advantage function estimate at time step $t$, and $\mathbb{E}_t$ denotes the empirical average over a finite batch of samples. The objective function for policy gradient is:

$$ L^{PG}(\theta) = \mathbb{E}_t \left[ \log \pi_\theta(a_t | s_t) \hat{A}_t \right] $$
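As a concrete illustration, a minimal PyTorch sketch of this objective follows; the function and variable names are ours, and the sign is flipped so that gradient descent performs ascent on the expected reward.

```python
import torch

def policy_gradient_loss(log_probs: torch.Tensor, advantages: torch.Tensor) -> torch.Tensor:
    """Negated L^PG(theta) = E_t[log pi_theta(a_t|s_t) * A_hat_t].

    log_probs  -- log pi_theta(a_t|s_t) for the sampled actions, shape (batch,)
    advantages -- advantage estimates A_hat_t, shape (batch,)
    """
    # Advantages are treated as fixed targets; only the policy receives gradients.
    return -(log_probs * advantages.detach()).mean()
```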

The PPO-Clip algorithm improves training stability by limiting policy updates. It uses a clipped objective function to prevent large changes in the policy between updates. The PPO-Clip objective is defined as:

$$ L^{CLIP}(\theta) = \mathbb{E}_t \left[ \min \left( r_t(\theta) \hat{A}_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) \hat{A}_t \right) \right] $$

where $r_t(\theta) = \frac{\pi_\theta(a_t | s_t)}{\pi_{\theta_{\text{old}}}(a_t | s_t)}$ is the probability ratio, and $\epsilon$ is a hyperparameter, typically 0.1 or 0.2. This clipping mechanism ensures that the policy does not update too aggressively, maintaining training stability.
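A minimal PyTorch sketch of this clipped surrogate (names and the default $\epsilon$ are illustrative, not the exact implementation in the paper):

```python
import torch

def ppo_clip_loss(new_log_probs: torch.Tensor,
                  old_log_probs: torch.Tensor,
                  advantages: torch.Tensor,
                  epsilon: float = 0.2) -> torch.Tensor:
    """Clipped surrogate objective L^CLIP(theta), returned as a loss to minimize."""
    # Probability ratio r_t(theta) = pi_theta(a_t|s_t) / pi_theta_old(a_t|s_t)
    ratio = torch.exp(new_log_probs - old_log_probs.detach())
    surrogate1 = ratio * advantages
    surrogate2 = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
    # Take the pessimistic (clipped) bound, then negate for gradient descent.
    return -torch.min(surrogate1, surrogate2).mean()
```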

PID control is a fundamental algorithm in control systems, using proportional, integral, and derivative terms to compute control actions. The PID formula is:

$$ u(t) = K_P E(t) + K_I \int_0^t E(\tau) d\tau + K_D \frac{dE(t)}{dt} $$

where $E(t)$ is the error, $K_P$, $K_I$, and $K_D$ are the proportional, integral, and derivative gains, respectively. In our hierarchical framework, PID controllers convert high-level strategies into actual control commands for the Unmanned Aerial Vehicle, such as elevator, aileron, throttle, and rudder inputs.
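A discrete-time sketch of such a controller is shown below; the gains and the example roll-hold loop are illustrative, not the tuned values used in our framework.

```python
class PIDController:
    """Discrete PID controller: u = Kp*e + Ki*sum(e*dt) + Kd*de/dt."""

    def __init__(self, kp: float, ki: float, kd: float, dt: float):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = 0.0

    def step(self, error: float) -> float:
        # Accumulate the integral term and approximate the derivative numerically.
        self.integral += error * self.dt
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

# Example (hypothetical) roll-hold loop:
#   aileron_cmd = roll_pid.step(ideal_roll - current_roll)
```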

The Gated Recurrent Unit (GRU) is a type of recurrent neural network that mitigates the vanishing gradient problem in long sequences. It uses reset and update gates to control the flow of information. The GRU equations are:

$$ \begin{aligned}
z_t &= \sigma(W_z \cdot [h_{t-1}, x_t]) \\
r_t &= \sigma(W_r \cdot [h_{t-1}, x_t]) \\
\tilde{h}_t &= \tanh(W \cdot [r_t \odot h_{t-1}, x_t]) \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t
\end{aligned} $$

where $z_t$ is the update gate, $r_t$ is the reset gate, $\tilde{h}_t$ is the candidate hidden state, and $h_t$ is the current hidden state. GRU helps in extracting temporal features from sequential state information, which is crucial for making informed decisions in dynamic air combat scenarios involving the JUYE UAV.
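For exposition, a minimal PyTorch cell written to mirror these equations (in practice `torch.nn.GRU` is used directly; bias terms are omitted here to match the formulas above):

```python
import torch
import torch.nn as nn

class MinimalGRUCell(nn.Module):
    """GRU cell mirroring the update-gate / reset-gate equations above."""

    def __init__(self, input_size: int, hidden_size: int):
        super().__init__()
        self.W_z = nn.Linear(hidden_size + input_size, hidden_size, bias=False)
        self.W_r = nn.Linear(hidden_size + input_size, hidden_size, bias=False)
        self.W_h = nn.Linear(hidden_size + input_size, hidden_size, bias=False)

    def forward(self, x_t: torch.Tensor, h_prev: torch.Tensor) -> torch.Tensor:
        concat = torch.cat([h_prev, x_t], dim=-1)
        z_t = torch.sigmoid(self.W_z(concat))          # update gate
        r_t = torch.sigmoid(self.W_r(concat))          # reset gate
        h_tilde = torch.tanh(self.W_h(torch.cat([r_t * h_prev, x_t], dim=-1)))
        return (1 - z_t) * h_prev + z_t * h_tilde      # new hidden state
```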

Problem Description

We model the 6-DOF Unmanned Aerial Vehicle dynamics using the F-16 aircraft model, which includes nonlinear equations of motion. The 6-DOF model accounts for both linear and rotational motions, represented by 12 state equations. The key parameters include position $(x, y, z)$, velocity $V$, angles of attack and sideslip ($\alpha$, $\beta$), Euler angles ($\phi$, $\theta$, $\psi$), and angular rates ($p$, $q$, $r$). The equations are summarized as:

$$ \begin{aligned}
\dot{V} &= \frac{1}{m} (F_x \cos \alpha \cos \beta + F_y \sin \beta + F_z \sin \alpha \cos \beta) \\
\dot{\alpha} &= \frac{1}{mV \cos \beta} (-F_x \sin \alpha + F_z \cos \alpha) + q - \tan \beta (p \cos \alpha + r \sin \alpha) \\
\dot{\beta} &= \frac{1}{mV} (-F_x \cos \alpha \sin \beta + F_y \cos \beta - F_z \sin \alpha \sin \beta) + p \sin \alpha - r \cos \alpha \\
\dot{\phi} &= p + q \sin \phi \tan \theta + r \cos \phi \tan \theta \\
\dot{\theta} &= q \cos \phi – r \sin \phi \\
\dot{\psi} &= \frac{q \sin \phi + r \cos \phi}{\cos \theta}
\end{aligned} $$

where $F_x$, $F_y$, $F_z$ are the force components, and $m$ is the mass of the Unmanned Aerial Vehicle. The control inputs include elevator, aileron, throttle, and rudder, which are managed by PID controllers in our framework.
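In simulation these dynamics are integrated by the flight model itself; purely as an illustration of the attitude-kinematics portion, a numpy sketch of the last three equations (not the full force model) might look like this:

```python
import numpy as np

def euler_angle_rates(phi: float, theta: float, p: float, q: float, r: float):
    """Euler-angle kinematics: (phi_dot, theta_dot, psi_dot) from body rates p, q, r."""
    phi_dot = p + q * np.sin(phi) * np.tan(theta) + r * np.cos(phi) * np.tan(theta)
    theta_dot = q * np.cos(phi) - r * np.sin(phi)
    psi_dot = (q * np.sin(phi) + r * np.cos(phi)) / np.cos(theta)
    return phi_dot, theta_dot, psi_dot
```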

For the BVR air combat missile avoidance problem, we consider a 1v1 scenario in which a friendly Unmanned Aerial Vehicle must evade a missile launched by an enemy aircraft. The relative situation between the Unmanned Aerial Vehicle and the missile is described by parameters such as the antenna train angle (ATA), heading cross angle (HCA), relative distance, and velocity. The formulas for these parameters are:

$$ \begin{aligned}
\text{ATA} &= \arccos \left( \frac{\mathbf{v}_p \cdot \mathbf{D}}{|\mathbf{v}_p| |\mathbf{D}|} \right) \\
\text{HCA} &= \arccos \left( \frac{\mathbf{v}_p \cdot \mathbf{v}_m}{|\mathbf{v}_p| |\mathbf{v}_m|} \right) \\
d &= \sqrt{(X_m - X_p)^2 + (Y_m - Y_p)^2 + (Z_m - Z_p)^2}
\end{aligned} $$

where $\mathbf{v}_p$ and $\mathbf{v}_m$ are the velocity vectors of the Unmanned Aerial Vehicle and missile, respectively, and $\mathbf{D}$ is the distance vector. The state space for the reinforcement learning agent includes 12 dimensions, capturing relative motion information like relative height, speed, distance, and angular rates.
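A numpy sketch of these relative-geometry computations (function and argument names are ours; the clipping guards against floating-point values slightly outside $[-1, 1]$):

```python
import numpy as np

def relative_geometry(pos_p, vel_p, pos_m, vel_m):
    """ATA, HCA, and relative distance from UAV (p) and missile (m) states.

    pos_*, vel_* are 3-element numpy arrays expressed in a common reference frame.
    """
    D = pos_m - pos_p                                  # line-of-sight vector
    d = np.linalg.norm(D)                              # relative distance
    ata = np.arccos(np.clip(np.dot(vel_p, D) / (np.linalg.norm(vel_p) * d), -1.0, 1.0))
    hca = np.arccos(np.clip(np.dot(vel_p, vel_m) /
                            (np.linalg.norm(vel_p) * np.linalg.norm(vel_m)), -1.0, 1.0))
    return ata, hca, d
```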

Table 1: State Space Dimensions for the Unmanned Aerial Vehicle

| Parameter | Description |
| --- | --- |
| ATA | Antenna train angle |
| HCA | Heading cross angle |
| $\Delta h$ | Relative height |
| $\Delta v$ | Relative velocity |
| $\Delta \text{range}$ | Relative distance |
| $\Delta \omega$ | Angular velocity |
| $\text{alt}$ | Altitude |
| $\text{escape angle}$ | Escape angle |

The state vector is defined as:

$$ \text{State} = \left[ \text{ATA}, \text{HCA}, \Delta h, \Delta v, \Delta \text{range}, \Delta \omega, \text{alt}, \text{escape angle} \right] $$

This compressed state space reduces computational complexity while focusing on key decision factors for the JUYE UAV.

Algorithm Design

Our GC-PPO algorithm combines GRU-based temporal feature extraction with PPO-Clip in a hierarchical framework. The policy generation layer uses GC-PPO to produce high-level strategies, while the control layer employs PID controllers to convert these strategies into control commands. This approach addresses the challenges of controlling 6-DOF Unmanned Aerial Vehicles and ensures stable flight after high-maneuver actions.

The algorithm workflow is as follows. First, the current state $s_t$ and previous state $s_{t-1}$ are stacked to form a sequence. A GRU fuses this sequence and extracts hidden temporal features, producing a fused state $s_{\text{fused}}$. The New Actor network then generates a high-level policy, outputting the mean and standard deviation of actions such as the ideal roll angle, altitude, and speed. These actions are sampled from a normal distribution:

$$ a_{\text{ideal}} \sim \mathcal{N}(\mu, \sigma) $$

The Critic network evaluates the value of the fused state. PID controllers translate the high-level actions into real control commands, which are executed by the environment. Experiences are stored in a replay buffer, and when the buffer reaches a certain size, the algorithm updates the Actor and Critic networks using advantage estimates and Cosine Annealing for learning rate adjustment.
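A condensed PyTorch sketch of this policy-generation layer is given below; the class name, layer sizes, and the three-dimensional action head are illustrative assumptions rather than the exact network used in the paper.

```python
import torch
import torch.nn as nn

class GCActorCritic(nn.Module):
    """Sketch of the GC-PPO policy layer: GRU fusion of stacked states,
    a Gaussian actor head for high-level commands, and a critic head."""

    def __init__(self, state_dim: int, action_dim: int = 3, hidden_dim: int = 256):
        super().__init__()
        self.gru = nn.GRU(state_dim, hidden_dim, batch_first=True)
        self.mu = nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.Tanh(),
                                nn.Linear(hidden_dim, action_dim))
        self.log_std = nn.Parameter(torch.zeros(action_dim))
        self.value = nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.Tanh(),
                                   nn.Linear(hidden_dim, 1))

    def forward(self, state_seq: torch.Tensor):
        # state_seq: (batch, seq_len, state_dim), e.g. the stacked [s_{t-1}, s_t]
        _, h_n = self.gru(state_seq)
        s_fused = h_n[-1]                              # fused temporal features
        dist = torch.distributions.Normal(self.mu(s_fused), self.log_std.exp())
        a_ideal = dist.sample()                        # ideal roll, altitude, speed
        return a_ideal, dist, self.value(s_fused)
```

The sampled high-level action would then be passed to the PID controllers described above, which produce the actual elevator, aileron, throttle, and rudder commands.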

The reward function incorporates Reward Shaping to mitigate the sparse reward problem. We use potential-based reward shaping, which provides intermediate rewards based on changes in height, speed, and relative distance. The step reward is defined as:

$$ R_{\text{step}} = \lambda_H R_H + \lambda_V R_V + \lambda_d R_d $$

where $R_H$, $R_V$, and $R_d$ are rewards for height, velocity, and distance, respectively, and $\lambda_H$, $\lambda_V$, $\lambda_d$ are coefficients. The height reward encourages the Unmanned Aerial Vehicle to stay within safe altitudes, the velocity reward maintains optimal speed, and the distance reward promotes increasing separation from the missile. The final reward includes terminal conditions for success or failure, such as being hit or escaping.

Table 2: Reward Function Components for the Unmanned Aerial Vehicle ($h$ in m, $v$ in m/s)

| Component | Formula |
| --- | --- |
| Height reward | $R_H = \begin{cases} -5 & \text{if } h < 2000 \\ 35 - 2e^{-(h-2000)/300} & \text{if } 2000 \leq h < 4000 \\ 3 & \text{if } 4000 \leq h < 8000 \\ 35 - 2e^{-(10000-h)/300} & \text{if } 8000 \leq h < 10000 \\ -10 & \text{if } h \geq 10000 \end{cases}$ |
| Velocity reward | $R_V = \begin{cases} -35 & \text{if } v < 100 \\ 35 - 2e^{-(v-100)/200} & \text{if } v \geq 100 \end{cases}$ |
| Distance reward | $R_d = \begin{cases} 1 & \text{if } \dot{d} > \dot{d}_{\text{prev}} \\ -1 & \text{if } \dot{d} < \dot{d}_{\text{prev}} \end{cases}$ |

The overall reward function is:

$$ R = \begin{cases} \lambda_H R_H + \lambda_V R_V + \lambda_d R_d & \text{during simulation} \\ \lambda_H R_H + \lambda_V R_V + \lambda_d R_d + R_D & \text{at simulation end} \end{cases} $$

where $R_D$ is the terminal reward based on outcomes like avoidance success or failure. This reward structure guides the JUYE UAV to learn effective avoidance strategies quickly.
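Putting the pieces together, a minimal sketch of how the shaped and terminal rewards combine; the default coefficient values are placeholders, not the tuned weights used in training.

```python
def step_reward(r_h: float, r_v: float, r_d: float,
                terminal_reward: float = 0.0, done: bool = False,
                lam_h: float = 1.0, lam_v: float = 1.0, lam_d: float = 1.0) -> float:
    """Weighted shaping reward; the terminal reward R_D is added only at episode end."""
    reward = lam_h * r_h + lam_v * r_v + lam_d * r_d
    if done:
        reward += terminal_reward   # e.g. bonus for successful evasion, penalty if hit
    return reward
```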

Simulation Experiments

We design simulation experiments to validate our GC-PPO algorithm in 1v1 BVR air combat scenarios. The hardware and software environments are configured as follows:

Table 3: Simulation Environment Configuration

| Category | Configuration |
| --- | --- |
| Operating System | Windows 10 Pro 22H2 |
| CPU | Intel(R) Core(TM) i9-13900KF |
| GPU | NVIDIA GeForce RTX 3060 |
| Memory | 32 GB |
| Flight Simulator | JSBSim |
| Programming Language | Python |
| Deep Learning Framework | PyTorch 2.0.1 |

The Unmanned Aerial Vehicle model is based on the F-16 6-DOF dynamics, with initial positions and velocities randomly generated within specified ranges. The enemy aircraft launches a missile when the Unmanned Aerial Vehicle enters its range. The goal is to evade the missile while maintaining stable flight. Key parameters include a combat area of 50 km × 50 km, altitude range of 1000 m to 10000 m, and missile damage radius of 80 m.

We compare GC-PPO with other algorithms: PPO without reward shaping, PPO with reward shaping, and hierarchical PPO without GRU. The neural network parameters are consistent across experiments, with 3 layers, 256 hidden units, actor and critic learning rates of 0.0003, discount factor of 0.99, and batch size of 64.
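The actor and critic optimizers use these learning rates together with Cosine Annealing; a minimal PyTorch sketch of this setup follows, where the placeholder networks and the `T_max` value are assumptions rather than values taken from the experiments.

```python
import torch
import torch.nn as nn

# Placeholder networks standing in for the actor and critic described above.
actor = nn.Sequential(nn.Linear(8, 256), nn.Tanh(), nn.Linear(256, 3))
critic = nn.Sequential(nn.Linear(8, 256), nn.Tanh(), nn.Linear(256, 1))

actor_optim = torch.optim.Adam(actor.parameters(), lr=3e-4)    # actor lr = 0.0003
critic_optim = torch.optim.Adam(critic.parameters(), lr=3e-4)  # critic lr = 0.0003

# Cosine Annealing schedules for the learning rates; T_max is an assumed horizon.
actor_sched = torch.optim.lr_scheduler.CosineAnnealingLR(actor_optim, T_max=300_000)
critic_sched = torch.optim.lr_scheduler.CosineAnnealingLR(critic_optim, T_max=300_000)

# actor_sched.step() and critic_sched.step() are called after each policy update.
```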

First, we determine the optimal GRU sequence window length by testing windows of 2, 3, 5, and 10 frames. The training average reward curves show that a 2-frame window achieves the fastest convergence, discovering an effective strategy around 80,000 episodes and converging by 210,000 episodes. Longer windows slow down training due to increased computational cost. The frames per second (FPS) achieved with different window lengths are:

Table 4: FPS for Different GRU Window Lengths

| Window Length | FPS |
| --- | --- |
| 2 | 332 |
| 3 | 287 |
| 5 | 221 |
| 10 | 140 |

Thus, we use a 2-frame window for GC-PPO. In training over 300,000 episodes, GC-PPO shows a steady increase in average reward, converging faster than other algorithms. The average reward curves indicate that GC-PPO reaches higher rewards earlier, while other methods plateau or converge slowly. The average miss distance, which should exceed the missile damage radius of 80 m for successful avoidance, is significantly higher for GC-PPO, demonstrating its effectiveness.

For example, the average miss distance for GC-PPO exceeds 100 m after 180,000 episodes, ensuring survival. In contrast, other algorithms struggle to maintain safe distances. Trajectory analysis in 3D views shows that the Unmanned Aerial Vehicle performs high-maneuver actions like diving and turning to escape missile pursuit, leveraging the PID control for stability.

Multi-Scenario Validation

To test generalization, we evaluate the trained agent in three distinct scenarios with different initial conditions. The parameters for each scenario are:

Table 5: Multi-Scenario Experiment Parameters

| Scenario | Friendly Initial Heading (°) | Enemy Initial Heading (°) | Friendly Initial Altitude (m) | Enemy Initial Altitude (m) | Friendly Initial Velocity (m/s) | Enemy Initial Velocity (m/s) |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 254 | 0 | 4000 | 7000 | 250 | 288 |
| 2 | 0 | 0 | 7983 | 7188 | 250 | 251 |
| 3 | 40 | 0 | 7686 | 7705 | 250 | 286 |

In Scenario 1, the Unmanned Aerial Vehicle starts at a low altitude with the enemy above, creating a disadvantaged situation. The agent successfully performs a 180-degree turn at low altitude to evade the missile. In Scenario 2, with the enemy directly behind, the agent dives and turns to increase separation. In Scenario 3, a more balanced scenario, the agent chooses smoother maneuvers to maintain stability while avoiding the missile. These results highlight the robustness and adaptability of our GC-PPO algorithm for the JUYE UAV in various BVR air combat contexts.

Conclusion

We propose a GC-PPO algorithm within a hierarchical framework to address missile avoidance in BVR air combat for 6-DOF Unmanned Aerial Vehicles. The integration of GRU for temporal feature extraction and PID for control stabilization effectively handles the challenges of high coupling and control difficulties. Reward Shaping accelerates convergence by providing intermediate guidance, and Cosine Annealing optimizes the learning process.

Simulation results demonstrate that GC-PPO outperforms traditional PPO and other variants in convergence speed, generalization, and survival rate. The Unmanned Aerial Vehicle achieves stable flight and successful missile evasion across diverse scenarios. Future work will focus on reducing computational costs while maintaining performance, possibly through more efficient network architectures or distributed training. This approach has significant implications for enhancing the autonomy and survivability of Unmanned Aerial Vehicles like the JUYE UAV in complex air combat environments.
