In recent years, the falling cost of micro-sensors and embedded computing devices has driven the widespread adoption of quadcopters across many fields. Owing to their high integration, low loss, ease of operation, vertical take-off and landing capability, and ability to perform rapid, agile maneuvers, quadcopters are widely used in agricultural plant protection, aerial photography, and material transport. Their flexibility, autonomy, and potential for mass deployment rely heavily on the underlying attitude control technology. As a critical component of safe and reliable flight, the quality of the flight controller directly determines the control responsiveness and applicability of the quadcopter. Research on quadcopter control primarily covers linear control, nonlinear control, and intelligent control. Linear control is the most common and simplest approach to control law design, but it guarantees only local stability; the main linear methods are Proportional-Integral-Derivative (PID) control, H∞ control, and Linear Quadratic Regulator (LQR) control. Nonlinear control accounts for the nonlinearity of the quadcopter system, with sliding mode control, backstepping, and various adaptive controls as its main methods.
Traditional control methods struggle to deliver stable and precise control in practical aircraft applications. Intelligent control refers to control systems implemented with artificial intelligence methods and is an effective approach for highly nonlinear, strongly coupled systems. To address robustness, some researchers use neural networks to approximate the unknown nonlinear parts of the aircraft model or external disturbances, ensuring robustness under model parameter uncertainty and input saturation. Others employ neural-network-based intelligent control systems that learn the quadcopter's dynamics model online for specific trajectory navigation. Such online methods allow the aircraft to adapt in real time to external disturbances and unknown environments, but maintaining stability throughout the initial learning phase is difficult. To overcome this shortcoming, offline learning can be used to build an initially stable controller. Supervised learning can train intelligent flight controllers offline; its drawback is that the training data may not follow the same distribution as data from the actual environment, so the resulting model may not fully capture the underlying system dynamics. In summary, supervised learning is not an ideal solution for interactive problems such as control.
Existing control methods used on aircraft have certain limitations in practical applications. Another class of methods in machine learning that can model continuous control problems is reinforcement learning. Reinforcement learning is an efficient approximate optimization method and a powerful tool for solving optimal control problems of complex nonlinear systems. This method does not require prior knowledge of the aircraft’s dynamics model and can effectively construct optimal control policies. To achieve precise control of the aircraft and avoid extensive trial-and-error iterations, some studies integrate reinforcement learning with classical controllers to achieve simple, easy-to-use, and rapidly converging control methods. Others propose online adaptive controller design methods for handling optimal control problems of flight control systems with unknown dynamics. Some researchers use deep reinforcement learning algorithms to train fixed-wing flight controllers, achieving performance comparable to PID controllers in terms of overshoot and response speed. Additionally, some propose reinforcement learning-based neural controller designs for quadcopters to perform full take-off and landing operations under varying wind conditions. To address the instability and asymmetry issues of traditional reinforcement learning, some introduce symmetric actors and critics, integrating the intrinsic properties of the system into the neural network architecture and overcoming small-angle limitations.
In summary, reinforcement learning has advantages such as strong adaptability and the ability to handle complex nonlinear systems, but it still faces issues like discrete output actions and model convergence difficulties. Outputting continuous actions increases the action sampling space, leading to model convergence challenges. Meanwhile, existing flight controllers have defects such as large overshoot, poor fast response capability, and complexity in manual parameter tuning. To solve these problems, this study focuses on the underlying attitude control of quadcopters, combining the flight characteristics of quadcopters to develop a simulation model, and uses the Proximal Policy Optimization (PPO) model-free deep reinforcement learning algorithm to train a quadcopter attitude controller. Experimental results show that the attitude controller trained with the PPO algorithm has smaller tracking errors, lower overshoot, and faster response speeds, verifying the feasibility and effectiveness of this intelligent controller.

The movement of a quadcopter has six degrees of freedom: three rotational degrees of freedom [p, q, r] and three translational degrees of freedom [x, y, z]. The input to the motion control system is the rotational speed of the four motors. The configuration of the motors determines the aerodynamic effects of the quadcopter. Common configuration forms are divided into “+” configuration and “X” configuration. This study adopts an “X” configured quadcopter, which has better stability compared to the “+” configuration. To achieve stable flight, the control signals for each motor need to be calculated to realize flight control. The effect of motor speed changes on the lift and Euler angles of the quadcopter is represented by the following equation:
$$ u_f = K_T (\omega_1^2 + \omega_2^2 + \omega_3^2 + \omega_4^2) $$
$$ u_\phi = K_T (-\omega_1^2 - \omega_2^2 + \omega_3^2 + \omega_4^2) $$
$$ u_\theta = K_T (-\omega_1^2 + \omega_2^2 + \omega_3^2 - \omega_4^2) $$
$$ u_\psi = K_Q (-\omega_1^2 + \omega_2^2 - \omega_3^2 + \omega_4^2) $$
where $\omega_i$ ($i = 1, 2, 3, 4$) is the angular velocity of each motor, $u_f$, $u_\phi$, $u_\theta$, $u_\psi$ are the lift, roll, pitch, and yaw effects, respectively, $K_T$ is the thrust coefficient, and $K_Q$ is the motor reaction-torque coefficient. Motor rotation generates lift $f$, and motor speed directly affects the attitude Euler angles: roll angle $\phi$, pitch angle $\theta$, and yaw angle $\psi$. The torque acting on the quadcopter is:
$$ \begin{bmatrix} \tau_\phi \\ \tau_\theta \\ \tau_\psi \end{bmatrix} = \begin{bmatrix} l u_\phi / \sqrt{2} \\ l u_\theta / \sqrt{2} \\ u_\psi \end{bmatrix} $$
where $l$ is the arm length of the frame, $\tau_\phi$, $\tau_\theta$, $\tau_\psi$ are the three-axis torques. To maintain stable flight, quadcopters are usually equipped with an inertial navigation module consisting of a three-axis gyroscope and a three-axis accelerometer. This module can calculate the aircraft’s attitude relative to the ground, as well as the three-axis angular velocity and acceleration. The flight control computing board regulates the motor signals through electronic speed controllers, controlling the motor speed to ensure the rotors generate appropriate force. Changes in the speed of the four motors can alter the rotor speed, thereby regulating the lift and controlling the attitude and position of the quadcopter.
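As a concrete illustration of the mixing relations above, the following is a minimal Python sketch; the arm length value is a placeholder, and the two coefficients anticipate the identification results reported later in this paper.

```python
import numpy as np

# Minimal sketch of the X-configuration mixing relations above.
K_T = 1.920e-6   # thrust coefficient, identified later in the paper
K_Q = 1.781e-7   # reaction-torque coefficient, identified later in the paper
L_ARM = 0.15     # arm length l in metres (placeholder value)

def motor_effects(omega):
    """Map the four motor angular velocities to lift and attitude effects."""
    w2 = np.square(omega)                              # element-wise ω_i²
    u_f     = K_T * w2.sum()                           # total lift
    u_phi   = K_T * (-w2[0] - w2[1] + w2[2] + w2[3])   # roll effect
    u_theta = K_T * (-w2[0] + w2[1] + w2[2] - w2[3])   # pitch effect
    u_psi   = K_Q * (-w2[0] + w2[1] - w2[2] + w2[3])   # yaw effect
    return u_f, u_phi, u_theta, u_psi

def body_torques(u_phi, u_theta, u_psi, l=L_ARM):
    """Three-axis torques for the X configuration."""
    return np.array([l * u_phi / np.sqrt(2),
                     l * u_theta / np.sqrt(2),
                     u_psi])
```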
The pitch and roll motions of the aircraft are produced by thrust differences, and the yaw motion by torque differences; changes in motor speed therefore drive the motion of the quadcopter (in the accompanying figure, red arrows indicate relatively higher motor speeds and blue arrows relatively lower ones). For precise control, an electronic prototype model was built. The geometric structure of the aircraft is crucial to flight performance. Balancing computational speed against modeling accuracy, the prototype body model was partially abstracted and simplified. The body consists of the following main parts: frame, motors, battery, and flight control components. Modeling software was then used to generate the corresponding mesh files, which were imported into Gazebo for simulation. To determine the inertia matrix of each component, an electronic scale was used to measure the mass $m$ of each component. Assuming uniform mass distribution within each component, the volume, centroid, and moment of inertia $I'$ of each 3D component were computed. The computed moment of inertia $I'$ must be rescaled by length and density to obtain the physical moment of inertia $I$, calculated as follows:
$$ I = I' \cdot \text{unit\_scale}^2 \cdot \frac{m}{V} $$
where $V$ is the volume, $I'$ is the computed unit moment of inertia, and $\text{unit\_scale}$ is the length scaling factor.
| Component | Mass (g) |
|---|---|
| Battery | 184.2 |
| Frame | 82.0 |
| Motor | 33.0 |
| Propeller | 5.3 |
| Flight Control Component | 106.0 |
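Given the measured masses above and the scaling relation for the moment of inertia, a minimal sketch of the rescaling step might look as follows; the millimetre-to-metre unit_scale and the CAD quantities are assumptions used only for illustration.

```python
def scale_inertia(I_unit, mass, volume, unit_scale=0.001):
    """Rescale a unit-density CAD inertia tensor to physical units.

    I_unit     : 3x3 inertia computed by the modeling tool for the mesh
    mass       : measured component mass (kg)
    volume     : mesh volume in the modeling tool's units
    unit_scale : length scaling from mesh units to metres (assumed mm -> m here)
    """
    density = mass / volume            # uniform-density assumption
    return I_unit * unit_scale**2 * density

# e.g. the 82 g frame: I = scale_inertia(I_frame_cad, 0.082, V_frame_cad)
```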
To achieve attitude control, the angular velocity of the quadcopter must be measured, so the Inertial Measurement Unit (IMU) sensor needs to be modeled. To improve simulation fidelity, real-world sensor noise must be considered, and this study adds random noise sampled from a Gaussian distribution to the sensor plugin. To reflect real-world noise more accurately, the mean and variance of the Gaussian noise of the experimental aircraft's three-axis IMU were measured directly. The measurement procedure was as follows: fix the aircraft on a ground stand, arm it on level ground, sample a set of IMU data every 100 ms, and collect 5977 samples in total. Because the aircraft is stationary, the IMU readings can be treated as pure sensor noise. Histograms of the three-axis noise show that it approximately follows a normal distribution. The measured noise data were fitted to normal distributions, yielding the mean and variance of each axis's noise, and the resulting noise parameters are passed to the simulation environment through the environment configuration file.
| Axis | Mean $\mu$ | Standard Deviation $\sigma$ |
|---|---|---|
| Roll (°/s) | -0.071 | 0.14 |
| Pitch (°/s) | 0.64 | 0.14 |
| Yaw (°/s) | 0.29 | 0.13 |
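A short sketch of how the bench noise model could be fitted and then injected into the simulated gyro readings, using the measured values above; the function names are illustrative and not the Gazebo plugin interface.

```python
import numpy as np

def fit_gyro_noise(samples):
    """Fit a per-axis Gaussian to stationary gyro samples.

    samples: array of shape (N, 3) with roll/pitch/yaw rates in °/s,
             e.g. the 5977 samples logged at 100 ms intervals on the bench.
    """
    return samples.mean(axis=0), samples.std(axis=0)

def add_gyro_noise(omega_true, rng=np.random.default_rng()):
    """Corrupt a true body-rate measurement with the fitted noise model."""
    mu = np.array([-0.071, 0.64, 0.29])     # measured means, °/s
    sigma = np.array([0.14, 0.14, 0.13])    # measured standard deviations, °/s
    return omega_true + rng.normal(mu, sigma)
```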
The motor model uses blade element theory to model the thrust and torque of the propeller. The performance of the propeller is determined by $C_T$ and $C_Q$. $C_T$ can be calculated by:
$$ C_T = \frac{T}{\rho n^2 D^4} $$
where $T$ is the thrust, $\rho$ is the air density, $n$ is the rotational speed of the propeller, and $D$ is the diameter of the propeller. $C_Q$ can be calculated by:
$$ C_Q = \frac{Q}{\rho n^2 D^5} $$
where $Q$ is the torque; $\rho$, $n$, $D$ have the same meanings as in equation (4). $C_T$ and $C_Q$ are functions of the advance ratio, which quantifies the relationship between the forward speed of the quadcopter and the rotational linear speed of the propeller tip, and can be calculated by:
$$ J = \frac{V_\infty}{n D} $$
where $V_\infty$ is the free-stream velocity. Under static conditions (no forward airspeed), $V_\infty = 0$ and the advance ratio $J = 0$. When each propeller rotates at angular speed $\omega$, the thrust it generates is:
$$ T(\omega) = K_T \omega^2 $$
where $K_T$ is the thrust coefficient, which is determined by the geometric characteristics of the propeller, air density, and other factors. The specific calculation is given by:
$$ K_T = \frac{C_T \rho D^4}{(2\pi)^2} $$
Further, the propeller torque can also be calculated by:
$$ Q(\omega) = K_Q \omega^2 $$
where $K_Q$ is the motor reaction-torque coefficient, determined by $C_Q$, the air density, and the propeller diameter $D$. Its definition is:
$$ K_Q = \frac{C_Q \rho D^5}{(2\pi)^2} $$
The motor model parameters $C_T$, $C_Q$ and $K_T$, $K_Q$ are obtained from static bench measurements and subsequent calculation. To model the power system accurately, the propeller's performance parameters were measured experimentally. The test environment was at an altitude of 44 m (Beijing) and a temperature of 20°C. The quadcopter, fitted with a single propeller, was fixed on a test stand and run at full throttle (maximum speed) to obtain the performance parameters of the power system.
| Parameter | Value |
|---|---|
| Atmospheric Density (kg/m³) | 1.1986 |
| Blade Diameter (m) | 0.133 |
| Maximum Thrust (N) | 5.39 |
| Maximum Torque (N·m) | 0.5 |
| Maximum RPM (r/min) | 16000 |
Combining equations (4), (5), (8), (10), $K_T = 1.920 \times 10^{-6}$ and $K_Q = 1.781 \times 10^{-7}$ can be calculated. Subsequently, the position, geometric properties, and dynamic characteristics of the aircraft model are simulated in the Gazebo environment. The established aircraft model is shown in the figure, with red, green, and blue circles representing the roll, pitch, and yaw motions of the quadcopter, respectively.
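For reference, the coefficient identification can be reproduced from the bench data in the table above with a few lines of Python; this is a sketch of the calculation, not the authors' code.

```python
import math

# Static bench measurements from the table above
rho   = 1.1986            # air density, kg/m^3
D     = 0.133             # propeller diameter, m
T_max = 5.39              # maximum thrust, N
Q_max = 0.5               # maximum torque, N·m
n_max = 16000 / 60        # maximum speed, rev/s

# Static (J = 0) thrust and torque coefficients
C_T = T_max / (rho * n_max**2 * D**4)
C_Q = Q_max / (rho * n_max**2 * D**5)

# Convert to the per-(rad/s)^2 coefficients used in T = K_T ω² and Q = K_Q ω²
K_T = C_T * rho * D**4 / (2 * math.pi)**2   # ≈ 1.92e-6
K_Q = C_Q * rho * D**5 / (2 * math.pi)**2   # ≈ 1.78e-7

print(f"C_T={C_T:.4f}, C_Q={C_Q:.4f}, K_T={K_T:.3e}, K_Q={K_Q:.3e}")
```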
The theoretical basis of reinforcement learning is the Markov Decision Process (MDP), which is a mathematical description used to solve sequential decision-making problems. The first step in building a deep reinforcement learning attitude controller for a quadcopter is to model the quadcopter angular velocity control using the Markov Decision Process $M = (S, A, P, R, \gamma)$, including the state space $S$, action space $A$, state transition probability $P$, reward function $R$, and discount factor $\gamma$.
The design of the state space considers the quadcopter's relevant sensor parameters. Candidate sensor quantities include IMU data, electronic speed controller data, position, battery status, and so on. This study selects the state as $x_t = [e_t, \Delta e_t]$, where $e_t$ is the error between the measured and target angular velocities at time $t$: $e_t = \Omega - \Omega^*$ ($\Omega$: gyroscope measurement; $\Omega^*$: target signal command), and $\Delta e_t$ is the change in the measured angular-velocity error between times $t-1$ and $t$: $\Delta e_t = e_t - e_{t-1}$. At time $t$, the agent's state is therefore the six-dimensional vector $[e_\phi, e_\theta, e_\psi, \Delta e_\phi, \Delta e_\theta, \Delta e_\psi]$. The agent receives the three-axis angular velocity from the IMU, computes the three-axis tracking error, and makes attitude-signal adjustment decisions accordingly.
In manual mode, the $\Omega^*$ signal comes from the remote controller; in automatic flight mode, the $\Omega^*$ signal comes from the automatic flight command of the onboard computer, and the $\Omega$ data is read by the onboard gyroscope. It is worth noting that to simplify calculations, the state space only includes the above state information and does not include other sensor parameters and real physical world environmental parameters. Therefore, the agent only partially represents the environment, and the incomplete perception of the state is an important source of the gap between the simulation environment and real flight.
The design of the action space needs to consider the action execution method and capability of the agent. Based on the flight dynamics characteristics of the quadcopter, this study directly uses the motor throttle amount as the action output. The agent processes the state information through the neural network to make decisions, and this action is converted into an electrical signal value $y_t$ acting on the electronic speed controller. This signal is a four-dimensional vector $[y_0, y_1, y_2, y_3]$, representing the percentage of the maximum output on each electronic speed controller. This signal directly acts on the motor, controlling the speed of the quadcopter, and the output speed is between 0 and the maximum speed.
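A minimal sketch of the state construction and the action-to-ESC mapping described above; the function names and the clipping convention are assumptions for illustration.

```python
import numpy as np

def build_state(omega_meas, omega_target, e_prev):
    """Six-dimensional state [e_phi, e_theta, e_psi, Δe_phi, Δe_theta, Δe_psi]."""
    e = np.asarray(omega_meas) - np.asarray(omega_target)   # tracking error e_t
    delta_e = e - e_prev                                     # Δe_t = e_t - e_{t-1}
    return np.concatenate([e, delta_e]), e                   # state, plus e_t for the next step

def action_to_esc(action, omega_max):
    """Map the 4-D policy output (fraction of maximum ESC output) to motor speeds."""
    y = np.clip(action, 0.0, 1.0)       # saturate to the admissible [0, 1] range
    return y * omega_max
```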
In a deep reinforcement learning training task, the agent's rewards and actions form a feedback loop, and the reward value directly guides the agent's action output. Based on the control principles of the quadcopter and experimental experience, this study designs the reward function with the following considerations. In the attitude control task, the first consideration is the angular-rate tracking error reward $r_t$, which penalizes the deviation between the desired signal and the actually tracked signal. We want this deviation to be as small as possible, so this reward term is negative:
$$ r_t = -\alpha_1 \cdot (e_\phi^2 + e_\theta^2 + e_\psi^2) $$
In addition, high-frequency signal oscillations can cause rapid changes in the control output current, resulting in heat accumulation and damage to the motor. Therefore, it is also desired that the output signal of the controller has small oscillations between different time steps and has sequentiality of control actions, introducing an oscillation reward $r_o$.
$$ r_o = -\alpha_2 \cdot \max(|\Delta y|) $$
where $\Delta y$ is the change in the output signal. The same attitude can be achieved by different control signals. It is necessary to make the output control signal as small as possible to save power. In actual control, excess control may also cause attitude changes, so a minimum output reward $r_{\text{min}}$ is introduced.
$$ r_{\text{min}} = \alpha_3 \cdot (1 – \bar{y}) $$
where $\bar{y}$ is the average of the agent's four-dimensional action output. Since each output value is no greater than 1, this term measures the remaining average available output; the larger the remaining margin, the greater the reward. The controller can easily produce extremely large control commands at a step change of the reference signal, because it tries to track the change as fast as possible through maximum output. Since the neural network output is inherently stochastic, the controller may occasionally command values at the upper limit of the actual output. To address this, an oversaturation reward $r_{\text{over}}$ is introduced.
$$ r_{\text{over}} = -\alpha_4 \sum_{i=1}^{4} \mathbb{1}\left[ y_i > 1 \right] $$
Furthermore, through the analysis of historical training data, due to the introduction of too many negative rewards, the controller tends to fall into a “passive” state where no signal is output, to obtain a 0 reward. To address this situation, a passive reward $r_n$ is introduced: when the three-axis angular velocity desired signals are not all zero, if there are more than two signal values in the output signal vector that are 0, a negative reward with a larger absolute value is given; during the attitude control phase, it should not happen that all output signals are 1. If all four values of the controller’s output signal vector are 1, it can also be considered that the controller is passively responding to the desired signal, and a negative reward with a larger absolute value is also given.
$$ r_n = -\alpha_5 \cdot \mathbb{1}\left[ \text{passive output as described above} \right] $$
In summary, the total reward is the sum of the above rewards, as shown in equation (16):
$$ R = r_t + r_o + r_{\text{min}} + r_{\text{over}} + r_n $$
where $\alpha = [\alpha_1, \alpha_2, \alpha_3, \alpha_4, \alpha_5]$ are adjustable hyperparameters.
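The composite reward could be implemented as a single function; the sketch below follows the five terms above, with the passivity test written per the description. The exact thresholds and implementation details are assumptions.

```python
import numpy as np

def reward(e, y, y_prev, target_nonzero, alpha):
    """Total reward R = r_t + r_o + r_min + r_over + r_n for one time step.

    e              : three-axis angular-rate error
    y              : current 4-D controller output (fraction of max ESC output)
    y_prev         : previous output, used for the oscillation term
    target_nonzero : True when the desired rates are not all zero
    alpha          : [α1..α5] weighting hyperparameters
    """
    r_t    = -alpha[0] * np.sum(e**2)                 # tracking-error penalty
    r_o    = -alpha[1] * np.max(np.abs(y - y_prev))   # output-oscillation penalty
    r_min  =  alpha[2] * (1.0 - np.mean(y))           # reward for spare output margin
    r_over = -alpha[3] * np.count_nonzero(y > 1.0)    # oversaturation penalty
    # "Passive" output: more than two channels at 0 while a command is active,
    # or all four channels saturated at 1.
    passive = (target_nonzero and np.count_nonzero(y == 0.0) > 2) or np.all(y == 1.0)
    r_n    = -alpha[4] * float(passive)
    return r_t + r_o + r_min + r_over + r_n
```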
Based on the above reinforcement learning formulation of the quadcopter attitude-signal control task, this section describes the framework of the deep reinforcement learning flight control algorithm. A core part of agent training is building a physics simulation environment that reproduces the actual task scenario. GymFC is an open-source flight controller training environment that exchanges data with the quadcopter model in Gazebo, reducing the reality gap. At the start of each training episode, the agent issues an environment reset command. At time $t$, the agent observes the state $s_t$ from the simulation environment E, and this state is passed to the GymFC reinforcement learning environment. The agent outputs the action $a_t$ from the current policy network; the action is converted into a percentage electrical signal $y_t$ applied to the electronic speed controllers, which sets the motor speeds and thereby controls the quadcopter attitude.
Subsequently, the agent obtains a reward value in the environment, which represents the quality of the action: the larger this value, the better the behavior under the given state conditions, and the more effectively and accurately the behavior can track the target signal. As the training rounds iterate, the reward obtained by the controller gradually increases, ultimately achieving the goal of accurate and stable angular rate tracking for the quadcopter. Overall, the intelligent controller needs to complete the following tasks during the training process: 1) Read the target angular velocity of the remote control command. 2) Read the current angular velocity of the onboard gyroscope. 3) Intelligent controller state evaluation and signal output. 4) Transmit the signal to the electronic speed controller for speed control.
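The interaction loop described above follows the standard Gym pattern; a schematic sketch is given below. The environment id and the stand-in random policy are illustrative assumptions, not the actual GymFC configuration used in this work.

```python
import gym
import numpy as np

env = gym.make("AttitudeControl-v0")     # hypothetical GymFC task id

def policy(state):
    """Stand-in for the PPO actor network: random 4-D throttle fractions."""
    return np.random.uniform(0.0, 1.0, size=4)

for episode in range(3):                 # a few episodes for illustration
    state = env.reset()                  # agent issues the environment reset command
    done, total_reward = False, 0.0
    while not done:
        action = policy(state)                         # decision from the current policy
        state, reward, done, info = env.step(action)   # ESC signals applied in Gazebo
        total_reward += reward
    print(f"episode {episode}: return {total_reward:.2f}")
```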
This study designs a quadcopter attitude controller based on Proximal Policy Optimization (PPO). The PPO algorithm is widely used in academia and industry due to its excellent performance, ease of implementation, and simple debugging. The PPO algorithm proposes a new objective function, enabling mini-batch updates over multiple training steps, solving the problem of difficult step size determination in policy gradient algorithms. PPO achieves a new balance in ease of implementation, sample complexity, and debugging effort. Each iteration of PPO not only minimizes the loss function but also attempts to compute a new policy while ensuring that the deviation between the new policy and the old policy from the previous iteration remains within a relatively small range. PPO not only performs well in continuous control tasks but is also simpler to implement, more general, and has better sample complexity.
PPO is built on two networks, the actor network and the critic network, and is a typical actor-critic algorithm. Define the state-action trajectory generated by the agent's interaction with the environment as $\tau = \{s_1, a_1, s_2, a_2, \dots\}$, and denote the objective function for each update of the policy $\pi$ as $\eta(\pi_\theta)$; the objective function can then be written as:
$$ \eta(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^{\infty} \gamma^t r(s_t) \right] $$
According to the definition of the advantage function, the state value function can be used to estimate the advantage function:
$$ A_{\pi}(s, a) = Q_{\pi}(s, a) - V_{\pi}(s) $$
When updating the objective function $\eta(\pi)$, the new policy’s objective function has the following relationship with the old policy’s advantage function and objective function:
$$ \eta(\pi_\theta) = \eta(\pi_{\theta_{\text{old}}}) + \mathbb{E}_{\tau \sim \pi_{\theta_{\text{old}}}} \left[ \sum_{t=0}^{\infty} \gamma^t A_{\pi_{\theta_{\text{old}}}}(s_t, a_t) \right] $$
The expectation in this objective function is difficult to calculate, so importance sampling technology is generally used to estimate this expectation. The objective function $\eta(\pi_\theta)$ in equation (19) is reconstructed as:
$$ \eta(\pi_\theta) \approx \eta(\pi_{\theta_{\text{old}}}) + \sum_s \rho_{\pi_{\theta_{\text{old}}}}(s) \sum_a \pi_\theta(a|s) A_{\pi_{\theta_{\text{old}}}}(s, a) $$
where $\rho_{\pi}(s) = P(s_0 = s) + \gamma P(s_1 = s) + \gamma^2 P(s_2 = s) + \cdots$ is the discounted state visitation frequency, and $r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\text{old}}}(a_t|s_t)}$ is the ratio of the new policy to the old policy, which reflects the magnitude of the policy update. In the reconstructed objective function, $\eta(\pi_{\theta_{\text{old}}})$ does not depend on the current policy $\pi_\theta$, and using the advantage function significantly reduces the variance of the action-value estimate. The objective function can therefore be rewritten as:
$$ \eta(\pi_\theta) = \mathbb{E}_{\tau} \left[ r_t(\theta) \hat{A}_t \right] $$
To find the optimal policy $\pi_\theta^*$, it is necessary to maximize $\eta(\pi_\theta)$. In the optimization process using policy iteration, the update magnitude of the policy gradient is unstable, and the policy may shift significantly before and after the update, deviating from the desired update direction. To solve this problem, the TRPO algorithm proposed by Schulman et al. uses KL divergence to limit the magnitude of each policy update, with the constraint goal to limit the KL divergence between the new and old policies within a certain threshold range. This constraint can ensure steady policy updates, preventing the updates from diverging. The process of finding the optimal policy can be recorded as:
$$ \begin{aligned}
& \text{Maximize} \quad \hat{\eta}(\pi_\theta) = \mathbb{E}_{\tau} \left[ r_t(\theta) \hat{A}_t \right] \\
& \text{s.t.} \quad \mathbb{E}_{\tau} \left[ \text{KL}[\pi_{\theta_{\text{old}}}(\cdot|s_t), \pi_\theta(\cdot|s_t)] \right] \leq \delta
\end{aligned} $$
The PPO algorithm is an improvement on the TRPO algorithm. Depending on the method of limiting the step size update, PPO can be divided into PPO-Penalty and PPO-Clip algorithms. This study subsequently uses the PPO-Penalty algorithm. An adaptive KL divergence penalty coefficient is used to limit the update magnitude, and the KL divergence is incorporated into the objective function. Its objective function is defined as:
$$ \eta_{\text{penalty}}(\pi_\theta) = \mathbb{E}_{\tau} \left[ r_t(\theta) \hat{A}_t \right] - \beta \, \mathbb{E}_{\tau} \left[ \text{KL}[\pi_{\theta_{\text{old}}}(\cdot|s_t), \pi_\theta(\cdot|s_t)] \right] $$
The penalty coefficient $\beta$ will be dynamically adjusted according to the value of the KL divergence. If the current KL divergence is greater than the KL divergence threshold, then $\beta$ is increased, making the policy update more conservative; if the current KL divergence is less than the KL divergence threshold, then $\beta$ is decreased, making the policy update more aggressive. The PPO algorithm uses a fixed update length U, each time selecting N sequences of length U, i.e., each time performing data updates for NU time steps. Its algorithm pseudocode is as follows.
PPO Algorithm
1 Input: $\theta_0$, $\mu_0$
2 Output: $\theta^*$, $\mu^*$
3 Initialize policy network and value network weights: $\theta \leftarrow \theta_0$, $\mu \leftarrow \mu_0$
4 for k = 1,2… do:
5 for u = 1, 2… do:
6 The agent interacts with the environment according to policy $\pi_k$ to collect trajectory $\tau$
7 Compute reward $\hat{R}_t$
8 Compute advantage value $\hat{A}_t$ based on current value function $V_{\mu_k}$
9 Update policy network by maximizing PPO-penalty objective function:
$\eta_{\text{penalty}}(\pi_\theta) = \mathbb{E}_{\tau} \left[ r_t(\theta) \hat{A}_t \right] - \beta \, \mathbb{E}_{\tau} \left[ \text{KL}[\pi_{\theta_{\text{old}}}(\cdot|s_t), \pi_\theta(\cdot|s_t)] \right]$
10 if $\mathbb{E}_{\tau} \left[ \text{KL}[\pi_{\theta_{\text{old}}}(\cdot|s_t), \pi_\theta(\cdot|s_t)] \right] > \text{KL}_{\text{high}}$ do:
$\beta \leftarrow \lambda \beta$
11 else if $\mathbb{E}_{\tau} \left[ \text{KL}[\pi_{\theta_{\text{old}}}(\cdot|s_t), \pi_\theta(\cdot|s_t)] \right] < \text{KL}_{\text{low}}$ do:
$\beta \leftarrow \beta / \lambda$
12 end if
13 Update value network by minimizing mean square error:
$L_V(\mu) = \sum_{t=0}^{T} (V_{\mu}(s_t) - \hat{R}_t)^2$
14 end for
15 end for
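The two update rules at the heart of the pseudocode, the PPO-penalty objective and the adaptive adjustment of β, can be sketched in PyTorch as follows; the tensor shapes and the adjustment factor are illustrative assumptions.

```python
import torch

def ppo_penalty_loss(log_prob_new, log_prob_old, advantages, kl_mean, beta):
    """Negative PPO-penalty objective (so it can be minimised by the optimiser).

    log_prob_new, log_prob_old : log π_θ(a_t|s_t) and log π_θ_old(a_t|s_t) per sample
    advantages                 : advantage estimates Â_t per sample
    kl_mean                    : mean KL[π_θ_old || π_θ] over the batch
    beta                       : current adaptive penalty coefficient
    """
    ratio = torch.exp(log_prob_new - log_prob_old)      # importance ratio r_t(θ)
    return -(ratio * advantages).mean() + beta * kl_mean

def adapt_beta(beta, kl_mean, kl_low, kl_high, factor=1.5):
    """Adaptive KL penalty: grow β when KL exceeds the upper threshold, shrink it
    when KL falls below the lower threshold (the factor 1.5 is illustrative)."""
    if kl_mean > kl_high:
        return beta * factor
    if kl_mean < kl_low:
        return beta / factor
    return beta
```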
The PPO-based reinforcement learning attitude controller uses a fully connected neural network with a six-dimensional input vector, two hidden layers, and a four-dimensional output vector, and Stochastic Gradient Descent (SGD) is used as the optimizer during controller training. For comparison, this study also uses a PID controller and a reinforcement learning attitude controller based on the Soft Actor-Critic (SAC) algorithm. The hyperparameter settings are shown in the table below.
| PPO Algorithm Parameter | Value | SAC Algorithm Parameter | Value |
|---|---|---|---|
| Total Training Steps | 10^7 | Total Training Steps | 10^7 |
| Max Time Steps T | 2048 | Max Time Steps T | 2048 |
| Number of Epochs K | 5 | Number of Epochs K | 5 |
| Batch Size M | 32 | Batch Size M | 32 |
| Reward Discount Factor $\gamma$ | 0.99 | Reward Discount Factor $\gamma$ | 0.99 |
| Penalty Factor $\lambda$ | 0.95 | Experience Buffer Size | 50000 |
| | | Soft Update Coefficient $\tau$ | 0.005 |
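A sketch of the policy network architecture described above, in PyTorch; the hidden width, activation, and learning rate are assumptions, since the paper specifies only the fully connected structure, the two hidden layers, the 6-D input, the 4-D output, and the SGD optimizer.

```python
import torch
import torch.nn as nn

class ActorNet(nn.Module):
    """Policy network: 6-D state in, 4-D throttle action out, two hidden layers."""
    def __init__(self, hidden=64):           # hidden width assumed for illustration
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(6, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, 4),
        )

    def forward(self, state):
        return self.net(state)

actor = ActorNet()
optimizer = torch.optim.SGD(actor.parameters(), lr=3e-4)  # learning rate assumed
```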
The PID-based flight attitude controller requires tuning nine parameters: the three-axis proportional, integral, and derivative gains $[K_P, K_I, K_D]$. The Ziegler–Nichols method is used to set these gains. The input to the PID controller in the simulation environment is the angular velocity error. After tuning in the simulation environment, the parameters obtained for each axis are shown in the table below.
| Axis | $K_P$ | $K_I$ | $K_D$ |
|---|---|---|---|
| Roll | 0.6 | 32 | 0.045 |
| Pitch | 2.4 | 64.62 | 0.068 |
| Yaw | 4.2 | 5 | 0.02 |
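A minimal per-axis PID sketch using the tuned gains above; the control period and the discrete-derivative form are assumptions for illustration.

```python
class RatePID:
    """Per-axis PID acting on the angular-rate error, with the gains tuned above."""
    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral, self.prev_error = 0.0, 0.0

    def update(self, error):
        self.integral += error * self.dt
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

# One controller per axis, using the tuned gains from the table above
dt = 0.001                                  # control period assumed for illustration
roll_pid  = RatePID(0.6, 32.0,  0.045, dt)
pitch_pid = RatePID(2.4, 64.62, 0.068, dt)
yaw_pid   = RatePID(4.2, 5.0,   0.02,  dt)
```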
The reward curve of the PPO controller over the first 1 million training steps is shown in the figure: in the early stage of training the agent learns quickly and the reward rises rapidly. The full PPO training curve shows the reward over the entire training process; training converges at roughly 4 million steps, after which the reward stays at a high level, meaning the agent can already track the attitude commands well. To encourage exploration and avoid local optima, the controller's output during training is drawn from a Gaussian distribution; when evaluating network performance, the controller's output is deterministic.
The reward curve of the SAC algorithm over the first 1 million training steps is shown in the figure; the reward oscillates more strongly in this early phase. The reward curve over the entire training period shows that training converges at roughly 5 million steps, after which the reward stays at a high level, meaning the agent can track the attitude commands well. Data analysis shows that the control signal output by the SAC controller oscillates strongly, but after a period of training the controller eventually achieves stable output.
After training, an error analysis is performed on the PPO, SAC, and PID controllers. Random target signals are given, each controller selects its output actions, and the motor response, acceleration, and three-axis angular velocity recorded while tracking the signal are obtained. To quantify the signal tracking performance, the following indicators are introduced. Mean Absolute Error (MAE): the average of the absolute differences between the observed values and the average value $E$, calculated by:
$$ \text{MAE} = \frac{1}{n} \sum_{i=1}^{n} |x_i - E| $$
Mean Square Error (MSE): the average of the squared differences between the observed values and the average value $E$, calculated by:
$$ \text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (x_i - E)^2 $$
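For the comparison below, the two indicators reduce to a few NumPy operations over the tracking deviation; the sketch treats $E$ as the reference signal being tracked, which is an assumption of this illustration.

```python
import numpy as np

def tracking_metrics(x, reference):
    """MAE and MSE of one axis's angular-rate tracking deviation."""
    diff = np.asarray(x) - np.asarray(reference)
    return np.mean(np.abs(diff)), np.mean(diff**2)
```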
The analysis results of the mean absolute error and mean square error of the PPO, SAC, and PID controllers are shown in the table below, where p, q, r represent the roll, pitch, and yaw angular velocity errors, respectively.
| Controller | Metric | p | q | r | Average |
|---|---|---|---|---|---|
| PPO | MAE | 0.33 | 0.56 | 0.92 | 0.60 |
| PPO | MSE | 2.11 | 9.05 | 18.69 | 9.95 |
| SAC | MAE | 7.98 | 4.55 | 6.56 | 6.36 |
| SAC | MSE | 578.87 | 85.43 | 156.08 | 273.46 |
| PID | MAE | 7.08 | 4.71 | 7.58 | 6.26 |
| PID | MSE | 644.31 | 281.29 | 647.83 | 524.48 |
The PPO controller tracks the target signal accurately and quickly, with an average absolute error of only 0.6 (°)/s across the three-axis angular-velocity signals; the SAC controller's error is larger than the PPO controller's, but it can still complete the attitude tracking task; the PID controller has the largest mean absolute error, and its output signal deviates persistently from the reference signal.
In this section, we evaluate and compare the performance of the three attitude controllers in the simulation environment. Pulse step signals are used to evaluate the controllers, giving the same input signal to different controllers to observe the responses of different controllers and compare their performance indicators. When the final output signal of the controller falls within ±10% of the target signal, it can be considered that the controller has successfully regulated. To evaluate and compare the different characteristics of the controllers, when the controller successfully regulates, the following performance indicators of the controller are introduced: rise time, settling time, overshoot, and steady-state error.
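The four indicators can be extracted from a recorded step response roughly as follows; rise time is taken here as the time to first reach 90% of the step, the last 10 samples are averaged for the steady-state value, and a positive step is assumed, since the paper does not state its exact definitions.

```python
import numpy as np

def step_response_metrics(t, y, target, band=0.10):
    """Rise time, settling time, overshoot (%), and steady-state error (%)
    for a recorded response y(t) to a positive step of size `target`."""
    t, y = np.asarray(t), np.asarray(y)
    # Rise time: first time the response reaches 90 % of the target
    rise_time = t[np.argmax(y >= 0.9 * target)] - t[0]
    # Settling time: first time after which the response stays within the ±band
    outside = np.abs(y - target) > band * target
    settle_idx = (len(y) - np.argmax(outside[::-1])) if outside.any() else 0
    settling_time = t[min(settle_idx, len(t) - 1)] - t[0]
    # Overshoot and steady-state error as percentages of the target
    overshoot = max(0.0, (y.max() - target) / target * 100.0)
    steady_state_error = abs(y[-10:].mean() - target) / target * 100.0
    return rise_time, settling_time, overshoot, steady_state_error
```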
The control performance of the three controllers is plotted in the figure (p, q, r denote the angular velocities about the roll, pitch, and yaw axes, respectively). To compare the controllers more intuitively, the interval from 0.49 s to 1.53 s is magnified in the figure, with black dashed lines marking the upper and lower bounds of the target angular-velocity signal that count as successful regulation. All three controllers achieve successful regulation on all three axes. Tables 7 to 10 give the quantitative rise time, settling time, overshoot, and steady-state error of the three controllers over the period from the signal step (512 ms) to 1530 ms.
| Controller | p/ms | q/ms | r/ms | Average/ms |
|---|---|---|---|---|
| PPO | 34 | 25 | 57 | 38.7 |
| SAC | 70 | 74 | 194 | 112.7 |
| PID | 82 | 64 | 105 | 86.7 |
| Controller | p/ms | q/ms | r/ms | Average/ms |
|---|---|---|---|---|
| PPO | 87 | 90 | 57 | 78 |
| SAC | 348 | 480 | 273 | 363 |
| PID | 227 | 208 | 105 | 180 |
| Controller | p/% | q/% | r/% | Average/% |
|---|---|---|---|---|
| PPO | 54.8 | 40.3 | 4.2 | 33.1 |
| SAC | 67.4 | 112.7 | 27 | 69 |
| PID | 80.3 | 104.9 | 4.98 | 63.4 |
| Controller | p/% | q/% | r/% | Average/% |
|---|---|---|---|---|
| PPO | 0.07 | 0.54 | 1.56 | 0.72 |
| SAC | 2.75 | 4.68 | 5.32 | 4.25 |
| PID | 0.75 | 0.26 | 1.17 | 0.73 |
The PPO controller has excellent overall performance and responds very quickly: compared with the PID controller, its average rise time is reduced by 55.4% and its settling time by 56.7%. The PID controller shows large overshoot in the roll and pitch directions, whereas the PPO controller's overshoot is relatively small on all three axes, averaging about 33.1%, which is 47.8% lower than PID. In terms of steady-state error, PPO and PID are similar, both remaining at a low level. The SAC controller largely completes the control task, but its output signal oscillates repeatedly, with long settling and rise times: its average rise time is 1.3 times that of the PID controller, and its settling time is 1.6 times. The SAC controller has smaller overshoot than PID in roll, and its three-axis average overshoot is similar to PID's, but it maintains a large steady-state error.
In conclusion, this study addresses quadcopter shortcomings such as complex manual parameter tuning, large control overshoot, and poor fast-response capability during aggressive maneuvers or under adverse meteorological disturbances by introducing a deep reinforcement learning control method. First, a quadcopter model is established; then the underlying control method based on deep reinforcement learning is investigated, and PPO and SAC controllers are designed and analyzed for error. Pulse step signals are then used to evaluate the three controllers and compare their performance indicators. Simulation results show that, compared with the PID and SAC controllers, the PPO controller has the shortest rise time and settling time and the smallest overshoot, maintains relatively stable output, and achieves accurate, rapid attitude changes and stable flight of the quadcopter.
