Quadrotor Flight Control Using KPPO Algorithm

In recent years, the rapid advancement of intelligent unmanned systems has led to the widespread deployment of robots in high-risk and complex scenarios, such as industrial inspection, emergency rescue, and military reconnaissance. Among aerial platforms, quadrotor unmanned aerial vehicles (UAVs) have demonstrated remarkable capabilities in applications like aerial remote sensing and disaster detection, owing to their autonomous target search abilities. However, existing flight control systems often struggle with adaptability in dynamic and uncertain environments, limiting their performance under sudden disturbances. This challenge underscores the need for robust control algorithms that enhance system resilience and fault tolerance. In this context, we propose an improved proximal policy optimization (PPO) algorithm, termed KPPO, which integrates a composite regularization mechanism combining a threshold-triggered Kullback-Leibler (KL) divergence penalty and an L2 regularization term. Our approach addresses the slow convergence and environmental adaptation issues of traditional PPO in quadrotor flight control tasks. Through extensive physical simulations, we validate that the KPPO algorithm enables rapid policy convergence and effective decision-making in complex environments, significantly improving task execution efficiency and performance.

The fundamental flight principles of a quadrotor UAV rely on adjusting the rotational speeds of its four rotors to control attitude and position. Based on the Newton-Euler equations, the quadrotor generates lift and moments through rotor rotation. The total lift force $F$ is the sum of the individual rotor forces: $$F = F_1 + F_2 + F_3 + F_4$$ where $F_i = C_T \cdot \omega_i^2$, with $C_T$ the thrust coefficient and $\omega_i$ the angular velocity of the $i$-th rotor. Attitude control (pitch, roll, and yaw) is achieved through differential rotor speeds: pitch control varies the speeds of the front and rear rotors, while roll control adjusts the left and right rotors.

The dynamics and kinematics of the quadrotor are described by position and attitude equations. The position dynamics in the inertial frame are: $$\begin{align*} a_x &= -\frac{F}{m} (\cos\psi \sin\theta \cos\phi + \sin\psi \sin\phi) \\ a_y &= -\frac{F}{m} (\sin\psi \sin\theta \cos\phi - \cos\psi \sin\phi) \\ a_z &= g - \frac{F}{m} \cos\phi \cos\theta \end{align*}$$ where $\phi$, $\theta$, and $\psi$ are the roll, pitch, and yaw angles, respectively, and $m$ is the mass. Under small-angle approximations ($p \approx \dot{\phi}$, $q \approx \dot{\theta}$, $r \approx \dot{\psi}$), the attitude dynamics are: $$\begin{align*} \ddot{\phi} &= \frac{1}{I_{xx}} (\tau_x + qr (I_{yy} - I_{zz}) - J_1 q \Omega) \\ \ddot{\theta} &= \frac{1}{I_{yy}} (\tau_y + pr (I_{zz} - I_{xx}) + J_1 p \Omega) \\ \ddot{\psi} &= \frac{1}{I_{zz}} (\tau_z + qp (I_{xx} - I_{yy})) \end{align*}$$ Here, $I_{xx}$, $I_{yy}$, $I_{zz}$ are the moments of inertia, $\tau_x$, $\tau_y$, $\tau_z$ are the external moments, $p$, $q$, $r$ are the body angular velocities, $J_1$ is the rotor inertia, and $\Omega$ is the residual rotor angular speed.

The control system of a quadrotor comprises sensors (e.g., IMU, barometer, GPS), a control unit that processes sensor data with algorithms such as KPPO, and actuators (motors and drivers) that adjust rotor speeds based on PWM signals.
The motor commands are computed as: $$\begin{align*} \text{Motor}_{\text{right1}} &= \text{Thrust}_{\text{cmd}} + \text{Yaw}_{\text{cmd}} + \text{Pitch}_{\text{cmd}} + \text{Roll}_{\text{cmd}} \\ \text{Motor}_{\text{left1}} &= \text{Thrust}_{\text{cmd}} - \text{Yaw}_{\text{cmd}} + \text{Pitch}_{\text{cmd}} - \text{Roll}_{\text{cmd}} \\ \text{Motor}_{\text{right2}} &= \text{Thrust}_{\text{cmd}} - \text{Yaw}_{\text{cmd}} - \text{Pitch}_{\text{cmd}} + \text{Roll}_{\text{cmd}} \\ \text{Motor}_{\text{left2}} &= \text{Thrust}_{\text{cmd}} + \text{Yaw}_{\text{cmd}} - \text{Pitch}_{\text{cmd}} - \text{Roll}_{\text{cmd}} \end{align*}$$ This structure allows the quadrotor to achieve stable flight and adapt to environmental changes.
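The mixing equations above translate directly into code. As a minimal sketch, the function below maps the four commands to the four motor outputs; the motor ordering follows the subscripts in the equations, and any scaling or saturation of the commands is left out:

```python
import numpy as np

def mix_motors(thrust_cmd, yaw_cmd, pitch_cmd, roll_cmd):
    """Map thrust/attitude commands to the four motor commands,
    following the mixing equations above."""
    return np.array([
        thrust_cmd + yaw_cmd + pitch_cmd + roll_cmd,  # Motor_right1
        thrust_cmd - yaw_cmd + pitch_cmd - roll_cmd,  # Motor_left1
        thrust_cmd - yaw_cmd - pitch_cmd + roll_cmd,  # Motor_right2
        thrust_cmd + yaw_cmd - pitch_cmd - roll_cmd,  # Motor_left2
    ])
```

A pure thrust command drives all four motors equally, while a pure yaw, pitch, or roll command produces the antisymmetric pattern visible in the signs.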

Deep reinforcement learning (DRL) has emerged as a powerful model-free optimization paradigm for robot control, leveraging deep neural networks to represent high-dimensional features. In quadrotor flight control, both value-based and policy-based algorithms are commonly used. We focus on policy-based methods, particularly PPO and the deep deterministic policy gradient (DDPG). PPO ensures stable policy updates by limiting the change between consecutive policies. A variant, PPO-penalty (PPO1), uses an adaptive KL penalty to dynamically adjust the trust region. The objective function for PPO1 is: $$J^{\theta_k}_{\text{PPO}}(\theta) = J^{\theta_k}(\theta) - \beta \, \text{KL}(\theta, \theta_k)$$ where $J^{\theta_k}(\theta) = \sum_{(s_t,a_t)} \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_k}(a_t|s_t)} A^{\theta_k}(s_t, a_t)$, with $A$ the advantage function. The KL divergence $\text{KL}(\theta, \theta_k)$ is computed as the expectation of the log ratio of the two policies, and $\beta$ is adjusted according to whether the KL divergence exceeds a threshold. In contrast, DDPG employs a deterministic policy gradient framework with actor and critic networks. The actor outputs a deterministic action $a = \mu_\theta(s)$ with added noise for exploration: $$a_{\text{explore}} = \mu_\theta(s) + \mathcal{N}(0, \sigma^2)$$ The critic minimizes the temporal-difference error: $$L_{\text{Critic}}(\phi) = \mathbb{E}_{(s,a,r,s') \sim D} \left[ (r + \gamma Q_{\bar{\phi}}(s', \mu_{\bar{\theta}}(s')) - Q_\phi(s,a))^2 \right]$$ where $D$ is the replay buffer and $\gamma$ is the discount factor. The actor loss is: $$L_{\text{Actor}}(\theta) = -\mathbb{E}_{s \sim D} [Q_\phi(s, \mu_\theta(s))]$$ Target networks are updated via soft updates: $\bar{\theta} \leftarrow \tau \theta + (1-\tau)\bar{\theta}$ and $\bar{\phi} \leftarrow \tau \phi + (1-\tau)\bar{\phi}$. While DDPG excels in sample efficiency, it can suffer from convergence issues in complex quadrotor environments.
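The soft target update can be sketched with plain parameter arrays. This is a simplification, not the paper's implementation: networks are represented as lists of NumPy arrays, and $\tau = 0.001$ matches the soft-update rate listed in Table 2:

```python
import numpy as np

def soft_update(target_params, online_params, tau=0.001):
    """Polyak averaging: theta_bar <- tau * theta + (1 - tau) * theta_bar.
    Each network is a list of parameter arrays (a simplification)."""
    return [(1.0 - tau) * t + tau * s
            for t, s in zip(target_params, online_params)]
```

With such a small $\tau$, the target networks trail the online networks slowly, which stabilizes the bootstrap targets in the critic loss.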

Our proposed KPPO algorithm enhances the PPO framework by incorporating a gated KL divergence penalty and L2 regularization. The key innovation is the threshold-triggered KL penalty, which activates only when the KL divergence between the old and new policies exceeds 0.01, preventing excessive policy shifts while allowing necessary exploration. The KL divergence is computed as: $$D_{\text{KL}} = \mathbb{E} \left[ \log \pi_{\theta_{\text{old}}}(a_t|s_t) - \log \pi_\theta(a_t|s_t) \right]$$ The KL penalty term is defined as: $$L_{\text{KL}} = \alpha \cdot \max(D_{\text{KL}} - \delta, 0)$$ where $\alpha$ is a hyperparameter controlling the penalty strength and $\delta = 0.01$ is the threshold. Additionally, we introduce an L2 regularization term to prevent overfitting and stabilize the parameters: $$L_{L2} = \sum_i \|\theta_i\|^2$$ The policy loss uses a minimum constraint to ensure stable updates: $$L_{\text{policy}} = \mathbb{E} \left[ -\min(r_t A_t, A_t) \right]$$ where $r_t = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\text{old}}}(a_t|s_t)}$ is the probability ratio and $A_t = R_t - V(s_t)$ is the advantage, with $R_t$ the cumulative discounted reward. The value loss is: $$L_{\text{value}} = \mathbb{E} \left[ (V(s_t) - R_t)^2 \right]$$ The total loss combines these components: $$L = L_{\text{policy}} + \beta L_{L2} + L_{\text{KL}} + L_{\text{value}}$$ where $\beta$ is the L2 regularization coefficient. This composite regularization mechanism enhances the quadrotor's adaptability and robustness in dynamic environments.
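Putting the pieces together, the composite loss can be sketched numerically as below. This is an illustrative sketch, not the authors' code: the KL-penalty coefficient (0.2), L2 coefficient (1e-5), and threshold (0.01) come from Table 2, and policy parameters are represented as plain arrays:

```python
import numpy as np

def kppo_loss(logp_old, logp_new, advantages, returns, values,
              params, kl_coef=0.2, l2_coef=1e-5, delta=0.01):
    """Composite KPPO loss following the equations above (sketch)."""
    ratio = np.exp(logp_new - logp_old)                    # r_t
    l_policy = -np.mean(np.minimum(ratio * advantages, advantages))
    l_value = np.mean((values - returns) ** 2)             # (V - R)^2
    d_kl = np.mean(logp_old - logp_new)                    # E[log pi_old - log pi]
    l_kl = kl_coef * max(d_kl - delta, 0.0)                # gated: zero below threshold
    l2 = sum(np.sum(p ** 2) for p in params)               # sum_i ||theta_i||^2
    return l_policy + l2_coef * l2 + l_kl + l_value
```

When the new policy equals the old one, the ratio is 1 and the gated KL term vanishes, so the loss reduces to the policy and value terms plus the (tiny) L2 penalty.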

We evaluate the KPPO algorithm through physical simulations of five distinct quadrotor flight control tasks. The simulation environment is based on the QuadDynamics class, which models quadrotor dynamics using rigid-body equations. The state vector is 12-dimensional: $s = [x, y, z, \phi, \theta, \psi, v_x, v_y, v_z, \omega_x, \omega_y, \omega_z]^T$, encompassing position, attitude angles, linear velocities, and angular velocities. The action space consists of four continuous values representing rotor thrusts. The dynamics update using Euler integration: $$\begin{align*} p_{t+1} &= p_t + v_t \Delta t + \frac{1}{2} a_t \Delta t^2 \\ v_{t+1} &= v_t + a_t \Delta t \\ \phi_{t+1} &= \phi_t + \omega_t \Delta t + \frac{1}{2} \alpha_t \Delta t^2 \\ \omega_{t+1} &= \omega_t + \alpha_t \Delta t \end{align*}$$ where accelerations $a$ and angular accelerations $\alpha$ are derived from rotor thrusts and moments. The physical parameters used in simulations are summarized in Table 1.
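The translational accelerations and the Euler update above can be sketched as one integration step. Mass, gravity, and time step come from Table 1; the thrust coefficient `C_T` is an assumed illustrative value, and this is a simplified stand-in for the QuadDynamics class, which is not reproduced here:

```python
import numpy as np

G, M, DT = 9.81, 0.665, 0.02  # gravity, mass, time step (Table 1)
C_T = 1e-5                    # assumed thrust coefficient (illustrative)

def total_thrust(omegas, c_t=C_T):
    """Total lift F = sum_i C_T * omega_i^2."""
    return c_t * np.sum(np.square(omegas))

def inertial_accel(F, phi, theta, psi, m=M, g=G):
    """Translational accelerations from the position-dynamics equations."""
    ax = -(F / m) * (np.cos(psi) * np.sin(theta) * np.cos(phi)
                     + np.sin(psi) * np.sin(phi))
    ay = -(F / m) * (np.sin(psi) * np.sin(theta) * np.cos(phi)
                     - np.cos(psi) * np.sin(phi))
    az = g - (F / m) * np.cos(phi) * np.cos(theta)
    return np.array([ax, ay, az])

def euler_step(p, v, a, ang, omega, ang_acc, dt=DT):
    """One explicit Euler update matching the four state equations above."""
    return (p + v * dt + 0.5 * a * dt ** 2,   # position
            v + a * dt,                       # linear velocity
            ang + omega * dt + 0.5 * ang_acc * dt ** 2,  # attitude
            omega + ang_acc * dt)             # angular velocity
```

At hover (level attitude with $F = mg$) the accelerations vanish and the state advances only by its current velocities, as expected.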

Table 1: Physical Parameters for Quadrotor Simulation
| Parameter | Value | Unit |
|---|---|---|
| Gravity Acceleration | 9.81 | m/s² |
| Time Step | 0.02 | s |
| Quadrotor Mass | 0.665 | kg |
| Rotor Arm Length | 0.105 | m |
| Moment of Inertia | [0.0023, 0.0025, 0.0037] | kg·m² |

The five simulation tasks are designed to test various aspects of quadrotor control:
1. Vehicle Tracking Task: The quadrotor aims to reach and maintain a target position. The reward function is: $$\text{reward} = – (\text{err}_d + \text{err}_v)$$ where $\text{err}_d = w_1 \cdot (\text{dist\_err} + z_{\text{err}})$ and $\text{err}_v = w_2 \cdot (v_{\text{err}} + v_{z_{\text{err}}})$, with weights $w_1$ and $w_2$.
2. Hover Task: The quadrotor must stabilize at a target position and attitude. The reward is: $$\text{reward} = – (w_1 \cdot \text{pos\_err} + w_2 \cdot \text{angle\_err})$$
3. Height Control Task: The quadrotor controls its altitude to match a target height. The reward is: $$\text{reward} = -w_1 \cdot z_{\text{err}}$$
4. Speed Control Task: The quadrotor maintains a target velocity. The reward is: $$\text{reward} = -w_1 \cdot \text{speed\_err}$$
5. Multi-Target Tracking Task: The quadrotor tracks multiple targets simultaneously. The reward is: $$\text{reward} = -w_1 \cdot \sum_i \text{dist\_err}_i$$
Additional rewards are given for task completion, and penalties for failures. The training parameters for KPPO, PPO, and DDPG are standardized across tasks for fair comparison, as shown in Table 2.
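The per-task reward formulas above can be sketched as follows. The weights `w1`, `w2` default to 1 and the error terms are assumed to be precomputed non-negative scalars; the task-completion bonuses and failure penalties from Table 2 are omitted for brevity:

```python
import numpy as np

def tracking_reward(dist_err, z_err, v_err, vz_err, w1=1.0, w2=1.0):
    """Task 1: position error term plus velocity error term, negated."""
    return -(w1 * (dist_err + z_err) + w2 * (v_err + vz_err))

def hover_reward(pos_err, angle_err, w1=1.0, w2=1.0):
    """Task 2: weighted position and attitude errors, negated."""
    return -(w1 * pos_err + w2 * angle_err)

def height_reward(z_err, w1=1.0):
    """Task 3: altitude error only."""
    return -w1 * z_err

def speed_reward(speed_err, w1=1.0):
    """Task 4: velocity error only."""
    return -w1 * speed_err

def multi_target_reward(dist_errs, w1=1.0):
    """Task 5: sum of distance errors over all targets."""
    return -w1 * float(np.sum(dist_errs))
```

All five rewards are maximal (zero) exactly when the respective errors vanish, so the agent is driven toward the target state in each task.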

Table 2: Training Parameters for KPPO, PPO, and DDPG Algorithms
| Parameter | KPPO/PPO Value | DDPG Value |
|---|---|---|
| Mini-batch Size | 64 | 64 |
| Training Epochs | 20 | N/A |
| Discount Factor | 0.99 | 0.99 |
| Learning Rate | 0.0001 | 0.0001 |
| Optimizer Coefficients | (0.9, 0.999) | N/A |
| Task Reward | 300 | N/A |
| Extra Reward | 1000 | N/A |
| Penalty | -1 | N/A |
| Max Training Steps | 1000 | N/A |
| Max Steps per Episode | 400 | N/A |
| Model Update Interval | 400 | N/A |
| KL Penalty Coefficient | 0.2 | N/A |
| L2 Regularization Coefficient | 1e-5 | N/A |
| Random Seed | 10 | N/A |
| State Dimension | 12 | 12 |
| Action Dimension | 4 | 4 |
| KL Divergence Threshold | 0.01 | N/A |
| Target Network Soft Update | N/A | 0.001 |
| Action Noise | N/A | 0.1 |

Simulation results demonstrate the superiority of KPPO over PPO and DDPG across multiple tasks. In the speed control task, KPPO achieves higher reward values and faster convergence, as shown in Figure 1 (reward per episode). DDPG initially performs well but plateaus due to limited exploration, while KPPO’s gated KL penalty enables sustained improvement. Similarly, in the hover task, KPPO exhibits significant reward fluctuations early on, indicating active exploration, but stabilizes at higher performance levels. The height control task reveals that KPPO, though slower initially, surpasses other algorithms after approximately 600 episodes, leveraging its regularization mechanisms for robust learning. For multi-target tracking, DDPG shows higher rewards due to its off-policy sample efficiency, but KPPO maintains consistent policy convergence without the declines observed in PPO. The tracking task highlights KPPO’s gradual reward improvement, reflecting enhanced adaptability in complex environments.

To quantify performance, we analyze the average reward over multiple episodes (avg_running_reward), which smooths out random fluctuations. In speed control and hover tasks, KPPO’s average reward consistently outperforms PPO and DDPG. In height control, KPPO’s average reward initially lags but exceeds others after 600 episodes, showcasing its long-term optimization capability. The standard deviation of rewards, summarized in Table 3, indicates that KPPO has higher variability due to exploratory behavior, which is beneficial in complex quadrotor scenarios. In contrast, DDPG’s low deviation suggests confinement to local optima.
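The avg_running_reward curve described here is a trailing-window mean over episode rewards. A minimal sketch is below; the window length of 100 episodes is an assumption, not a value stated in the text:

```python
import numpy as np

def avg_running_reward(rewards, window=100):
    """Trailing-window mean of per-episode rewards; smooths out the
    random fluctuations visible in the raw reward curves."""
    rewards = np.asarray(rewards, dtype=float)
    return np.array([rewards[max(0, i - window + 1): i + 1].mean()
                     for i in range(len(rewards))])
```

The same array of episode rewards also yields the standard deviations reported in Table 3 via `np.std`.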

Table 3: Total Average Reward and Standard Deviation Comparison
| Task | KPPO Average | PPO Average | DDPG Average | KPPO Std Dev | PPO Std Dev | DDPG Std Dev |
|---|---|---|---|---|---|---|
| Height Control | -1322.185 | -3472.665 | -364.295 | 4031.546 | 11.052 | 0.647 |
| Multi-Target Tracking | -3542.908 | -3577.797 | -432.059 | 11.771 | 143.894 | 6.944 |
| Speed Control | 8879.512 | 3242.066 | 1504.431 | 11845.580 | 8519.012 | 3671.278 |
| Hover Task | -90.670 | -255.520 | -320.119 | 1764.684 | 686.389 | 27.569 |
| Tracking Task | -478.866 | -481.655 | -497.700 | 220.065 | 218.798 | 8.875 |

Trajectory analysis further validates KPPO’s effectiveness. In tasks like speed control and hovering, the quadrotor under KPPO control quickly converges to the target within tolerance bounds, ending episodes early. For instance, in hovering, KPPO achieves the goal fastest, while DDPG and PPO show similar, slower convergence. In height control, multi-target tracking, and single-target tracking, the quadrotor fails to reach the target within 400,000 steps, but KPPO’s trajectories exhibit reduced dispersion and faster stabilization compared to PPO and DDPG. The 3D trajectory plots, though overlapping, reveal that KPPO enables more efficient exploration and quicker entry into stable states. This is attributed to the threshold-triggered KL divergence, which balances exploration and exploitation, and the L2 regularization that prevents parameter divergence.

In conclusion, our KPPO algorithm significantly enhances quadrotor flight control by integrating a composite regularization mechanism. The threshold-triggered KL divergence penalty constrains policy updates to avoid instability, while the L2 regularization term improves parameter robustness. Physical simulations across five tasks confirm that KPPO achieves faster convergence, higher task completion rates, and better adaptation to complex environments than PPO and DDPG. For future work, we plan to incorporate prioritized experience replay to boost sample efficiency and employ TD(λ) methods for more stable reward estimation. These improvements will further solidify KPPO’s applicability in real-world quadrotor operations, advancing autonomous control in dynamic scenarios.
