Quadrotor Flight Control Using KPPO Algorithm

In recent years, the rapid advancement of intelligent unmanned systems has led to the widespread deployment of robots in high-risk and complex scenarios, such as industrial inspection, emergency rescue, and military reconnaissance. Among aerial platforms, quadrotor unmanned aerial vehicles (UAVs) have demonstrated remarkable capabilities in applications like aerial remote sensing and disaster detection, owing to their autonomous target search abilities. However, existing flight control systems often struggle with adaptability in dynamic and uncertain environments, limiting their performance under sudden disturbances. This challenge underscores the need for robust control algorithms that enhance system resilience and fault tolerance. In this context, we propose an improved proximal policy optimization (PPO) algorithm, termed KPPO, which integrates a composite regularization mechanism combining a threshold-triggered Kullback-Leibler (KL) divergence penalty and an L2 regularization term. Our approach addresses the slow convergence and environmental adaptation issues of traditional PPO in quadrotor flight control tasks. Through extensive physical simulations, we validate that the KPPO algorithm enables rapid policy convergence and effective decision-making in complex environments, significantly improving task execution efficiency and performance.

The fundamental flight principles of a quadrotor UAV rely on adjusting the rotational speeds of its four rotors to control attitude and position. Based on the Newton-Euler equations, the quadrotor generates lift and moments through rotor rotation. The total lift force $F$ is the sum of the individual rotor forces: $$F = F_1 + F_2 + F_3 + F_4$$ where $F_i = C_T \cdot \omega_i^2$, with $C_T$ the thrust coefficient and $\omega_i$ the angular velocity of the $i$-th rotor. Attitude control (pitch, roll, and yaw) is achieved through differential rotor speeds: pitch control varies the speeds of the front and rear rotors, while roll control adjusts the left and right rotors.

The dynamics and kinematics of the quadrotor are described by position and attitude equations. The position dynamics in the inertial frame are: $$\begin{align*} a_x &= -\frac{F}{m} (\cos\psi \sin\theta \cos\phi + \sin\psi \sin\phi) \\ a_y &= -\frac{F}{m} (\sin\psi \sin\theta \cos\phi - \cos\psi \sin\phi) \\ a_z &= g - \frac{F}{m} \cos\phi \cos\theta \end{align*}$$ where $\phi$, $\theta$, and $\psi$ are the roll, pitch, and yaw angles, respectively, and $m$ is the mass. Under small-angle approximations ($p \approx \dot{\phi}$, $q \approx \dot{\theta}$, $r \approx \dot{\psi}$), the attitude dynamics are: $$\begin{align*} \ddot{\phi} &= \frac{1}{I_{xx}} (\tau_x + qr (I_{yy} - I_{zz}) - J_1 q \Omega) \\ \ddot{\theta} &= \frac{1}{I_{yy}} (\tau_y + pr (I_{zz} - I_{xx}) + J_1 p \Omega) \\ \ddot{\psi} &= \frac{1}{I_{zz}} (\tau_z + qp (I_{xx} - I_{yy})) \end{align*}$$ Here, $I_{xx}$, $I_{yy}$, $I_{zz}$ are the moments of inertia, $\tau_x$, $\tau_y$, $\tau_z$ are the external moments, $p$, $q$, $r$ are the body angular velocities, $J_1$ is the rotor inertia, and $\Omega$ is the residual rotor angular speed.

The control system of a quadrotor comprises sensors (e.g., IMU, barometer, GPS), a control unit that processes sensor data with algorithms such as KPPO, and actuators (motors and drivers) that adjust rotor speeds based on PWM signals.
The motor commands are computed as: $$\begin{align*} \text{Motor}_{\text{right1}} &= \text{Thrust}_{\text{cmd}} + \text{Yaw}_{\text{cmd}} + \text{Pitch}_{\text{cmd}} + \text{Roll}_{\text{cmd}} \\ \text{Motor}_{\text{left1}} &= \text{Thrust}_{\text{cmd}} - \text{Yaw}_{\text{cmd}} + \text{Pitch}_{\text{cmd}} - \text{Roll}_{\text{cmd}} \\ \text{Motor}_{\text{right2}} &= \text{Thrust}_{\text{cmd}} - \text{Yaw}_{\text{cmd}} - \text{Pitch}_{\text{cmd}} + \text{Roll}_{\text{cmd}} \\ \text{Motor}_{\text{left2}} &= \text{Thrust}_{\text{cmd}} + \text{Yaw}_{\text{cmd}} - \text{Pitch}_{\text{cmd}} - \text{Roll}_{\text{cmd}} \end{align*}$$ This structure allows the quadrotor to achieve stable flight and adapt to environmental changes.
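The mixing equations above translate directly into code. As a minimal sketch, the function below maps the four commands to the four motor outputs; the motor ordering follows the subscripts in the equations, and any scaling or saturation of the commands is left out:

```python
import numpy as np

def mix_motors(thrust_cmd, yaw_cmd, pitch_cmd, roll_cmd):
    """Map thrust/attitude commands to the four motor commands,
    following the mixing equations above."""
    return np.array([
        thrust_cmd + yaw_cmd + pitch_cmd + roll_cmd,  # Motor_right1
        thrust_cmd - yaw_cmd + pitch_cmd - roll_cmd,  # Motor_left1
        thrust_cmd - yaw_cmd - pitch_cmd + roll_cmd,  # Motor_right2
        thrust_cmd + yaw_cmd - pitch_cmd - roll_cmd,  # Motor_left2
    ])
```

A pure thrust command drives all four motors equally, while a pure yaw, pitch, or roll command produces the antisymmetric pattern visible in the signs.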

Deep reinforcement learning (DRL) has emerged as a powerful model-free optimization paradigm for robot control, leveraging deep neural networks to represent high-dimensional features. In quadrotor flight control, both value-based and policy-based algorithms are commonly used. We focus on policy-based methods, particularly PPO and the deep deterministic policy gradient (DDPG). PPO ensures stable policy updates by limiting the change between consecutive policies. A variant, PPO-penalty (PPO1), uses an adaptive KL penalty to dynamically adjust the trust region. The objective function for PPO1 is: $$J^{\theta_k}_{\text{PPO}}(\theta) = J^{\theta_k}(\theta) - \beta \, \text{KL}(\theta, \theta_k)$$ where $J^{\theta_k}(\theta) = \sum_{(s_t,a_t)} \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_k}(a_t|s_t)} A^{\theta_k}(s_t, a_t)$, with $A$ the advantage function. The KL divergence $\text{KL}(\theta, \theta_k)$ is computed as the expectation of the log ratio of the two policies, and $\beta$ is adjusted according to whether the KL divergence exceeds a threshold. In contrast, DDPG employs a deterministic policy gradient framework with actor and critic networks. The actor outputs a deterministic action $a = \mu_\theta(s)$ with added noise for exploration: $$a_{\text{explore}} = \mu_\theta(s) + \mathcal{N}(0, \sigma^2)$$ The critic minimizes the temporal-difference error: $$L_{\text{Critic}}(\phi) = \mathbb{E}_{(s,a,r,s') \sim D} \left[ (r + \gamma Q_{\bar{\phi}}(s', \mu_{\bar{\theta}}(s')) - Q_\phi(s,a))^2 \right]$$ where $D$ is the replay buffer and $\gamma$ is the discount factor. The actor loss is: $$L_{\text{Actor}}(\theta) = -\mathbb{E}_{s \sim D} [Q_\phi(s, \mu_\theta(s))]$$ Target networks are updated via soft updates: $\bar{\theta} \leftarrow \tau \theta + (1-\tau)\bar{\theta}$ and $\bar{\phi} \leftarrow \tau \phi + (1-\tau)\bar{\phi}$. While DDPG excels in sample efficiency, it can suffer from convergence issues in complex quadrotor environments.
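The soft target update can be sketched with plain parameter arrays. This is a simplification, not the paper's implementation: networks are represented as lists of NumPy arrays, and $\tau = 0.001$ matches the soft-update rate listed in Table 2:

```python
import numpy as np

def soft_update(target_params, online_params, tau=0.001):
    """Polyak averaging: theta_bar <- tau * theta + (1 - tau) * theta_bar.
    Each network is a list of parameter arrays (a simplification)."""
    return [(1.0 - tau) * t + tau * s
            for t, s in zip(target_params, online_params)]
```

With such a small $\tau$, the target networks trail the online networks slowly, which stabilizes the bootstrap targets in the critic loss.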

Our proposed KPPO algorithm enhances the PPO framework by incorporating a gated KL divergence penalty and L2 regularization. The key innovation is the threshold-triggered KL penalty, which activates only when the KL divergence between the old and new policies exceeds 0.01, preventing excessive policy shifts while allowing necessary exploration. The KL divergence is computed as: $$D_{\text{KL}} = \mathbb{E} \left[ \log \pi_{\theta_{\text{old}}}(a_t|s_t) - \log \pi_\theta(a_t|s_t) \right]$$ The KL penalty term is defined as: $$L_{\text{KL}} = \alpha \cdot \max(D_{\text{KL}} - \delta, 0)$$ where $\alpha$ is a hyperparameter controlling the penalty strength and $\delta = 0.01$ is the threshold. Additionally, we introduce an L2 regularization term to prevent overfitting and stabilize the parameters: $$L_{L2} = \sum_i \|\theta_i\|^2$$ The policy loss uses a minimum constraint to ensure stable updates: $$L_{\text{policy}} = \mathbb{E} \left[ -\min(r_t A_t, A_t) \right]$$ where $r_t = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\text{old}}}(a_t|s_t)}$ is the probability ratio and $A_t = R_t - V(s_t)$ is the advantage, with $R_t$ the cumulative discounted reward. The value loss is: $$L_{\text{value}} = \mathbb{E} \left[ (V(s_t) - R_t)^2 \right]$$ The total loss combines these components: $$L = L_{\text{policy}} + \beta L_{L2} + L_{\text{KL}} + L_{\text{value}}$$ where $\beta$ is the L2 regularization coefficient. This composite regularization mechanism enhances the quadrotor's adaptability and robustness in dynamic environments.
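Putting the pieces together, the composite loss can be sketched numerically as below. This is an illustrative sketch, not the authors' code: the KL-penalty coefficient (0.2), L2 coefficient (1e-5), and threshold (0.01) come from Table 2, and policy parameters are represented as plain arrays:

```python
import numpy as np

def kppo_loss(logp_old, logp_new, advantages, returns, values,
              params, kl_coef=0.2, l2_coef=1e-5, delta=0.01):
    """Composite KPPO loss following the equations above (sketch)."""
    ratio = np.exp(logp_new - logp_old)                    # r_t
    l_policy = -np.mean(np.minimum(ratio * advantages, advantages))
    l_value = np.mean((values - returns) ** 2)             # (V - R)^2
    d_kl = np.mean(logp_old - logp_new)                    # E[log pi_old - log pi]
    l_kl = kl_coef * max(d_kl - delta, 0.0)                # gated: zero below threshold
    l2 = sum(np.sum(p ** 2) for p in params)               # sum_i ||theta_i||^2
    return l_policy + l2_coef * l2 + l_kl + l_value
```

When the new policy equals the old one, the ratio is 1 and the gated KL term vanishes, so the loss reduces to the policy and value terms plus the (tiny) L2 penalty.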

We evaluate the KPPO algorithm through physical simulations of five distinct quadrotor flight control tasks. The simulation environment is based on the QuadDynamics class, which models quadrotor dynamics using rigid-body equations. The state vector is 12-dimensional: $s = [x, y, z, \phi, \theta, \psi, v_x, v_y, v_z, \omega_x, \omega_y, \omega_z]^T$, encompassing position, attitude angles, linear velocities, and angular velocities. The action space consists of four continuous values representing rotor thrusts. The dynamics update using Euler integration: $$\begin{align*} p_{t+1} &= p_t + v_t \Delta t + \frac{1}{2} a_t \Delta t^2 \\ v_{t+1} &= v_t + a_t \Delta t \\ \phi_{t+1} &= \phi_t + \omega_t \Delta t + \frac{1}{2} \alpha_t \Delta t^2 \\ \omega_{t+1} &= \omega_t + \alpha_t \Delta t \end{align*}$$ where accelerations $a$ and angular accelerations $\alpha$ are derived from rotor thrusts and moments. The physical parameters used in simulations are summarized in Table 1.
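The translational accelerations and the Euler update above can be sketched as one integration step. Mass, gravity, and time step come from Table 1; the thrust coefficient `C_T` is an assumed illustrative value, and this is a simplified stand-in for the QuadDynamics class, which is not reproduced here:

```python
import numpy as np

G, M, DT = 9.81, 0.665, 0.02  # gravity, mass, time step (Table 1)
C_T = 1e-5                    # assumed thrust coefficient (illustrative)

def total_thrust(omegas, c_t=C_T):
    """Total lift F = sum_i C_T * omega_i^2."""
    return c_t * np.sum(np.square(omegas))

def inertial_accel(F, phi, theta, psi, m=M, g=G):
    """Translational accelerations from the position-dynamics equations."""
    ax = -(F / m) * (np.cos(psi) * np.sin(theta) * np.cos(phi)
                     + np.sin(psi) * np.sin(phi))
    ay = -(F / m) * (np.sin(psi) * np.sin(theta) * np.cos(phi)
                     - np.cos(psi) * np.sin(phi))
    az = g - (F / m) * np.cos(phi) * np.cos(theta)
    return np.array([ax, ay, az])

def euler_step(p, v, a, ang, omega, ang_acc, dt=DT):
    """One explicit Euler update matching the four state equations above."""
    return (p + v * dt + 0.5 * a * dt ** 2,   # position
            v + a * dt,                       # linear velocity
            ang + omega * dt + 0.5 * ang_acc * dt ** 2,  # attitude
            omega + ang_acc * dt)             # angular velocity
```

At hover (level attitude with $F = mg$) the accelerations vanish and the state advances only by its current velocities, as expected.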

Table 1: Physical Parameters for Quadrotor Simulation
| Parameter | Value | Unit |
|---|---|---|
| Gravity Acceleration | 9.81 | m/s² |
| Time Step | 0.02 | s |
| Quadrotor Mass | 0.665 | kg |
| Rotor Arm Length | 0.105 | m |
| Moment of Inertia | [0.0023, 0.0025, 0.0037] | kg·m² |

The five simulation tasks are designed to test various aspects of quadrotor control:
1. Vehicle Tracking Task: The quadrotor aims to reach and maintain a target position. The reward function is: $$\text{reward} = – (\text{err}_d + \text{err}_v)$$ where $\text{err}_d = w_1 \cdot (\text{dist\_err} + z_{\text{err}})$ and $\text{err}_v = w_2 \cdot (v_{\text{err}} + v_{z_{\text{err}}})$, with weights $w_1$ and $w_2$.
2. Hover Task: The quadrotor must stabilize at a target position and attitude. The reward is: $$\text{reward} = – (w_1 \cdot \text{pos\_err} + w_2 \cdot \text{angle\_err})$$
3. Height Control Task: The quadrotor controls its altitude to match a target height. The reward is: $$\text{reward} = -w_1 \cdot z_{\text{err}}$$
4. Speed Control Task: The quadrotor maintains a target velocity. The reward is: $$\text{reward} = -w_1 \cdot \text{speed\_err}$$
5. Multi-Target Tracking Task: The quadrotor tracks multiple targets simultaneously. The reward is: $$\text{reward} = -w_1 \cdot \sum_i \text{dist\_err}_i$$
Additional rewards are given for task completion, and penalties for failures. The training parameters for KPPO, PPO, and DDPG are standardized across tasks for fair comparison, as shown in Table 2.
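The per-task reward formulas above can be sketched as follows. The weights `w1`, `w2` default to 1 and the error terms are assumed to be precomputed non-negative scalars; the task-completion bonuses and failure penalties from Table 2 are omitted for brevity:

```python
import numpy as np

def tracking_reward(dist_err, z_err, v_err, vz_err, w1=1.0, w2=1.0):
    """Task 1: position error term plus velocity error term, negated."""
    return -(w1 * (dist_err + z_err) + w2 * (v_err + vz_err))

def hover_reward(pos_err, angle_err, w1=1.0, w2=1.0):
    """Task 2: weighted position and attitude errors, negated."""
    return -(w1 * pos_err + w2 * angle_err)

def height_reward(z_err, w1=1.0):
    """Task 3: altitude error only."""
    return -w1 * z_err

def speed_reward(speed_err, w1=1.0):
    """Task 4: velocity error only."""
    return -w1 * speed_err

def multi_target_reward(dist_errs, w1=1.0):
    """Task 5: sum of distance errors over all targets."""
    return -w1 * float(np.sum(dist_errs))
```

All five rewards are maximal (zero) exactly when the respective errors vanish, so the agent is driven toward the target state in each task.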

Table 2: Training Parameters for KPPO, PPO, and DDPG Algorithms
| Parameter | KPPO/PPO Value | DDPG Value |
|---|---|---|
| Mini-batch Size | 64 | 64 |
| Training Epochs | 20 | N/A |
| Discount Factor | 0.99 | 0.99 |
| Learning Rate | 0.0001 | 0.0001 |
| Optimizer Coefficients | (0.9, 0.999) | N/A |
| Task Reward | 300 | N/A |
| Extra Reward | 1000 | N/A |
| Penalty | -1 | N/A |
| Max Training Steps | 1000 | N/A |
| Max Steps per Episode | 400 | N/A |
| Model Update Interval | 400 | N/A |
| KL Penalty Coefficient | 0.2 | N/A |
| L2 Regularization Coefficient | 1e-5 | N/A |
| Random Seed | 10 | N/A |
| State Dimension | 12 | 12 |
| Action Dimension | 4 | 4 |
| KL Divergence Threshold | 0.01 | N/A |
| Target Network Soft Update | N/A | 0.001 |
| Action Noise | N/A | 0.1 |

Simulation results demonstrate the superiority of KPPO over PPO and DDPG across multiple tasks. In the speed control task, KPPO achieves higher reward values and faster convergence, as shown in Figure 1 (reward per episode). DDPG initially performs well but plateaus due to limited exploration, while KPPO’s gated KL penalty enables sustained improvement. Similarly, in the hover task, KPPO exhibits significant reward fluctuations early on, indicating active exploration, but stabilizes at higher performance levels. The height control task reveals that KPPO, though slower initially, surpasses other algorithms after approximately 600 episodes, leveraging its regularization mechanisms for robust learning. For multi-target tracking, DDPG shows higher rewards due to its off-policy sample efficiency, but KPPO maintains consistent policy convergence without the declines observed in PPO. The tracking task highlights KPPO’s gradual reward improvement, reflecting enhanced adaptability in complex environments.

To quantify performance, we analyze the average reward over multiple episodes (avg_running_reward), which smooths out random fluctuations. In speed control and hover tasks, KPPO’s average reward consistently outperforms PPO and DDPG. In height control, KPPO’s average reward initially lags but exceeds others after 600 episodes, showcasing its long-term optimization capability. The standard deviation of rewards, summarized in Table 3, indicates that KPPO has higher variability due to exploratory behavior, which is beneficial in complex quadrotor scenarios. In contrast, DDPG’s low deviation suggests confinement to local optima.
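The avg_running_reward curve described here is a trailing-window mean over episode rewards. A minimal sketch is below; the window length of 100 episodes is an assumption, not a value stated in the text:

```python
import numpy as np

def avg_running_reward(rewards, window=100):
    """Trailing-window mean of per-episode rewards; smooths out the
    random fluctuations visible in the raw reward curves."""
    rewards = np.asarray(rewards, dtype=float)
    return np.array([rewards[max(0, i - window + 1): i + 1].mean()
                     for i in range(len(rewards))])
```

The same array of episode rewards also yields the standard deviations reported in Table 3 via `np.std`.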

Table 3: Total Average Reward and Standard Deviation Comparison
| Task | KPPO Average | PPO Average | DDPG Average | KPPO Std Dev | PPO Std Dev | DDPG Std Dev |
|---|---|---|---|---|---|---|
| Height Control | -1322.185 | -3472.665 | -364.295 | 4031.546 | 11.052 | 0.647 |
| Multi-Target Tracking | -3542.908 | -3577.797 | -432.059 | 11.771 | 143.894 | 6.944 |
| Speed Control | 8879.512 | 3242.066 | 1504.431 | 11845.580 | 8519.012 | 3671.278 |
| Hover Task | -90.670 | -255.520 | -320.119 | 1764.684 | 686.389 | 27.569 |
| Tracking Task | -478.866 | -481.655 | -497.700 | 220.065 | 218.798 | 8.875 |

Trajectory analysis further validates KPPO’s effectiveness. In tasks like speed control and hovering, the quadrotor under KPPO control quickly converges to the target within tolerance bounds, ending episodes early. For instance, in hovering, KPPO achieves the goal fastest, while DDPG and PPO show similar, slower convergence. In height control, multi-target tracking, and single-target tracking, the quadrotor fails to reach the target within 400,000 steps, but KPPO’s trajectories exhibit reduced dispersion and faster stabilization compared to PPO and DDPG. The 3D trajectory plots, though overlapping, reveal that KPPO enables more efficient exploration and quicker entry into stable states. This is attributed to the threshold-triggered KL divergence, which balances exploration and exploitation, and the L2 regularization that prevents parameter divergence.

In conclusion, our KPPO algorithm significantly enhances quadrotor flight control by integrating a composite regularization mechanism. The threshold-triggered KL divergence penalty constrains policy updates to avoid instability, while the L2 regularization term improves parameter robustness. Physical simulations across five tasks confirm that KPPO achieves faster convergence, higher task completion rates, and better adaptation to complex environments than PPO and DDPG. For future work, we plan to incorporate prioritized experience replay to boost sample efficiency and employ TD(λ) methods for more stable reward estimation. These improvements will further solidify KPPO’s applicability in real-world quadrotor operations, advancing autonomous control in dynamic scenarios.
