In recent years, quadrotor unmanned aerial vehicles have gained significant attention due to their versatility in applications such as surveillance, delivery, and environmental monitoring. However, maintaining stable attitude control for a quadrotor in dynamic and complex environments remains a challenging task. Traditional control methods, including proportional-integral-derivative (PID) algorithms and their variants, model predictive control (MPC), and backstepping techniques, are commonly employed in commercial quadrotor systems. Despite their maturity, these approaches often rely on extensive trial-and-error tuning and expert knowledge, which limits their adaptability to rapidly changing conditions and diverse quadrotor configurations. To address these limitations, we integrate modern machine learning techniques, specifically deep reinforcement learning, into quadrotor attitude control. This paper proposes a hybrid control strategy that combines cascade PID with the deep deterministic policy gradient (DDPG) algorithm to enhance adaptive parameter tuning and stability for quadrotor attitude control. By leveraging the Actor-Critic framework, our method improves learning efficiency and reduces dependency on manual tuning, ultimately achieving superior performance in high-dynamic scenarios.
The core of our approach lies in modeling the quadrotor dynamics and designing a controller that autonomously adjusts PID gains based on environmental feedback. We begin by establishing a comprehensive dynamic model of the quadrotor using Newton-Euler equations, which describe the forces and moments acting on the vehicle. This model forms the foundation for simulating quadrotor behavior and training our reinforcement learning agent. Subsequently, we define the state and action spaces for the DDPG algorithm, where the actions correspond to the adaptive gains of the cascade PID controller. A multi-objective reward function is designed to promote stability, minimize errors, and ensure safe operation during training. Experimental results demonstrate that our DDPG-based cascade PID controller outperforms traditional methods in terms of convergence speed, attitude tracking accuracy, and disturbance rejection. This work highlights the potential of deep reinforcement learning in advancing quadrotor control systems and paves the way for more robust autonomous flight.
To provide a visual reference, the following figure illustrates a typical quadrotor structure, which consists of four rotors arranged in an “X” configuration. This setup allows for controlled movement through variations in rotor speeds, enabling precise attitude adjustments.

Dynamic Modeling of the Quadrotor
Accurate dynamic modeling is essential for effective quadrotor control. We consider an “X”-type quadrotor and define two coordinate systems: the Earth frame (inertial coordinate system) and the Body frame (body-fixed coordinate system). The Earth frame, denoted as $$O_e = (x_e, y_e, z_e)$$, serves as a global reference, while the Body frame, $$O_b = (x_b, y_b, z_b)$$, is attached to the quadrotor’s center of mass. The quadrotor’s attitude is represented by Euler angles $$\Theta = (\phi, \theta, \psi)$$, where $$\phi$$ is the roll angle, $$\theta$$ is the pitch angle, and $$\psi$$ is the yaw angle. Each rotor generates a thrust force $$T_i$$ ($$i = 1, 2, 3, 4$$), and the transformation between coordinate systems is achieved through rotation matrices.
Using the Newton-Euler formulation, the equations of motion for the quadrotor can be derived. The translational dynamics in the Earth frame are governed by the following equation:
$$ m \ddot{\mathbf{p}} = \mathbf{F} - m \mathbf{g} + \mathbf{F}_d $$
where $$m$$ is the mass of the quadrotor, $$\mathbf{p} = [x, y, z]^T$$ is the position vector in the Earth frame, $$\mathbf{F}$$ is the total thrust vector expressed in the Earth frame (the body-frame thrust rotated through the body-to-Earth rotation matrix), $$\mathbf{g} = [0, 0, g]^T$$ is the gravity vector, and $$\mathbf{F}_d$$ represents external disturbances. The rotational dynamics are described by:
$$ \mathbf{J} \dot{\boldsymbol{\omega}} + \boldsymbol{\omega} \times \mathbf{J} \boldsymbol{\omega} = \boldsymbol{\tau} $$
Here, $$\mathbf{J}$$ is the inertia matrix, $$\boldsymbol{\omega} = [p, q, r]^T$$ is the angular velocity vector in the Body frame, and $$\boldsymbol{\tau} = [\tau_\phi, \tau_\theta, \tau_\psi]^T$$ is the torque vector. The total thrust and torques are related to the rotor thrusts by:
$$ \begin{bmatrix} F \\ \tau_\phi \\ \tau_\theta \\ \tau_\psi \end{bmatrix} = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 0 & -l & 0 & l \\ l & 0 & -l & 0 \\ -c & c & -c & c \end{bmatrix} \begin{bmatrix} T_1 \\ T_2 \\ T_3 \\ T_4 \end{bmatrix} $$
where $$l$$ is the effective moment arm from the center of mass to each rotor and $$c$$ is the drag-to-thrust coefficient that maps a rotor’s thrust to its reaction torque about the yaw axis. For simplicity, we assume a symmetric quadrotor with identical rotors, leading to a diagonal inertia matrix $$\mathbf{J} = \text{diag}(J_{xx}, J_{yy}, J_{zz})$$. The kinematics relating the Euler-angle rates to the body angular velocities are given by:
$$ \dot{\Theta} = \mathbf{R} \boldsymbol{\omega} $$
with $$\mathbf{R}$$ being the Euler-rate transformation matrix that maps body angular rates to Euler-angle rates (distinct from the body-to-Earth rotation matrix). This model captures the essential dynamics of the quadrotor and serves as the basis for controller design and simulation.
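For concreteness, the following Python sketch shows one way this model can be stepped forward in simulation. It is a minimal sketch, not the exact code used in our experiments: the explicit-Euler integration, the time step, and the drag coefficient value are illustrative assumptions, while the thrust/torque allocation follows the matrix given above.

```python
# Minimal simulation sketch of the Newton-Euler model above (illustrative values).
import numpy as np

m, g, l, c = 1.41, 9.81, 0.24, 0.01          # mass [kg], gravity [m/s^2], arm [m], drag coeff (assumed)
J = np.diag([0.05, 0.05, 0.10])              # inertia matrix [kg*m^2]

def rotation_body_to_earth(phi, theta, psi):
    """ZYX Euler-angle rotation matrix mapping body-frame vectors into the Earth frame."""
    cph, sph = np.cos(phi), np.sin(phi)
    cth, sth = np.cos(theta), np.sin(theta)
    cps, sps = np.cos(psi), np.sin(psi)
    return np.array([
        [cth*cps, sph*sth*cps - cph*sps, cph*sth*cps + sph*sps],
        [cth*sps, sph*sth*sps + cph*cps, cph*sth*sps - sph*cps],
        [-sth,    sph*cth,               cph*cth              ]])

def euler_rate_matrix(phi, theta):
    """Matrix R relating body rates [p, q, r] to Euler-angle rates."""
    return np.array([
        [1.0, np.sin(phi)*np.tan(theta), np.cos(phi)*np.tan(theta)],
        [0.0, np.cos(phi),               -np.sin(phi)             ],
        [0.0, np.sin(phi)/np.cos(theta), np.cos(phi)/np.cos(theta)]])

def dynamics_step(state, thrusts, dt=0.002):
    """One explicit-Euler step of the translational and rotational dynamics."""
    p, v, Theta, omega = state                       # position, velocity, Euler angles, body rates
    T1, T2, T3, T4 = thrusts
    F = T1 + T2 + T3 + T4                            # total thrust along the body z-axis
    tau = np.array([l*(-T2 + T4),                    # roll torque (row 2 of the allocation matrix)
                    l*( T1 - T3),                    # pitch torque (row 3)
                    c*(-T1 + T2 - T3 + T4)])         # yaw torque (row 4)
    R_be = rotation_body_to_earth(*Theta)
    acc = (R_be @ np.array([0.0, 0.0, F])) / m - np.array([0.0, 0.0, g])
    omega_dot = np.linalg.solve(J, tau - np.cross(omega, J @ omega))
    Theta_dot = euler_rate_matrix(Theta[0], Theta[1]) @ omega
    return (p + v*dt, v + acc*dt, Theta + Theta_dot*dt, omega + omega_dot*dt)
```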
Controller Design with DDPG and Cascade PID
To achieve adaptive attitude control for the quadrotor, we propose a hybrid controller that integrates cascade PID with the DDPG algorithm. Cascade PID is widely used in quadrotor systems due to its simplicity and effectiveness in handling inner and outer loop control. The outer loop typically manages position and velocity, while the inner loop regulates attitude and angular rates. However, fixed PID gains may not suffice under dynamic conditions. Our approach uses DDPG to continuously adjust these gains, enhancing the quadrotor’s adaptability.
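The sketch below illustrates this cascade structure in Python. It is only a structural sketch: the placeholder gain values, the PD form of the inner loop, the flat gain layout in `set_gains`, and the simplified mapping from position error to attitude correction are assumptions for illustration; in our framework, the DDPG agent described next overwrites the gain attributes at each adjustment step.

```python
import numpy as np

class PIDLoop:
    """Single-channel PID term with internal integral and derivative state."""
    def __init__(self, kp, ki=0.0, kd=0.0):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral, self.prev_error = 0.0, 0.0

    def __call__(self, error, dt):
        self.integral += error * dt
        derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

class CascadeController:
    """Outer position/velocity loop feeding an inner attitude/rate loop."""
    def __init__(self):
        # Outer-loop PID per translational channel (gain values are placeholders).
        self.outer = [PIDLoop(kp=2.0, ki=0.01, kd=1.0) for _ in range(3)]
        # Inner-loop PD per attitude channel (roll, pitch, yaw).
        self.inner = [PIDLoop(kp=1.5, kd=0.2) for _ in range(3)]

    def set_gains(self, gains):
        """Apply a flat gain vector produced by the tuning agent (layout is illustrative)."""
        for loop, kp in zip(self.outer + self.inner, gains):
            loop.kp = float(kp)

    def update(self, pos_err, att_err, dt):
        # Outer loop turns position errors into attitude setpoint corrections; the
        # geometric mapping of x/y error onto pitch/roll is omitted for brevity.
        att_correction = np.array([loop(e, dt) for loop, e in zip(self.outer, pos_err)])
        # Inner loop turns the corrected attitude errors into body torques.
        return np.array([loop(e, dt) for loop, e in zip(self.inner, att_err + att_correction)])
```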
The DDPG algorithm is a model-free, off-policy reinforcement learning method that operates on continuous action spaces. It employs an Actor-Critic architecture, where the Actor network (policy network) maps states to actions, and the Critic network (value network) evaluates the quality of actions given states. Both networks have target counterparts to stabilize training. In our setup, the state space for the DDPG agent is defined as:
$$ \mathbf{s} = [\mathbf{e}_{pos}, \mathbf{e}_{vel}, \mathbf{e}_{\Theta}, \mathbf{e}_{\omega}, \boldsymbol{\omega}] $$
where $$\mathbf{e}_{pos}$$ is the position error, $$\mathbf{e}_{vel}$$ is the velocity error, $$\mathbf{e}_{\Theta}$$ is the attitude angle error, $$\mathbf{e}_{\omega}$$ is the angular velocity error, and $$\boldsymbol{\omega}$$ is the current angular velocity. This state representation provides comprehensive feedback on the quadrotor’s performance. The action space consists of the PID gains for both inner and outer loops:
$$ \mathbf{a} = [K_{ix}, K_{iy}, K_{iz}, K_{i\phi}, K_{i\theta}, K_{i\psi}] $$
where the subscript $$i$$ indexes the tunable gain types (proportional, integral, derivative) of the loop associated with each axis. These gains are bounded based on expert experience to ensure initial stability: outer-loop gains $$K_p \in (0.5, 20)$$, $$K_i \in (0.001, 0.5)$$, $$K_d \in (0.1, 10)$$, and inner-loop gains $$K_p \in (0.1, 10)$$, $$K_d \in (0.01, 5)$$. The Actor network outputs actions according to a deterministic policy $$\mu(\mathbf{s} | \theta^\mu)$$, where $$\theta^\mu$$ are the network parameters. The Critic network estimates the Q-value $$Q(\mathbf{s}, \mathbf{a} | \theta^Q)$$, which guides the policy updates.
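A minimal TensorFlow/Keras sketch of these networks is given below. Only the overall Actor-Critic structure follows the description above; the layer sizes, the tanh-based rescaling to the gain bounds, and the exact per-channel bounds in `A_LOW`/`A_HIGH` are illustrative assumptions rather than the reported implementation.

```python
import numpy as np
import tensorflow as tf

STATE_DIM = 15                      # e_pos, e_vel, e_Theta, e_omega, omega (3 components each)
ACTION_DIM = 6                      # one tunable gain per axis/loop channel (illustrative layout)
# Per-channel lower/upper bounds, e.g. taken from the expert-experience ranges above (assumed).
A_LOW = np.array([0.5, 0.5, 0.5, 0.1, 0.1, 0.1], dtype=np.float32)
A_HIGH = np.array([20.0, 20.0, 20.0, 10.0, 10.0, 10.0], dtype=np.float32)

def build_actor():
    """mu(s | theta_mu): maps a state to bounded PID gains via a tanh squashing layer."""
    s = tf.keras.Input(shape=(STATE_DIM,))
    x = tf.keras.layers.Dense(256, activation="relu")(s)
    x = tf.keras.layers.Dense(256, activation="relu")(x)
    raw = tf.keras.layers.Dense(ACTION_DIM, activation="tanh")(x)       # values in (-1, 1)
    lo, hi = tf.constant(A_LOW), tf.constant(A_HIGH)
    gains = tf.keras.layers.Lambda(lambda t: lo + 0.5 * (t + 1.0) * (hi - lo))(raw)
    return tf.keras.Model(s, gains)

def build_critic():
    """Q(s, a | theta_Q): scores a state-action pair with a single scalar value."""
    s = tf.keras.Input(shape=(STATE_DIM,))
    a = tf.keras.Input(shape=(ACTION_DIM,))
    x = tf.keras.layers.Concatenate()([s, a])
    x = tf.keras.layers.Dense(256, activation="relu")(x)
    x = tf.keras.layers.Dense(256, activation="relu")(x)
    q = tf.keras.layers.Dense(1)(x)
    return tf.keras.Model([s, a], q)

actor, critic = build_actor(), build_critic()
target_actor, target_critic = build_actor(), build_critic()
target_actor.set_weights(actor.get_weights())       # target networks start as copies
target_critic.set_weights(critic.get_weights())
```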
Training the DDPG agent requires a well-designed reward function to encourage desired behaviors. We formulate a multi-objective reward function as follows:
$$ R = k_1 R_1 + k_2 R_2 + k_3 R_3 $$
where $$R_1$$ penalizes large errors in position and attitude, $$R_2$$ encourages smooth control by penalizing abrupt changes in angular velocities, and $$R_3$$ ensures that roll and pitch angles remain within safe limits (e.g., $$\pm 30^\circ$$). The weights $$k_i$$ are tuned to balance these objectives. Specifically, $$R_1$$ is defined as:
$$ R_1 = – ( \mathbf{e}_{pos}^T \mathbf{e}_{pos} + \mathbf{e}_{\Theta}^T \mathbf{e}_{\Theta} ) $$
$$R_2$$ is based on the derivative of angular velocities:
$$ R_2 = – \dot{\boldsymbol{\omega}}^T \dot{\boldsymbol{\omega}} $$
and $$R_3$$ imposes a penalty if the angles exceed thresholds:
$$ R_3 = \begin{cases} 0 & \text{if } |\phi| \leq 30^\circ \text{ and } |\theta| \leq 30^\circ \\ -1 & \text{otherwise} \end{cases} $$
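The reward terms above translate directly into code. In the sketch below, the weight values and the finite-difference estimate of $$\dot{\boldsymbol{\omega}}$$ are illustrative assumptions; only the three-term structure follows the definitions above.

```python
import numpy as np

K1, K2, K3 = 1.0, 0.1, 1.0                      # illustrative stand-ins for the weights k1, k2, k3
ANGLE_LIMIT = np.deg2rad(30.0)                  # safety envelope for roll and pitch [rad]

def reward(e_pos, e_theta, omega, omega_prev, phi, theta, dt):
    # R1: quadratic penalty on position and attitude errors.
    r1 = -(np.dot(e_pos, e_pos) + np.dot(e_theta, e_theta))
    # R2: penalty on angular acceleration, approximated by finite differences.
    omega_dot = (omega - omega_prev) / dt
    r2 = -np.dot(omega_dot, omega_dot)
    # R3: safety term, zero inside the +/-30 degree envelope and -1 outside it.
    r3 = 0.0 if (abs(phi) <= ANGLE_LIMIT and abs(theta) <= ANGLE_LIMIT) else -1.0
    return K1 * r1 + K2 * r2 + K3 * r3
```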
This reward structure promotes stable and safe quadrotor operation during learning. In the overall control framework, the DDPG agent interacts with the quadrotor environment in a closed loop, adjusting the cascade PID gains in real time; the main DDPG hyperparameters used during training are listed in the table below.
| Parameter | Symbol | Value |
|---|---|---|
| Learning Rate (Actor) | $$\alpha_\mu$$ | 0.001 |
| Learning Rate (Critic) | $$\alpha_Q$$ | 0.002 |
| Discount Factor | $$\gamma$$ | 0.99 |
| Replay Buffer Size | — | $$10^6$$ |
| Batch Size | — | 64 |
| Soft Update Rate | $$\tau$$ | 0.005 |
| Exploration Noise | — | Ornstein-Uhlenbeck |
The training process involves simulating the quadrotor dynamics in episodes, where each episode terminates if the quadrotor violates safety constraints or completes a trajectory. The DDPG agent collects experiences in the form of tuples $$(\mathbf{s}, \mathbf{a}, r, \mathbf{s}')$$ and updates its networks using gradient descent. Over time, the agent learns to optimize the PID gains for improved quadrotor performance.
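The following sketch outlines one way to implement this update loop in TensorFlow, reusing the actor and critic builders from the earlier sketch and the hyperparameters in the table above. The Ornstein-Uhlenbeck noise parameters and the shape of the replay-buffer tuples are assumptions for illustration, not the exact code used in our experiments.

```python
import collections, random
import numpy as np
import tensorflow as tf

GAMMA, TAU, BATCH = 0.99, 0.005, 64
actor_opt = tf.keras.optimizers.Adam(learning_rate=1e-3)
critic_opt = tf.keras.optimizers.Adam(learning_rate=2e-3)
buffer = collections.deque(maxlen=1_000_000)      # replay buffer of (s, a, r, s', done) tuples

def ou_noise(prev, theta=0.15, sigma=0.2, dt=0.01):
    """One Euler step of an Ornstein-Uhlenbeck process used as exploration noise (assumed params)."""
    return prev - theta * prev * dt + sigma * np.sqrt(dt) * np.random.randn(*prev.shape)

def soft_update(target, source, tau=TAU):
    """theta_target <- tau * theta_source + (1 - tau) * theta_target."""
    target.set_weights([tau * w + (1.0 - tau) * tw
                        for w, tw in zip(source.get_weights(), target.get_weights())])

def train_step():
    """One DDPG gradient update from a minibatch sampled out of the replay buffer."""
    batch = random.sample(buffer, BATCH)
    s, a, r, s2, done = [np.array(x, dtype=np.float32) for x in zip(*batch)]
    r, done = r.reshape(-1, 1), done.reshape(-1, 1)
    # Critic update: regress Q(s, a) toward the bootstrapped target from the target networks.
    q_target = r + GAMMA * (1.0 - done) * target_critic([s2, target_actor(s2)]).numpy()
    with tf.GradientTape() as tape:
        q_loss = tf.reduce_mean(tf.square(critic([s, a]) - q_target))
    critic_opt.apply_gradients(zip(tape.gradient(q_loss, critic.trainable_variables),
                                   critic.trainable_variables))
    # Actor update: ascend the critic's estimate of Q(s, mu(s)).
    with tf.GradientTape() as tape:
        actor_loss = -tf.reduce_mean(critic([s, actor(s)]))
    actor_opt.apply_gradients(zip(tape.gradient(actor_loss, actor.trainable_variables),
                                  actor.trainable_variables))
    soft_update(target_actor, actor)
    soft_update(target_critic, critic)
```

During rollout, the agent would perturb `actor(state[None, :])[0]` with the noise process, append the resulting $$(\mathbf{s}, \mathbf{a}, r, \mathbf{s}')$$ tuples to `buffer`, and call `train_step()` once enough samples have been collected.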
Experimental Setup and Results Analysis
We conducted extensive simulations to evaluate the effectiveness of our DDPG-based cascade PID controller for quadrotor attitude control. The quadrotor model parameters are based on a typical configuration: mass $$m = 1.41 \, \text{kg}$$, arm length $$l = 0.24 \, \text{m}$$, and gravitational acceleration $$g = 9.81 \, \text{m/s}^2$$. The inertia matrix is set to $$\mathbf{J} = \text{diag}(0.05, 0.05, 0.1) \, \text{kg·m}^2$$. Training was performed in a simulated environment built in Python, utilizing TensorFlow for implementing the DDPG networks.
The reward function weights were assigned as $$k_1 = -2$$, $$k_2 = -0.5$$, $$k_3 = -3$$, with additional terms for velocity and angular rate errors. The target attitude for training was set to $$(0^\circ, 0^\circ, 30^\circ)$$ for roll, pitch, and yaw, respectively, and initial attitudes were randomly sampled within safe bounds. Training progressed over 500 episodes, with each episode lasting until a termination condition was met. The total reward per episode was recorded to monitor convergence.
The training progress, summarized in the following table as average rewards over episode ranges, indicates that the DDPG agent successfully learned to maximize the return. After approximately 150 episodes, the reward stabilized around -45, with minor fluctuations due to exploration noise, demonstrating that the algorithm converges to an effective policy for quadrotor control.
| Episode Range | Average Reward | Comments |
|---|---|---|
| 1-50 | -120 to -90 | Initial exploration, high errors |
| 51-150 | -90 to -50 | Rapid improvement in policy |
| 151-380 | -50 to -45 | Stabilization with minor noise |
| 381-500 | -45 to -44 | Converged performance |
To assess the controller’s robustness, we designed a spiral ascent trajectory tracking experiment with a constant desired yaw of $$\psi_d = 30^\circ$$. The quadrotor was commanded to follow a path defined by:
$$ x_d(t) = 0.5 t \cos(t), \quad y_d(t) = 0.5 t \sin(t), \quad z_d(t) = 0.1 t $$
A short sketch of how such a reference can be generated is given below. We compared our DDPG-based cascade PID controller against a conventional cascade PID controller with fixed gains; the results, summarized in the table following the sketch, highlight the superiority of our approach in terms of tracking accuracy and response time.
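The reference trajectory can be sampled as in the sketch below; the 100 Hz sampling rate and the returned (position, yaw) layout are assumptions for illustration.

```python
import numpy as np

def spiral_reference(t):
    """Desired position [x_d, y_d, z_d] and yaw psi_d of the spiral ascent at time t [s]."""
    x_d = 0.5 * t * np.cos(t)
    y_d = 0.5 * t * np.sin(t)
    z_d = 0.1 * t
    psi_d = np.deg2rad(30.0)
    return np.array([x_d, y_d, z_d]), psi_d

# Sample the reference at 100 Hz over the 20 s tracking window (assumed rate).
times = np.arange(0.0, 20.0, 0.01)
positions = np.stack([spiral_reference(t)[0] for t in times])
```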
| Metric | Traditional Cascade PID | DDPG-Cascade PID |
|---|---|---|
| Rise Time (s) | 2.5 | 1.8 |
| Overshoot (%) | 15 | 5 |
| Steady-State Error | 0.05 rad | 0.01 rad |
| Disturbance Rejection | Moderate | High |
As observed, the DDPG-enhanced controller achieved a faster response with reduced overshoot and nearly zero steady-state error. For instance, the pitch angle tracking converged smoothly to the desired value within 2 seconds, whereas the traditional PID exhibited oscillations. Similarly, the roll angle response demonstrated improved stability under the hybrid controller. After 20 seconds of tracking, the quadrotor successfully entered a hover state, with the total thrust balancing its weight of approximately 13.8 N and with minimal attitude deviations.
These findings confirm that the DDPG algorithm effectively adapts the PID gains to dynamic conditions, enhancing the quadrotor’s performance. The use of a multi-objective reward function contributed to this robustness by balancing error minimization with control smoothness and safety.
Conclusion and Future Work
In this paper, we presented a novel hybrid control strategy for quadrotor attitude control that combines the stability of cascade PID with the adaptability of deep reinforcement learning. By modeling the quadrotor dynamics using Newton-Euler equations and designing a DDPG-based tuning mechanism, we enabled real-time adjustment of PID gains without relying on expert intervention. The state and action spaces were carefully defined to capture essential quadrotor states and control parameters, while the reward function promoted safe and efficient learning. Experimental results validated our approach, showing significant improvements in convergence speed, tracking accuracy, and disturbance rejection compared to traditional methods.
Looking ahead, we plan to extend this work to address more complex scenarios, such as quadrotor control under center-of-gravity shifts or non-standard configurations. This will involve refining the reward function and exploring transfer learning techniques to enhance the controller’s generalization capabilities. Additionally, implementing the algorithm on physical quadrotor hardware will be crucial for real-world validation. We believe that deep reinforcement learning holds great promise for advancing autonomous quadrotor systems, and our method provides a solid foundation for future research in adaptive control.
