In recent years, quadcopters have gained significant attention due to their versatility in applications such as aerial photography, logistics, and resource exploration. However, the nonlinear and highly coupled dynamics of quadcopters pose substantial challenges for designing reliable and stable control systems. Traditional control algorithms, while offering good stability, often struggle with disturbance rejection in dynamic environments. To address this limitation, we propose a hybrid control framework that combines nonlinear model predictive control (NMPC) with an improved deep reinforcement learning (DRL) compensator. This approach leverages the strengths of NMPC while enhancing robustness against disturbances through a novel twin delayed deep deterministic policy gradient (TD3) based compensator. Our method incorporates multi-head attention (MA) and long short-term memory (LSTM) networks into the Actor network of TD3, improving its ability to capture spatial and temporal dependencies. Additionally, we introduce a continuous logarithmic reward function to stabilize training and accelerate convergence. Through extensive simulations, we demonstrate the effectiveness of our approach in various scenarios, showing superior performance compared with compensators based on other DRL algorithms, namely DDPG, SAC, and PPO.
The quadcopter, as a common type of multi-rotor unmanned aerial vehicle, exhibits underactuated and nonlinear characteristics, making control design complex. We begin by modeling the quadcopter dynamics using Newton-Euler equations, representing attitudes with quaternions to avoid issues like gimbal lock. The state vector includes position, velocity, orientation, and angular velocity, while control inputs are the activation levels of the four rotors. The dynamics are governed by the following equations:
$$ \dot{p} = v $$
$$ \dot{q} = \frac{1}{2} R_{\omega} q $$
$$ \dot{v} = R_E^B a - g $$
$$ \dot{\omega}_x = \frac{1}{J_x} \left( \tau_x + (J_y - J_z) \omega_y \omega_z \right) $$
$$ \dot{\omega}_y = \frac{1}{J_y} \left( \tau_y + (J_z - J_x) \omega_z \omega_x \right) $$
$$ \dot{\omega}_z = \frac{1}{J_z} \left( \tau_z + (J_x - J_y) \omega_x \omega_y \right) $$
Here, $p$ denotes position, $v$ velocity, $q$ quaternion orientation, $\omega$ angular velocity, $J$ moments of inertia, $\tau$ torques, and $g$ gravity. The rotation matrix $R_E^B$ and angular velocity matrix $R_{\omega}$ are derived from quaternion representations. This model forms the basis for our NMPC controller, which optimizes control inputs over a finite horizon while handling constraints.
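To make the model concrete, the following sketch implements the continuous-time dynamics $f(X, U)$ in Python with NumPy. It assumes the quaternion is ordered $(q_w, q_x, q_y, q_z)$, that $a$ in the translational equation is the body-frame thrust acceleration $[0, 0, F/m]^{\top}$, and that the rotation matrix maps body-frame vectors to the world frame; the mapping from the four rotor activation levels to total thrust and body torques is not spelled out above, so it is left to the caller. These conventions and all names are illustrative, not taken from the paper's implementation.

```python
# Sketch of the continuous-time quadcopter dynamics f(X, U) from the equations
# above. Quaternion ordering and the frame of the thrust vector are assumptions.
import numpy as np

J = np.array([0.03, 0.03, 0.06])   # (J_x, J_y, J_z) from Table 1, kg*m^2
m, g = 1.0, 9.81                   # mass (kg) and gravity (m/s^2)

def quat_to_rot(q):
    """Rotation matrix taking body-frame vectors to the world frame, q = (w, x, y, z)."""
    w, x, y, z = q
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])

def quad_dynamics(x, thrust, tau):
    """State derivative for x = [p(3), q(4), v(3), omega(3)].

    `thrust` is the total rotor thrust (N) and `tau` the body torques (N*m);
    the mixing from the four activation levels to these quantities is left
    to the caller.
    """
    p, q, v, w = x[0:3], x[3:7], x[7:10], x[10:13]
    q = q / np.linalg.norm(q)                      # keep the quaternion normalized
    wx, wy, wz = w

    p_dot = v                                      # position kinematics
    # Quaternion kinematics: q_dot = 0.5 * Omega(omega) * q
    omega_mat = np.array([
        [0.0, -wx, -wy, -wz],
        [wx,  0.0,  wz, -wy],
        [wy, -wz,  0.0,  wx],
        [wz,  wy, -wx,  0.0],
    ])
    q_dot = 0.5 * omega_mat @ q

    a_body = np.array([0.0, 0.0, thrust / m])      # body-frame thrust acceleration
    v_dot = quat_to_rot(q) @ a_body - np.array([0.0, 0.0, g])

    # Euler's rotational equations with a diagonal inertia matrix
    w_dot = np.array([
        (tau[0] + (J[1] - J[2]) * wy * wz) / J[0],
        (tau[1] + (J[2] - J[0]) * wz * wx) / J[1],
        (tau[2] + (J[0] - J[1]) * wx * wy) / J[2],
    ])
    return np.concatenate([p_dot, q_dot, v_dot, w_dot])
```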

For the NMPC design, we discretize the nonlinear system using a sampling time $\Delta t$, resulting in a discrete-time model:
$$ X_{k+1} = X_k + \Delta t \cdot f(X_k, U_k) $$
$$ Y_k = X_k $$
The optimal control problem at each time step $k$ minimizes a cost function that penalizes deviations from the reference trajectory and control effort:
$$ \min_{U} \frac{1}{2} \sum_{i=1}^{N} \left( \| Y_{k+i} - \bar{Y}_{k+i} \|^2_{Q_t} + \| U_{k+i-1} \|^2_{Q_r} \right) $$
subject to:
$$ X_{k+1} = X_k + \Delta t \cdot f(X_k, U_k) $$
$$ Y_k = X_k $$
$$ X_0 = X_{\text{init}} $$
$$ u_{\min} \leq u \leq u_{\max} $$
We use CasADi and ACADOS with the HPIPM solver for efficient real-time optimization, employing a Gauss-Newton approximation and fourth-order Runge-Kutta integration. This setup ensures that the NMPC controller provides stable baseline performance but may falter under strong disturbances, necessitating the DRL compensator.
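ACADOS and HPIPM handle the integration and the Gauss-Newton least-squares cost internally; purely as an illustration of the quantities involved, the fragment below shows a fourth-order Runge-Kutta prediction step and the stage cost that is summed over the horizon, for a generic dynamics function $f$ and weight matrices $Q_t$ and $Q_r$. This is a minimal sketch, not the solver configuration used in the paper.

```python
# Illustrative discrete-time prediction and stage cost for the NMPC above;
# the actual optimization is performed by ACADOS/HPIPM, not by this fragment.
import numpy as np

def rk4_step(f, x, u, dt):
    """One 4th-order Runge-Kutta step (the integrator order quoted for the NMPC)."""
    k1 = f(x, u)
    k2 = f(x + 0.5 * dt * k1, u)
    k3 = f(x + 0.5 * dt * k2, u)
    k4 = f(x + dt * k3, u)
    return x + dt / 6.0 * (k1 + 2.0 * k2 + 2.0 * k3 + k4)

def horizon_cost(f, x0, U, Y_ref, Qt, Qr, dt):
    """0.5 * sum_i ( ||Y_{k+i} - Ybar_{k+i}||_Qt^2 + ||U_{k+i-1}||_Qr^2 ), with Y = X."""
    cost, x = 0.0, np.asarray(x0, dtype=float)
    for i, u in enumerate(U):
        x = rk4_step(f, x, u, dt)          # predicted state X_{k+i}
        e = x - Y_ref[i]                   # output error (Y_k = X_k here)
        cost += 0.5 * (e @ Qt @ e + u @ Qr @ u)
    return cost
```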
Our improved TD3-based compensator addresses disturbance rejection by generating control adjustments $\Delta u_i$ for the quadcopter’s rotors. The state space for the DRL agent combines the states of the leader (reference) quadcopter and the follower (controlled) quadcopter, each comprising position, Euler angles, velocity, and angular velocity, resulting in a 24-dimensional vector:
$$ S = (S_l, S_c) \in \mathbb{R}^{24} $$
The action space consists of compensation values for the four rotors:
$$ a = [\Delta u_1, \Delta u_2, \Delta u_3, \Delta u_4] $$
with $\Delta u_i \in [-1/5, 1/5]$; the control actually applied to rotor $i$ is $u_i + \Delta u_i$, where $u_i$ is the NMPC output. The reward function is designed as a bounded continuous logarithmic function to enhance training stability:
$$ \text{reward} = -\sum_{i=1}^{4} \lambda_i (\ln d_i - \ln b_i) $$
where $d_i$ represents the Euclidean distance between leader and follower states for position, Euler angles, velocity, and angular velocity; $\lambda_i$ are scaling factors; and $b_i$ are bounds. This function maps errors to a unified scale, promoting smoother learning.
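The sketch below gives one concrete reading of this reward: it computes the group-wise Euclidean distances between the leader and follower states and applies the logarithmic mapping. The split of each 12-dimensional state into four 3-dimensional groups follows the description above, while the numerical values of $\lambda_i$ and $b_i$ are placeholders.

```python
# Sketch of the continuous logarithmic reward; lambda_i and b_i values are placeholders.
import numpy as np

LAMBDAS = np.array([1.0, 1.0, 1.0, 1.0])   # scaling factors lambda_i (illustrative)
BOUNDS  = np.array([5.0, 3.14, 5.0, 5.0])  # bounds b_i per state group (illustrative)
EPS = 1e-6                                  # keeps the log finite when d_i is near zero

def log_reward(s_leader, s_follower):
    """reward = -sum_i lambda_i * (ln d_i - ln b_i), one term per state group."""
    groups = [(0, 3), (3, 6), (6, 9), (9, 12)]  # position, Euler angles, velocity, angular velocity
    reward = 0.0
    for (lo, hi), lam, b in zip(groups, LAMBDAS, BOUNDS):
        d = np.linalg.norm(s_leader[lo:hi] - s_follower[lo:hi]) + EPS
        reward -= lam * (np.log(d) - np.log(b))
    return reward
```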
Training involves random task scenarios with stochastic disturbances to improve generalization. Disturbances are triggered with a 5% probability at each time step and persist for a fixed duration, perturbing the control input as $u_k \leftarrow u_k + \varepsilon_k$ with $\varepsilon_k \sim \mathcal{U}(m_1, m_2)$. Reference trajectories are randomized using spherical coordinates with uniform sampling for initial positions and angles, ensuring diverse exploration.
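A minimal sketch of this disturbance injection is shown below; the 5% trigger probability and the uniform noise follow the description above, while the window length and the bounds $m_1$, $m_2$ are placeholder values.

```python
# Sketch of stochastic disturbance injection during training; the 5% trigger
# probability and uniform noise follow the text, the remaining values are placeholders.
import numpy as np

rng = np.random.default_rng()

def maybe_disturb(u, active_steps, m1=-0.1, m2=0.1, p=0.05, duration=20):
    """Apply u <- u + eps with eps ~ U(m1, m2) while a disturbance window is active.

    `active_steps` counts the remaining disturbed steps; a new window starts with
    probability p at each control step. m1, m2, and duration are illustrative.
    """
    if active_steps == 0 and rng.random() < p:
        active_steps = duration
    if active_steps > 0:
        u = u + rng.uniform(m1, m2, size=u.shape)
        active_steps -= 1
    return u, active_steps
```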
The MALSTM-TD3 architecture enhances the TD3 Actor network by integrating multi-head attention and LSTM. The state input is divided into four feature groups (position, orientation, velocity, angular velocity), each processed by fully connected layers to 128-dimensional vectors. These are fed into a multi-head attention block with four heads, producing a $1 \times 4 \times 128$ output. An LSTM layer then captures temporal dependencies, followed by fully connected layers that output the action vector. This design improves the quadcopter’s ability to handle spatial and temporal correlations in dynamic environments.
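One plausible PyTorch rendering of this Actor is sketched below. It assumes the 24-dimensional state splits into four 6-dimensional groups (the leader and follower parts of each quantity), uses 128-dimensional embeddings, four attention heads, and a tanh output scaled to the $\pm 1/5$ compensation range; layer sizes not stated above, such as the 64-unit hidden layer in the output head, are guesses rather than the paper's exact design.

```python
# Sketch of the MALSTM-TD3 Actor: per-group embeddings -> multi-head attention
# -> LSTM -> action head. Layer sizes not stated in the text are assumptions.
import torch
import torch.nn as nn

class MALSTMActor(nn.Module):
    def __init__(self, group_dim=6, embed_dim=128, num_heads=4, action_dim=4, max_delta=0.2):
        super().__init__()
        # One fully connected embedding per feature group (position, orientation,
        # velocity, angular velocity), each mapped to a 128-dimensional vector.
        self.embed = nn.ModuleList(
            [nn.Sequential(nn.Linear(group_dim, embed_dim), nn.ReLU()) for _ in range(4)]
        )
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.lstm = nn.LSTM(embed_dim, embed_dim, batch_first=True)
        self.head = nn.Sequential(nn.Linear(embed_dim, 64), nn.ReLU(), nn.Linear(64, action_dim))
        self.max_delta = max_delta  # compensation range +-1/5

    def forward(self, state):
        # state: (batch, 24) -> four groups of 6 features each (assumed split)
        groups = torch.split(state, 6, dim=-1)
        tokens = torch.stack([f(g) for f, g in zip(self.embed, groups)], dim=1)  # (B, 4, 128)
        attn_out, _ = self.attn(tokens, tokens, tokens)                          # (B, 4, 128)
        lstm_out, _ = self.lstm(attn_out)                                        # (B, 4, 128)
        action = torch.tanh(self.head(lstm_out[:, -1, :]))                       # last LSTM step
        return self.max_delta * action                                           # Delta u in [-1/5, 1/5]

# Example: a batch of two 24-dimensional states
actor = MALSTMActor()
print(actor(torch.randn(2, 24)).shape)  # torch.Size([2, 4])
```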
We conduct simulations to evaluate our approach, comparing NMPC-MALSTM-TD3 against variants using DDPG, SAC, TD3, and PPO as compensators. The quadcopter parameters are summarized in Table 1:
| Parameter | Value | Unit |
|---|---|---|
| Mass ($m$) | 1.0 | kg |
| Max Thrust ($F_m$) | 20 | N |
| Moment of Inertia ($J_x$, $J_y$) | 0.03 | kg·m² |
| Moment of Inertia ($J_z$) | 0.06 | kg·m² |
| Arm Length ($l$) | 0.235 | m |
| Torque Coefficient ($c$) | 0.013 | m |
| Gravity ($g$) | 9.81 | m/s² |
Training parameters for MALSTM-TD3 are listed in Table 2:
| Parameter | Value |
|---|---|
| Discount Factor ($\gamma$) | 0.99 |
| Learning Rate | 0.001 |
| Replay Buffer Size | 1,000,000 |
| Batch Size | 128 |
| Exploration Noise | 0.1 |
| Target Policy Noise Std | 0.15 |
| Noise Clip Threshold | 0.35 |
| Target Update Rate ($\tau$) | 0.005 |
| Max Training Episodes | 3,200 |
| Max Steps per Episode | 200 |
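For context, the snippet below indicates where the Table 2 values enter a standard TD3 update: target-policy smoothing with noise standard deviation 0.15 clipped at 0.35, the discount factor $\gamma = 0.99$, and the soft target update rate $\tau = 0.005$. This is generic TD3 bookkeeping rather than code from the paper.

```python
# Where the Table 2 hyperparameters appear in a standard TD3 update (generic sketch).
import torch

GAMMA, TAU = 0.99, 0.005            # discount factor and soft target update rate
POLICY_NOISE, NOISE_CLIP = 0.15, 0.35
MAX_DELTA = 0.2                     # action bound (+-1/5 compensation)

def td3_target_q(reward, done, next_state, actor_t, critic1_t, critic2_t):
    """Clipped double-Q target with target-policy smoothing."""
    next_action = actor_t(next_state)
    noise = (torch.randn_like(next_action) * POLICY_NOISE).clamp(-NOISE_CLIP, NOISE_CLIP)
    next_action = (next_action + noise).clamp(-MAX_DELTA, MAX_DELTA)
    q_next = torch.min(critic1_t(next_state, next_action),
                       critic2_t(next_state, next_action))
    return reward + GAMMA * (1.0 - done) * q_next

def soft_update(target_net, net):
    """theta_target <- tau * theta + (1 - tau) * theta_target."""
    for p_t, p in zip(target_net.parameters(), net.parameters()):
        p_t.data.mul_(1.0 - TAU).add_(TAU * p.data)
```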
In experiments, we test the quadcopter on hexagonal, circular, and square trajectories with disturbances applied along the x-axis. The NMPC-MALSTM-TD3 controller shows the best overall performance, particularly in maintaining z-axis position stability. For instance, in the hexagonal trajectory task, the average state errors are compared in Table 3:
| Model | x (m) | y (m) | z (m) | $\phi$ (rad) | $\theta$ (rad) | $\psi$ (rad) |
|---|---|---|---|---|---|---|
| TD3 | 0.3888 | 0.3753 | 0.6474 | 0.3402 | 0.3313 | 2.1942 |
| DDPG | 1.5957 | 1.8181 | 2.6195 | 0.9255 | 0.9144 | 3.8705 |
| SAC | 0.4064 | 0.4277 | 3.3927 | 0.3568 | 0.4031 | 0.1561 |
| MALSTM-TD3 | 0.4056 | 0.4006 | 0.3303 | 0.2892 | 0.4030 | 1.1201 |
MALSTM-TD3 achieves lower errors in critical dimensions like z-position and roll angle, demonstrating its robustness. Training curves reveal that MALSTM-TD3 converges faster and more stably than other algorithms, with higher cumulative rewards. Real-time performance tests on a CPU platform show that the addition of the MALSTM-TD3 compensator increases computation time by less than 10%, maintaining feasibility for real-world quadcopter applications. The average runtime for NMPC is 2.11 seconds per episode, while NMPC-MALSTM-TD3 takes 2.29 seconds, both within acceptable limits for real-time control.
In conclusion, our hybrid control strategy effectively enhances disturbance rejection in quadcopters without compromising the stability of traditional NMPC. The integration of MA and LSTM into TD3 improves spatial and temporal learning, while the novel reward function and training scenarios boost generalization. Future work will focus on developing compensators that do not require pre-recorded leader states, increasing flexibility for diverse quadcopter missions. This approach holds promise for broader applications in autonomous systems where robustness to disturbances is critical.
