Enhanced Deep Deterministic Policy Gradient for Adaptive Quadrotor Trajectory Control

Precise autonomous control underpins mission success for drones operating in complex environments. Traditional quadrotor controllers often rely on manual parameter tuning, which is labor-intensive and yields fixed parameters that cannot adapt to changing conditions, leading to suboptimal performance. This study proposes an enhanced Deep Deterministic Policy Gradient (DDPG) strategy that adaptively optimizes the critical controller parameters. By combining a structured reward function with a hybrid prioritized experience replay mechanism, the proposed method improves training efficiency, tracking accuracy, and robustness against external disturbances relative to the standard DDPG.

1. Quadrotor Dynamic Model

To develop a robust control system, we first establish a comprehensive dynamic model of the Quadrotor Unmanned Aerial Vehicle (UAV). Assuming structural symmetry and negligible aerodynamic effects at low altitudes, the Newton-Euler formalism yields the following equations of motion for position and attitude:

$$
\begin{aligned}
\ddot{x} &= \frac{U_1}{m}(\cos\phi\sin\theta\cos\psi+\sin\phi\sin\psi)-\frac{k_1}{m}\dot{x}+d_x \\
\ddot{y} &= \frac{U_1}{m}(\cos\phi\sin\theta\sin\psi-\sin\phi\cos\psi)-\frac{k_2}{m}\dot{y}+d_y \\
\ddot{z} &= \frac{U_1}{m}\cos\phi\cos\theta-g-\frac{k_3}{m}\dot{z}+d_z \\
\ddot{\phi} &= \frac{J_{yy}-J_{zz}}{J_{xx}}\dot{\theta}\dot{\psi}-\frac{J_r}{J_{xx}}\dot{\theta}\Omega_e+\frac{l}{J_{xx}}U_2-\frac{k_4}{J_{xx}}\dot{\phi}+d_\phi \\
\ddot{\theta} &= \frac{J_{zz}-J_{xx}}{J_{yy}}\dot{\phi}\dot{\psi}+\frac{J_r}{J_{yy}}\dot{\phi}\Omega_e+\frac{l}{J_{yy}}U_3-\frac{k_5}{J_{yy}}\dot{\theta}+d_\theta \\
\ddot{\psi} &= \frac{J_{xx}-J_{yy}}{J_{zz}}\dot{\phi}\dot{\theta}+\frac{1}{J_{zz}}U_4-\frac{k_6}{J_{zz}}\dot{\psi}+d_\psi
\end{aligned}
$$

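where m is the mass, g is the gravitational acceleration, φ, θ, ψ are the roll, pitch, and yaw angles, J_xx, J_yy, J_zz are the moments of inertia about the body axes, J_r is the rotor inertia, Ω_e is the residual rotor speed, l is the arm length, k₁ to k₆ are drag coefficients, U₁ to U₄ are the control inputs (total thrust and torques), and d denotes lumped disturbances, including model uncertainties and external wind.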

**Table 1: Key Parameters of the Quadrotor Model**
Parameter	Symbol	Value	Unit
Mass	m	1.5	kg
Arm length	l	0.225	m
Roll inertia	J_xx	0.03213	kg·m²
Pitch inertia	J_yy	0.03213	kg·m²
Yaw inertia	J_zz	0.06426	kg·m²
Gravity	g	9.81	m/s²
Thrust coefficient	C_T	52.38×10⁻⁶	N/(rad·s)
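
The equations of motion can be evaluated directly once the parameters are fixed. The following is a minimal sketch, not the authors' simulation code: it uses the Table 1 values, while the drag coefficients `K_DRAG`, the rotor inertia `J_R`, and the default `omega_e` are illustrative assumptions, since no values are given for them here.

```python
import math

# Table 1 values.
M, G, L_ARM = 1.5, 9.81, 0.225
JXX, JYY, JZZ = 0.03213, 0.03213, 0.06426
# Assumed values (not given in the text): drag coefficients k_1..k_6, rotor inertia.
K_DRAG = [0.01] * 6
J_R = 1e-4


def quadrotor_accel(state, u, omega_e=0.0, d=(0.0,) * 6):
    """Return the six accelerations from the model equations.

    state = (x, y, z, phi, theta, psi, dx, dy, dz, dphi, dtheta, dpsi)
    u     = (U1, U2, U3, U4)
    d     = lumped disturbances (d_x, d_y, d_z, d_phi, d_theta, d_psi)
    """
    _, _, _, phi, theta, psi, dx, dy, dz, dphi, dtheta, dpsi = state
    u1, u2, u3, u4 = u
    k1, k2, k3, k4, k5, k6 = K_DRAG
    cphi, sphi = math.cos(phi), math.sin(phi)
    cth, sth = math.cos(theta), math.sin(theta)
    cpsi, spsi = math.cos(psi), math.sin(psi)

    ddx = u1 / M * (cphi * sth * cpsi + sphi * spsi) - k1 / M * dx + d[0]
    ddy = u1 / M * (cphi * sth * spsi - sphi * cpsi) - k2 / M * dy + d[1]
    ddz = u1 / M * cphi * cth - G - k3 / M * dz + d[2]
    ddphi = ((JYY - JZZ) / JXX * dtheta * dpsi - J_R / JXX * dtheta * omega_e
             + L_ARM / JXX * u2 - k4 / JXX * dphi + d[3])
    ddtheta = ((JZZ - JXX) / JYY * dphi * dpsi + J_R / JYY * dphi * omega_e
               + L_ARM / JYY * u3 - k5 / JYY * dtheta + d[4])
    ddpsi = ((JXX - JYY) / JZZ * dphi * dtheta + u4 / JZZ
             - k6 / JZZ * dpsi + d[5])
    return ddx, ddy, ddz, ddphi, ddtheta, ddpsi


# Sanity check: level hover with U1 = m * g should give (near) zero acceleration.
hover = quadrotor_accel((0.0,) * 12, (M * G, 0.0, 0.0, 0.0))
```

With the vehicle level and at rest, a thrust of U₁ = mg balances gravity, so every returned acceleration should be zero up to floating-point error.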

2. Control System Design

2.1 Attitude Control using Backstepping Sliding Mode

For attitude tracking, a Backstepping Sliding Mode (BSM) controller is designed. Taking the roll channel as an example, the control law is derived as:

$$ U_2 = \frac{J_{xx}}{l}\left[ \dot{x}_{2d} - c_1\dot{e}_1 - \rho_2\text{sat}(s_2) - \varepsilon_2s_2 + \frac{J_{yy}-J_{zz}}{J_{xx}}\dot{\theta}\dot{\psi} + \frac{J_r}{J_{xx}}\dot{\theta}\Omega_e + d_\phi \right] $$

where c₁, ρ₂, ε₂ are positive constants. The saturation function sat(s) replaces the sign function to suppress chattering. Similar structures apply to pitch and yaw channels, with the complete set of tunable gains denoted as:

$$ \mathbf{c} = [c_1, c_2, c_3]^T, \quad \boldsymbol{\rho} = [\rho_1, \rho_2, \rho_3]^T, \quad \boldsymbol{\varepsilon} = [\varepsilon_1, \varepsilon_2, \varepsilon_3]^T $$
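
To illustrate why replacing the sign function with sat(s) suppresses chattering, the sketch below compares the two near the sliding surface. The boundary-layer width `delta` is a hypothetical parameter; the text does not state the form of sat(s), so a unit-slope clip is assumed.

```python
def sign(s):
    """Discontinuous switching term: -1, 0, or +1."""
    return (s > 0) - (s < 0)


def sat(s, delta=1.0):
    """Saturation: linear for |s| <= delta, clipped to +/-1 outside.

    delta is an assumed boundary-layer width, not a value from the paper.
    """
    return max(-1.0, min(1.0, s / delta))


# Near s = 0 the sign term jumps between -1 and +1, while sat() scales smoothly.
samples = [-0.02, -0.01, 0.01, 0.02]
switching = [sign(s) for s in samples]          # abrupt jump across zero
smoothed = [sat(s, delta=0.1) for s in samples]  # proportional inside the layer
```

Inside the boundary layer the switching term becomes proportional to s, which removes the high-frequency switching at the cost of a small steady-state band around the surface.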

2.2 Position Control using PID with Feedforward

The outer position loop adopts a cascaded PID structure augmented with desired acceleration feedforward:

$$ u_x = K_{px}e_x + K_{ix}\int_0^t e_x d\tau + K_{dx}\dot{e}_x + \ddot{x}_d $$
$$ u_y = K_{py}e_y + K_{iy}\int_0^t e_y d\tau + K_{dy}\dot{e}_y + \ddot{y}_d $$
$$ u_z = K_{pz}e_z + K_{iz}\int_0^t e_z d\tau + K_{dz}\dot{e}_z + \ddot{z}_d $$

Here K_p, K_i, K_d are the proportional, integral, and derivative gains for each axis.
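
A discrete-time version of one axis of this law might look as follows. This is a sketch under stated assumptions: rectangular integration, a backward-difference derivative, and placeholder gains and step size, none of which are specified in the text.

```python
class AxisPID:
    """One-axis PID with desired-acceleration feedforward."""

    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = None

    def update(self, error, accel_ff=0.0):
        # Rectangular (forward Euler) integration of the error.
        self.integral += error * self.dt
        # Backward-difference derivative; taken as zero on the first call.
        if self.prev_error is None:
            d_error = 0.0
        else:
            d_error = (error - self.prev_error) / self.dt
        self.prev_error = error
        return (self.kp * error + self.ki * self.integral
                + self.kd * d_error + accel_ff)


# Placeholder gains; in the proposed method these come from the DDPG actor.
pid_x = AxisPID(kp=2.0, ki=0.5, kd=1.0, dt=0.01)
u_first = pid_x.update(error=1.0, accel_ff=0.3)
u_second = pid_x.update(error=0.9, accel_ff=0.3)
```

Because the gains are plain attributes, an adaptive tuner can overwrite `kp`, `ki`, and `kd` between control steps without resetting the integrator state.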

2.3 Tracking Differentiator

A third-order finite-time convergent tracking differentiator is employed to provide smooth derivatives of the desired attitude angles, enhancing the transient response of the attitude controller.

3. Improved DDPG Algorithm for Adaptive Tuning

The core contribution of this work lies in the enhanced DDPG framework, which optimizes a total of 18 controller parameters (9 PID gains for position and 9 BSM gains for attitude). The state space is defined as:

$$ \mathbf{S} = [\mathbf{e}_p, \mathbf{e}_v, \mathbf{e}_a, \mathbf{e}_{\dot{a}}, \mathbf{e}_i] \in \mathbb{R}^{15} $$

where e_p = [e_x, e_y, e_z] are position errors, e_v velocity errors, e_a = [e_φ, e_θ, e_ψ] attitude errors, e_ȧ angular velocity errors, and e_i the integral of position errors. The action space is:

$$ \mathbf{A} = [K_{px},K_{ix},K_{dx},K_{py},K_{iy},K_{dy},K_{pz},K_{iz},K_{dz},c_1,c_2,c_3,\rho_1,\rho_2,\rho_3,\varepsilon_1,\varepsilon_2,\varepsilon_3] \in \mathbb{R}^{18} $$
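
The interface between the agent and the controllers can be sketched as below. It assumes the actor outputs values in [-1, 1] (for example through a tanh output layer) that are rescaled to per-gain bounds; the text specifies neither this scaling nor the bounds, so both are hypothetical.

```python
GAIN_NAMES = [
    "Kpx", "Kix", "Kdx", "Kpy", "Kiy", "Kdy", "Kpz", "Kiz", "Kdz",
    "c1", "c2", "c3", "rho1", "rho2", "rho3", "eps1", "eps2", "eps3",
]


def build_state(e_p, e_v, e_a, e_adot, e_i):
    """Concatenate the five 3-element error groups into the 15-element state."""
    state = [*e_p, *e_v, *e_a, *e_adot, *e_i]
    assert len(state) == 15
    return state


def scale_action(raw, low, high):
    """Map raw actor outputs in [-1, 1] to named gains in [low, high]."""
    assert len(raw) == len(low) == len(high) == len(GAIN_NAMES)
    return {
        name: lo + 0.5 * (a + 1.0) * (hi - lo)
        for name, a, lo, hi in zip(GAIN_NAMES, raw, low, high)
    }


# Hypothetical bounds: every gain limited to [0.1, 10].
state = build_state((0.1, 0.0, -0.2), (0, 0, 0), (0, 0, 0), (0, 0, 0), (0, 0, 0))
gains = scale_action([0.0] * 18, [0.1] * 18, [10.0] * 18)
```

Keeping the lower bounds strictly positive would respect the requirement that the BSM constants be positive.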

3.1 Structured Reward Function

A novel reward function is designed that integrates dynamic integral weighting and velocity-sensitive attitude error penalties:

Velocity-sensitive attitude penalty weight:

$$ w_a = w_{a0}\left(1 + \beta \|\mathbf{e}_v\|^2\right) $$

where β is a sensitivity coefficient. This penalizes attitude deviations more heavily during high-speed maneuvers, ensuring stability.

Dynamic integral weight:

$$ w_i = \begin{cases}
w_{i,\min} + (w_{i,\max} - w_{i,\min}) \cdot \frac{\|\mathbf{e}_i\|}{\kappa}, & \|\mathbf{e}_i\| < \kappa \\
w_{i,\max}, & \|\mathbf{e}_i\| \geq \kappa
\end{cases} $$

where κ is a threshold. This prevents integral windup while providing adaptive penalization of steady-state errors.

Total reward at each time step:

$$ R_t = R_p + R_v + R_a + R_u + R_i + R_s $$

where R_p = −w_p∥e_p∥², R_v = −w_v∥e_v∥², R_a = −w_a∥e_a∥², R_u = −w_u∥u∥², R_i = −w_i∥e_i∥², and R_s provides a terminal reward for success or failure.

**Table 2: Reward Function Weight Parameters**
Parameter	Description	Value
w_p	Position error weight	2.0
w_v	Velocity error weight	0.5
w_a0	Base attitude error weight	1.5
β	Velocity sensitivity factor	0.1
w_u	Control effort weight	0.01
w_i,min	Min integral weight	0.2
w_i,max	Max integral weight	1.0
κ	Integral threshold	0.5
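
The reward of Section 3.1 can be written out with the Table 2 weights as follows. This is a minimal sketch: the terminal term R_s is passed in as a plain number because its values are not given, and the function names are illustrative.

```python
import math

# Table 2 weights.
W_P, W_V, W_A0, BETA, W_U = 2.0, 0.5, 1.5, 0.1, 0.01
W_I_MIN, W_I_MAX, KAPPA = 0.2, 1.0, 0.5


def sq_norm(v):
    """Squared Euclidean norm of a sequence."""
    return sum(c * c for c in v)


def attitude_weight(e_v):
    """w_a = w_a0 * (1 + beta * ||e_v||^2)."""
    return W_A0 * (1.0 + BETA * sq_norm(e_v))


def integral_weight(e_i):
    """Linear ramp from w_i,min to w_i,max, saturating at ||e_i|| >= kappa."""
    norm = math.sqrt(sq_norm(e_i))
    if norm >= KAPPA:
        return W_I_MAX
    return W_I_MIN + (W_I_MAX - W_I_MIN) * norm / KAPPA


def step_reward(e_p, e_v, e_a, u, e_i, r_s=0.0):
    """R_t = R_p + R_v + R_a + R_u + R_i + R_s."""
    r_p = -W_P * sq_norm(e_p)
    r_v = -W_V * sq_norm(e_v)
    r_a = -attitude_weight(e_v) * sq_norm(e_a)
    r_u = -W_U * sq_norm(u)
    r_i = -integral_weight(e_i) * sq_norm(e_i)
    return r_p + r_v + r_a + r_u + r_i + r_s


zero = (0.0, 0.0, 0.0)
r_slow = step_reward(zero, zero, (0.1, 0.0, 0.0), zero, zero)
r_fast = step_reward(zero, (2.0, 0.0, 0.0), (0.1, 0.0, 0.0), zero, zero)
```

For the same 0.1 rad roll error, `r_fast` is lower than `r_slow` both because of the velocity term and because the attitude weight rises from 1.5 to 2.1 at ‖e_v‖ = 2.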

3.2 Hybrid Prioritized Experience Replay (HPR)

To improve sample efficiency, we propose the Hybrid Prioritized Replay (HPR) mechanism, combining temporal-difference (TD) error and trajectory sparsity:

$$ P_i = \lambda \frac{|\delta_i|}{\sum_j |\delta_j|} + (1-\lambda) \frac{R_i}{\sum_j R_j} $$

where δ_i is the TD error of the i-th experience, R_i is the trajectory sparsity score (e.g., inverse visitation frequency; not to be confused with the integral reward term R_i of Section 3.1), and λ ∈ [0,1] balances the two criteria. As a result, both samples with a large TD error and rare yet informative samples are replayed more frequently.
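
The priority formula and the resulting sampling step can be sketched as follows. The sketch assumes non-negative sparsity scores with non-zero sums and samples with replacement; the function names and the way sparsity scores are obtained are illustrative, not taken from the paper.

```python
import random


def hpr_priorities(td_errors, sparsity, lam=0.2):
    """P_i = lam * |delta_i| / sum|delta| + (1 - lam) * R_i / sum(R).

    Assumes sum(|delta|) > 0 and sum(R) > 0; the result sums to 1.
    """
    abs_td = [abs(d) for d in td_errors]
    td_sum, sp_sum = sum(abs_td), sum(sparsity)
    return [lam * a / td_sum + (1.0 - lam) * r / sp_sum
            for a, r in zip(abs_td, sparsity)]


def sample_indices(priorities, batch_size, rng=random):
    """Draw buffer indices with probability proportional to priority."""
    return rng.choices(range(len(priorities)), weights=priorities, k=batch_size)


# Two experiences with equal sparsity but different TD errors.
probs = hpr_priorities(td_errors=[1.0, -3.0], sparsity=[1.0, 1.0], lam=0.2)
batch = sample_indices(probs, batch_size=4, rng=random.Random(0))
```

With λ = 0.2 the sparsity term dominates, so the threefold difference in TD error shifts the probabilities only from 0.5/0.5 to 0.45/0.55.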

3.3 Training Procedure

The Actor-Critic networks are trained with the HPR-sampled mini-batches. The critic loss is:

$$ L(\omega) = \frac{1}{B}\sum_{i=1}^B \left( y_i - Q(s_i,a_i;\omega) \right)^2 $$

where the TD target, with the next action a_i' = π'(s_i'; θ') taken from the target actor, is:

$$ y_i = r_i + \gamma Q'(s_i', a_i'; \omega') $$

The actor is updated via gradient ascent:

$$ \nabla_\theta J \approx \frac{1}{B}\sum_{i=1}^B \nabla_a Q(s_i,a;\omega)\big|_{a=\pi(s_i;\theta)} \nabla_\theta \pi(s_i;\theta) $$

Target networks are softly updated: θ′ ← τθ + (1−τ)θ′, ω′ ← τω + (1−τ)ω′ with τ = 0.001.
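
The three update rules above reduce to a few lines once the network outputs are available. The sketch below works on plain lists of numbers in place of real network parameters and Q-values, which is an illustrative simplification.

```python
def td_target(reward, q_next, gamma=0.99):
    """y = r + gamma * Q'(s', a'), with q_next supplied by the target critic."""
    return reward + gamma * q_next


def critic_loss(targets, q_values):
    """Mean squared TD error over a mini-batch."""
    return sum((y - q) ** 2 for y, q in zip(targets, q_values)) / len(targets)


def soft_update(target_params, online_params, tau=0.001):
    """Polyak averaging: target <- tau * online + (1 - tau) * target."""
    return [tau * w + (1.0 - tau) * w_t
            for w, w_t in zip(online_params, target_params)]


y = td_target(reward=1.0, q_next=2.0)            # 1.0 + 0.99 * 2.0
loss = critic_loss([y, 0.0], [2.0, 0.0])
new_target = soft_update([0.0, 1.0], [1.0, 1.0])
```

With τ = 0.001 a target parameter moves only 0.1% of the way toward its online counterpart per update, which keeps the TD target slowly varying.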

To encourage exploration, Ornstein-Uhlenbeck noise is added to the actions:

$$ \tilde{a}_t = \pi(s_t;\theta) + \mathcal{N}_t, \quad \mathcal{N}_t \leftarrow \varphi\mathcal{N}_{t-1} + \mathcal{N}(0,\sigma^2) $$
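
The noise recursion can be generated as shown below. This is a sketch of the recursion exactly as written above; it treats σ as the standard deviation of the Gaussian term, which is an assumption, and the seed and dimensions are illustrative.

```python
import random


def noise_sequence(steps, dim, phi=0.15, sigma=0.3, seed=0):
    """Generate N_t = phi * N_{t-1} + Normal(0, sigma^2), starting from zero.

    sigma is used as a standard deviation here (an assumption).
    """
    rng = random.Random(seed)
    noise = [0.0] * dim
    history = []
    for _ in range(steps):
        noise = [phi * n + rng.gauss(0.0, sigma) for n in noise]
        history.append(noise)
    return history


# One noise vector per control step for the 18-dimensional action.
exploration = noise_sequence(steps=5, dim=18)
quiet = noise_sequence(steps=3, dim=2, sigma=0.0)
```

The retention factor φ makes consecutive noise samples correlated, so exploration perturbs the gains smoothly instead of jittering them at every step.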

**Table 3: Training Hyperparameters of the Improved DDPG**
Hyperparameter	Value
Discount factor γ	0.99
Actor learning rate	0.001
Critic learning rate	0.0001
HPR balance factor λ	0.2
Replay buffer capacity	100,000
Batch size	64
Soft update rate τ	0.001
Noise variance σ	0.3
Noise retention φ	0.15

4. Simulation Results and Analysis

All simulations are conducted in MATLAB R2022b. Three benchmark experiments—hovering, spiral trajectory tracking, and disturbance rejection—validate the proposed method against the standard DDPG.

4.1 Hovering Experiment

The quadrotor starts at (0,0,0) m and must reach and hold (0.5,0.5,0.8) m with a yaw angle of 0.7 rad. Figure A (not shown) compares the average episode reward during training: the improved DDPG converges 21.4% faster (by episode 458), achieves an 8.33% higher final reward (1978), and shows 29.9% faster noise decay. The position response (Figure B) shows that the improved DDPG eliminates overshoot in the x-channel and reduces y-channel overshoot from 4.46% to near zero, while shortening the settling time by 1.2 s.

4.2 Spiral Trajectory Tracking

The reference trajectory is:

$$ x_d = (1+0.1t)\cos(0.5t), \quad y_d = (4+0.1t)\sin(0.5t), \quad z_d = 0.1t+0.2, \quad \psi_d = 1 $$
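
The reference can be sampled with a small helper, sketched below. The function name and the sampling step are illustrative choices.

```python
import math


def spiral_reference(t):
    """Return (x_d, y_d, z_d, psi_d) of the spiral reference at time t in seconds."""
    x_d = (1.0 + 0.1 * t) * math.cos(0.5 * t)
    y_d = (4.0 + 0.1 * t) * math.sin(0.5 * t)
    z_d = 0.1 * t + 0.2
    psi_d = 1.0
    return x_d, y_d, z_d, psi_d


# Sample the first 40 s at an assumed 0.1 s step.
trajectory = [spiral_reference(0.1 * k) for k in range(401)]
start = trajectory[0]
```

Both radii grow by 0.1 m per second while the angular rate stays at 0.5 rad/s, so the required speed increases along the path; this is the regime in which the velocity-sensitive attitude penalty is most active.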

Three-dimensional tracking (Figure C) shows that the improved DDPG closely follows the desired path, whereas the standard DDPG exhibits visible deviation, especially in the fast-expanding turns. The attitude response (Figure D) reveals that the velocity-sensitive penalty reduces roll/pitch tracking error by up to 27.8% during high-speed segments (e.g., 31–33 s). The position error integral (Figure E) demonstrates that the dynamic integral weight reduces accumulated errors in x and y channels by 11.8% and 10.3%, respectively.

4.3 Disturbance Rejection

Two types of disturbances are applied:

Gyroscope noise: Gaussian white noise with σ = 0.15 added to attitude measurements.
Wind gust: Pulse signals of amplitude 0.8 m/s² (x), 0.6 m/s² (y), 0.4 m/s² (z) injected at t = 20 s for 2 s.
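
The two disturbances can be injected into a simulation along these lines. The sketch assumes the pulse is active on the half-open interval [20 s, 22 s) and treats σ = 0.15 as a standard deviation; the function names are hypothetical.

```python
import random


def wind_gust(t, start=20.0, duration=2.0, amplitude=(0.8, 0.6, 0.4)):
    """Return the (x, y, z) disturbance accelerations in m/s^2 at time t."""
    if start <= t < start + duration:
        return amplitude
    return (0.0, 0.0, 0.0)


def noisy_attitude(angles, sigma=0.15, rng=random):
    """Add zero-mean Gaussian noise to the measured (phi, theta, psi)."""
    return tuple(a + rng.gauss(0.0, sigma) for a in angles)


before = wind_gust(19.9)
during = wind_gust(20.5)
after = wind_gust(22.0)
measured = noisy_attitude((0.0, 0.0, 1.0), rng=random.Random(0))
```

The gust values map onto d_x, d_y, d_z of the dynamic model, while the noisy angles replace the true attitude in the controller's feedback path.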

**Table 4: Disturbance Rejection Performance Comparison**
Disturbance Type	Metric	Standard DDPG	Improved DDPG	Change (%)
Gyroscope noise (σ²=0.15)	RMSE (rad)	0.0198	0.0143	-27.78
Gyroscope noise (σ²=0.15)	Mean amplitude (rad)	0.034	0.021	-38.23
Gyroscope noise (σ²=0.15)	Mean frequency (Hz)	5.53	3.35	-39.42
Wind gust pulse	Max position deviation (m)	0.75	0.37	-50.66
Wind gust pulse	Recovery time (s)	13.6	6.26	-53.97

These results confirm that the trained policy from the improved DDPG is significantly more robust to both sensor noise and external wind gusts. The dynamic weight on attitude errors ensures tight attitude control even under noise, while the adaptive integral action aids quick recovery after pulse disturbances.

5. Conclusion

This study presents an enhanced DDPG-based adaptive control framework for quadrotor UAVs. A tailored reward function with velocity-sensitive attitude penalties and dynamic integral weights, together with the Hybrid Prioritized Replay mechanism that balances TD error and trajectory sparsity, improves both training efficiency and final control performance over the standard DDPG. Simulation results across hovering, high-dynamic spiral tracking, and disturbance rejection tasks consistently show faster convergence, higher tracking accuracy (up to 27.8% reduction in attitude error), and stronger robustness (up to 53.97% reduction in recovery time). These results are a step toward real-time adaptive control in challenging operational conditions. Future work will focus on sim-to-real transfer to validate the approach on physical platforms.