Precise autonomous control is central to mission success for drones operating in complex environments. Traditional quadrotor controllers often rely on manual parameter tuning, which is labor-intensive and yields fixed parameters that cannot adapt to changing conditions, leading to suboptimal performance. This study proposes an enhanced Deep Deterministic Policy Gradient (DDPG) strategy that adaptively optimizes critical controller parameters. By combining a novel reward function with a hybrid prioritized experience replay mechanism, the proposed method improves training efficiency, tracking accuracy, and robustness against external disturbances relative to the standard DDPG.
1. Quadrotor Dynamic Model
To develop a robust control system, we first establish a comprehensive dynamic model of the Quadrotor Unmanned Aerial Vehicle (UAV). Assuming structural symmetry and negligible aerodynamic effects at low altitudes, the Newton-Euler formalism yields the following equations of motion for position and attitude:
$$
\begin{aligned}
\ddot{x} &= \frac{U_1}{m}(\cos\phi\sin\theta\cos\psi+\sin\phi\sin\psi)-\frac{k_1}{m}\dot{x}+d_x \\
\ddot{y} &= \frac{U_1}{m}(\cos\phi\sin\theta\sin\psi-\sin\phi\cos\psi)-\frac{k_2}{m}\dot{y}+d_y \\
\ddot{z} &= \frac{U_1}{m}\cos\phi\cos\theta-g-\frac{k_3}{m}\dot{z}+d_z \\
\ddot{\phi} &= \frac{J_{yy}-J_{zz}}{J_{xx}}\dot{\theta}\dot{\psi}-\frac{J_r}{J_{xx}}\dot{\theta}\Omega_e+\frac{l}{J_{xx}}U_2-\frac{k_4}{J_{xx}}\dot{\phi}+d_\phi \\
\ddot{\theta} &= \frac{J_{zz}-J_{xx}}{J_{yy}}\dot{\phi}\dot{\psi}+\frac{J_r}{J_{yy}}\dot{\phi}\Omega_e+\frac{l}{J_{yy}}U_3-\frac{k_5}{J_{yy}}\dot{\theta}+d_\theta \\
\ddot{\psi} &= \frac{J_{xx}-J_{yy}}{J_{zz}}\dot{\phi}\dot{\theta}+\frac{1}{J_{zz}}U_4-\frac{k_6}{J_{zz}}\dot{\psi}+d_\psi
\end{aligned}
$$
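where m is the mass, g is the gravitational acceleration, Jxx, Jyy, Jzz are the moments of inertia about the body axes, l is the arm length, φ, θ, ψ are the roll, pitch, and yaw angles, k1 to k6 are drag coefficients, Jr is the rotor inertia, Ωe is the residual rotor speed, U1 to U4 are the control inputs (total thrust and torques), and d denotes the lumped disturbances, including model uncertainties and external wind.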
| Parameter | Symbol | Value | Unit |
|---|---|---|---|
| Mass | m | 1.5 | kg |
| Arm length | l | 0.225 | m |
| Roll inertia | Jxx | 0.03213 | kg·m² |
| Pitch inertia | Jyy | 0.03213 | kg·m² |
| Yaw inertia | Jzz | 0.06426 | kg·m² |
| Gravity | g | 9.81 | m/s² |
| Thrust coefficient | CT | 52.38×10⁻⁶ | N/(rad·s) |
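The equations of motion can be evaluated directly from the table values. The sketch below is only illustrative: the drag coefficients k1 to k6, the rotor inertia Jr, and the residual rotor speed Ωe are not given in the text, so they default to zero here, and the function name `quadrotor_accel` is hypothetical.

```python
import math

# Values from the parameter table above.
M, L_ARM, G = 1.5, 0.225, 9.81
JXX, JYY, JZZ = 0.03213, 0.03213, 0.06426


def quadrotor_accel(vel, angles, rates, U, k=(0.0,) * 6, J_r=0.0,
                    omega_e=0.0, d=(0.0,) * 6):
    """Return (x_dd, y_dd, z_dd, phi_dd, theta_dd, psi_dd).

    k, J_r and omega_e default to zero because the text gives no values
    (an assumption made for illustration only).
    """
    x_dot, y_dot, z_dot = vel
    phi, theta, psi = angles
    phi_dot, theta_dot, psi_dot = rates
    U1, U2, U3, U4 = U

    x_dd = (U1 / M * (math.cos(phi) * math.sin(theta) * math.cos(psi)
                      + math.sin(phi) * math.sin(psi))
            - k[0] / M * x_dot + d[0])
    y_dd = (U1 / M * (math.cos(phi) * math.sin(theta) * math.sin(psi)
                      - math.sin(phi) * math.cos(psi))
            - k[1] / M * y_dot + d[1])
    z_dd = (U1 / M * math.cos(phi) * math.cos(theta) - G
            - k[2] / M * z_dot + d[2])
    phi_dd = ((JYY - JZZ) / JXX * theta_dot * psi_dot
              - J_r / JXX * theta_dot * omega_e
              + L_ARM / JXX * U2 - k[3] / JXX * phi_dot + d[3])
    theta_dd = ((JZZ - JXX) / JYY * phi_dot * psi_dot
                + J_r / JYY * phi_dot * omega_e
                + L_ARM / JYY * U3 - k[4] / JYY * theta_dot + d[4])
    psi_dd = ((JXX - JYY) / JZZ * phi_dot * theta_dot
              + U4 / JZZ - k[5] / JZZ * psi_dot + d[5])
    return x_dd, y_dd, z_dd, phi_dd, theta_dd, psi_dd
```

With level attitude and U1 = m·g, all six accelerations vanish, which is a quick sanity check on the hover equilibrium.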
2. Control System Design
2.1 Attitude Control using Backstepping Sliding Mode
For attitude tracking, a Backstepping Sliding Mode (BSM) controller is designed. Taking the roll channel as an example, the control law is derived as:
$$ U_2 = \frac{J_{xx}}{l}\left[ \dot{x}_{2d} - c_1\dot{e}_1 - \rho_2\text{sat}(s_2) - \varepsilon_2s_2 + \frac{J_{yy}-J_{zz}}{J_{xx}}\dot{\theta}\dot{\psi} + \frac{J_r}{J_{xx}}\dot{\theta}\Omega_e + d_\phi \right] $$
where c1, ρ2, ε2 are positive constants. The saturation function sat(s) replaces the sign function to suppress chattering. Similar structures apply to pitch and yaw channels, with the complete set of tunable gains denoted as:
$$ \mathbf{c} = [c_1, c_2, c_3]^T, \quad \boldsymbol{\rho} = [\rho_1, \rho_2, \rho_3]^T, \quad \boldsymbol{\varepsilon} = [\varepsilon_1, \varepsilon_2, \varepsilon_3]^T $$
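To illustrate how the saturation function limits chattering, the sketch below implements a common boundary-layer form of sat(s) together with the two reaching terms of the control law, ρ·sat(s) + ε·s. The boundary-layer width `delta` and the function names are assumptions; the text does not specify them.

```python
def sat(s, delta=1.0):
    """Boundary-layer saturation: linear inside |s| <= delta, +/-1 outside.

    delta is a hypothetical boundary-layer width (not given in the text).
    """
    if s > delta:
        return 1.0
    if s < -delta:
        return -1.0
    return s / delta


def switching_term(s, rho, eps, delta=1.0):
    """Reaching terms rho*sat(s) + eps*s that appear in the control law."""
    return rho * sat(s, delta) + eps * s
```

Unlike the sign function, `sat` is continuous through s = 0, so the control signal does not jump when the sliding variable crosses the surface.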
2.2 Position Control using PID with Feedforward
The outer position loop adopts a cascaded PID structure augmented with desired acceleration feedforward:
$$ u_x = K_{px}e_x + K_{ix}\int_0^t e_x d\tau + K_{dx}\dot{e}_x + \ddot{x}_d $$
$$ u_y = K_{py}e_y + K_{iy}\int_0^t e_y d\tau + K_{dy}\dot{e}_y + \ddot{y}_d $$
$$ u_z = K_{pz}e_z + K_{iz}\int_0^t e_z d\tau + K_{dz}\dot{e}_z + \ddot{z}_d $$
Here Kp, Ki, Kd are the proportional, integral, and derivative gains for each axis.
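A discrete-time version of one axis of this law might look as follows. This is a minimal sketch that assumes rectangular integration and a backward-difference derivative; the class name and the gain values used in the example are hypothetical.

```python
class PidFeedforward:
    """Single-axis PID with desired-acceleration feedforward."""

    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0      # running integral of the error
        self.prev_error = None   # no derivative on the first sample

    def update(self, error, accel_ff, dt):
        # Rectangular (forward Euler) integration of the error.
        self.integral += error * dt
        # Backward-difference derivative; zero on the first call.
        if self.prev_error is None:
            derivative = 0.0
        else:
            derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return (self.kp * error + self.ki * self.integral
                + self.kd * derivative + accel_ff)


# Example with illustrative gains (not values from the study).
pid_x = PidFeedforward(kp=2.0, ki=0.5, kd=1.0)
u_first = pid_x.update(error=1.0, accel_ff=0.3, dt=0.1)
u_second = pid_x.update(error=0.5, accel_ff=0.0, dt=0.1)
```

In the adaptive scheme, the three gains would be overwritten at each step by the values the agent outputs rather than fixed at construction.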
2.3 Tracking Differentiator
A third-order finite-time convergent tracking differentiator is employed to provide smooth derivatives of the desired attitude angles, enhancing the transient response of the attitude controller.
3. Improved DDPG Algorithm for Adaptive Tuning
The core contribution of this work lies in the enhanced DDPG framework, which optimizes a total of 18 controller parameters (9 PID gains for position and 9 BSM gains for attitude, as listed in the action space below). The state space is defined as:
$$ \mathbf{S} = [\mathbf{e}_p, \mathbf{e}_v, \mathbf{e}_a, \mathbf{e}_{\dot{a}}, \mathbf{e}_i] \in \mathbb{R}^{15} $$
where ep = [ex, ey, ez] are position errors, ev velocity errors, ea = [eφ, eθ, eψ] attitude errors, eȧ angular velocity errors, and ei the integral of position errors. The action space is:
$$ \mathbf{A} = [K_{px},K_{ix},K_{dx},K_{py},K_{iy},K_{dy},K_{pz},K_{iz},K_{dz},c_1,c_2,c_3,\rho_1,\rho_2,\rho_3,\varepsilon_1,\varepsilon_2,\varepsilon_3] \in \mathbb{R}^{18} $$
3.1 Structured Reward Function
A novel reward function is designed that integrates dynamic integral weighting and velocity-sensitive attitude error penalties:
Velocity-sensitive attitude penalty weight:
$$ w_a = w_{a0}\left(1 + \beta \|\mathbf{e}_v\|^2\right) $$
where β is a sensitivity coefficient. This penalizes attitude deviations more heavily during high-speed maneuvers, ensuring stability.
Dynamic integral weight:
$$ w_i = \begin{cases}
w_{i,\min} + (w_{i,\max} - w_{i,\min}) \cdot \frac{\|\mathbf{e}_i\|}{\kappa}, & \|\mathbf{e}_i\| < \kappa \\
w_{i,\max}, & \|\mathbf{e}_i\| \geq \kappa
\end{cases} $$
where κ is a threshold. This prevents integral windup while providing adaptive penalization of steady-state errors.
Total reward at each time step:
$$ R_t = R_p + R_v + R_a + R_u + R_i + R_s $$
where Rp = −wp∥ep∥², Rv = −wv∥ev∥², Ra = −wa∥ea∥², Ru = −wu∥u∥², Ri = −wi∥ei∥², and Rs provides a terminal reward for success or failure.
| Parameter | Description | Value |
|---|---|---|
| wp | Position error weight | 2.0 |
| wv | Velocity error weight | 0.5 |
| wa0 | Base attitude error weight | 1.5 |
| β | Velocity sensitivity factor | 0.1 |
| wu | Control effort weight | 0.01 |
| wi,min | Min integral weight | 0.2 |
| wi,max | Max integral weight | 1.0 |
| κ | Integral threshold | 0.5 |
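The reward terms can be computed as in the sketch below, using the weights from the table. The terminal term Rs is not specified numerically in the text, so it is passed in as an argument; the function names are hypothetical.

```python
import math

# Weights from the table above.
W_P, W_V, W_A0, BETA, W_U = 2.0, 0.5, 1.5, 0.1, 0.01
W_I_MIN, W_I_MAX, KAPPA = 0.2, 1.0, 0.5


def sq_norm(v):
    """Squared Euclidean norm of a sequence."""
    return sum(x * x for x in v)


def attitude_weight(e_v):
    """Velocity-sensitive weight w_a = w_a0 * (1 + beta * ||e_v||^2)."""
    return W_A0 * (1.0 + BETA * sq_norm(e_v))


def integral_weight(e_i):
    """Dynamic integral weight: linear ramp below kappa, capped above it."""
    norm = math.sqrt(sq_norm(e_i))
    if norm >= KAPPA:
        return W_I_MAX
    return W_I_MIN + (W_I_MAX - W_I_MIN) * norm / KAPPA


def step_reward(e_p, e_v, e_a, u, e_i, r_s=0.0):
    """R_t = R_p + R_v + R_a + R_u + R_i + R_s (r_s supplied by the caller)."""
    return (-W_P * sq_norm(e_p)
            - W_V * sq_norm(e_v)
            - attitude_weight(e_v) * sq_norm(e_a)
            - W_U * sq_norm(u)
            - integral_weight(e_i) * sq_norm(e_i)
            + r_s)
```

For example, a velocity error of 2 m/s along one axis raises the attitude weight from 1.5 to 2.1, and an integral error norm of 0.25 gives an integral weight of 0.6.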
3.2 Hybrid Prioritized Experience Replay (HPR)
To improve sample efficiency, we propose the Hybrid Prioritized Replay (HPR) mechanism, combining temporal-difference (TD) error and trajectory sparsity:
$$ P_i = \lambda \frac{|\delta_i|}{\sum_j |\delta_j|} + (1-\lambda) \frac{R_i}{\sum_j R_j} $$
where δi is the TD error of the i-th experience, Ri is its trajectory sparsity score (e.g., inverse visitation frequency; not to be confused with the integral reward term Ri of Section 3.1), and λ ∈ [0,1] balances the two criteria. This ensures that both high-value samples (large TD error) and rare yet informative samples are replayed more frequently.
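The priority formula and the resulting sampling step can be sketched as follows. Sampling with replacement via `random.choices` is an assumption, as are the function names; the text does not describe how indices are drawn from the priorities.

```python
import random


def hybrid_priorities(td_errors, sparsity, lam=0.2):
    """P_i = lam * |delta_i| / sum|delta| + (1 - lam) * R_i / sum R."""
    abs_td = [abs(delta) for delta in td_errors]
    td_sum = sum(abs_td)
    sp_sum = sum(sparsity)
    if td_sum == 0 or sp_sum == 0:
        raise ValueError("TD errors and sparsity scores must not all be zero")
    return [lam * a / td_sum + (1.0 - lam) * s / sp_sum
            for a, s in zip(abs_td, sparsity)]


def sample_indices(priorities, batch_size, rng=random):
    """Draw buffer indices in proportion to their priority (with replacement)."""
    return rng.choices(range(len(priorities)), weights=priorities, k=batch_size)


# Two experiences: one with a large TD error, one that is rarely visited.
p = hybrid_priorities([1.0, -3.0], [3.0, 1.0], lam=0.5)
batch = sample_indices(p, batch_size=4, rng=random.Random(0))
```

Because each of the two terms is normalized separately, the priorities always sum to one for any λ in [0, 1].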
3.3 Training Procedure
The Actor-Critic networks are trained with the HPR-sampled mini-batches. The critic loss is:
$$ L(\omega) = \frac{1}{B}\sum_{i=1}^B \left( y_i - Q(s_i,a_i;\omega) \right)^2 $$
where the TD target is:
$$ y_i = r_i + \gamma Q'(s_i', a_i'; \omega'), \quad a_i' = \pi'(s_i'; \theta') $$
The actor is updated via gradient ascent:
$$ \nabla_\theta J \approx \frac{1}{B}\sum_{i=1}^B \nabla_a Q(s_i,a;\omega)\big|_{a=\pi(s_i;\theta)} \nabla_\theta \pi(s_i;\theta) $$
Target networks are softly updated: θ' ← τθ + (1−τ)θ', ω' ← τω + (1−τ)ω' with τ = 0.001.
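The arithmetic of the TD target, the critic loss, and the soft update can be shown without a deep learning framework. In the sketch below, network parameters are represented as flat lists of floats, which is a simplification for illustration; the function names are hypothetical.

```python
def td_targets(rewards, next_q_values, gamma=0.99):
    """y_i = r_i + gamma * Q'(s_i', a_i'), with Q' values precomputed."""
    return [r + gamma * q for r, q in zip(rewards, next_q_values)]


def critic_loss(targets, q_values):
    """Mean squared TD error over the mini-batch."""
    return sum((y - q) ** 2 for y, q in zip(targets, q_values)) / len(targets)


def soft_update(target_params, online_params, tau=0.001):
    """theta' <- tau * theta + (1 - tau) * theta' for each parameter."""
    return [tau * w + (1.0 - tau) * w_t
            for w, w_t in zip(online_params, target_params)]
```

With τ = 0.001, each update moves a target parameter only 0.1% of the way toward its online counterpart, which keeps the TD targets slowly varying.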
To encourage exploration, Ornstein-Uhlenbeck noise is added to the actions:
$$ \tilde{a}_t = \pi(s_t;\theta) + \mathcal{N}_t, \quad \mathcal{N}_t \leftarrow \varphi\mathcal{N}_{t-1} + \mathcal{N}(0,\sigma^2) $$
| Hyperparameter | Value |
|---|---|
| Discount factor γ | 0.99 |
| Actor learning rate | 0.001 |
| Critic learning rate | 0.0001 |
| HPR balance factor λ | 0.2 |
| Replay buffer capacity | 100,000 |
| Batch size | 64 |
| Soft update rate τ | 0.001 |
| Noise standard deviation σ | 0.3 |
| Noise retention φ | 0.15 |
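The exploration noise recurrence can be sketched as below. Here σ is taken as the standard deviation of the Gaussian increment, consistent with the N(0, σ²) term in the recurrence; the class name and the zero initial state are assumptions.

```python
import random


class ExplorationNoise:
    """Noise process N_t = phi * N_{t-1} + Normal(0, sigma^2), per dimension."""

    def __init__(self, dim, phi=0.15, sigma=0.3, seed=None):
        self.phi = phi
        self.sigma = sigma                 # standard deviation of the increment
        self.state = [0.0] * dim           # assumed zero initial noise
        self.rng = random.Random(seed)

    def sample(self):
        self.state = [self.phi * n + self.rng.gauss(0.0, self.sigma)
                      for n in self.state]
        return list(self.state)


# One noise vector for the 18-dimensional action.
noise = ExplorationNoise(dim=18, seed=0)
perturbation = noise.sample()
```

The perturbation would be added element-wise to the actor output before the gains are applied to the controller.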
4. Simulation Results and Analysis
All simulations are conducted in MATLAB R2022b. Three benchmark experiments—hovering, spiral trajectory tracking, and disturbance rejection—validate the proposed method against the standard DDPG.
4.1 Hovering Experiment
The quadrotor starts at (0,0,0) m and must reach and hold (0.5,0.5,0.8) m with a yaw angle of 0.7 rad. Figure A (not shown) compares the average episode reward during training. The improved DDPG converges 21.4% faster (by episode 458) and achieves an 8.33% higher final reward (1978), with 29.9% faster noise decay. The position response (Figure B) shows that the improved DDPG eliminates overshoot in the x-channel and reduces y-channel overshoot from 4.46% to near zero, while the settling time is shortened by 1.2 s.
4.2 Spiral Trajectory Tracking
The reference trajectory is:
$$ x_d = (1+0.1t)\cos(0.5t), \quad y_d = (4+0.1t)\sin(0.5t), \quad z_d = 0.1t+0.2, \quad \psi_d = 1 $$
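For reference, the trajectory can be generated with a small helper such as the one below (the function name is hypothetical).

```python
import math


def spiral_reference(t):
    """Return (x_d, y_d, z_d, psi_d) at time t in seconds."""
    x_d = (1.0 + 0.1 * t) * math.cos(0.5 * t)
    y_d = (4.0 + 0.1 * t) * math.sin(0.5 * t)
    z_d = 0.1 * t + 0.2
    psi_d = 1.0
    return x_d, y_d, z_d, psi_d
```

At t = 0 this returns (1.0, 0.0, 0.2, 1.0); the radius terms then grow linearly with time while the altitude climbs at 0.1 m/s.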
Three-dimensional tracking (Figure C) shows that the improved DDPG closely follows the desired path, whereas the standard DDPG exhibits visible deviation, especially in the fast-expanding turns. The attitude response (Figure D) reveals that the velocity-sensitive penalty reduces roll/pitch tracking error by up to 27.8% during high-speed segments (e.g., 31–33 s). The position error integral (Figure E) demonstrates that the dynamic integral weight reduces accumulated errors in x and y channels by 11.8% and 10.3%, respectively.
4.3 Disturbance Rejection
Two types of disturbances are applied:
- Gyroscope noise: Gaussian white noise with σ = 0.15 added to attitude measurements.
- Wind gust: Pulse signals of amplitude 0.8 m/s² (x), 0.6 m/s² (y), 0.4 m/s² (z) injected at t = 20 s for 2 s.
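The two disturbances can be injected into a simulation as sketched below. Treating the pulse as active on the half-open interval [20 s, 22 s) is an assumption, and the function names are hypothetical.

```python
import random

GUST_ACCEL = (0.8, 0.6, 0.4)  # m/s^2 along x, y, z


def wind_gust(t, start=20.0, duration=2.0):
    """Rectangular acceleration pulse, active for start <= t < start + duration."""
    if start <= t < start + duration:
        return GUST_ACCEL
    return (0.0, 0.0, 0.0)


def noisy_attitude(angles, sigma=0.15, rng=random):
    """Add zero-mean Gaussian noise to each measured attitude angle (rad)."""
    return tuple(a + rng.gauss(0.0, sigma) for a in angles)
```

The gust values map onto the disturbance terms d_x, d_y, d_z of the translational dynamics, while the noise corrupts only the attitude measurements fed to the controller.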
| Disturbance Type | Metric | Standard DDPG | Improved DDPG | Change (%) |
|---|---|---|---|---|
| Gyroscope noise (σ²=0.15) | RMSE (rad) | 0.0198 | 0.0143 | -27.78 |
| | Mean amplitude (rad) | 0.034 | 0.021 | -38.23 |
| | Mean frequency (Hz) | 5.53 | 3.35 | -39.42 |
| Wind gust pulse | Max position deviation (m) | 0.75 | 0.37 | -50.66 |
| | Recovery time (s) | 13.6 | 6.26 | -53.97 |
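The Change (%) column follows from the two result columns as a relative change with respect to the standard DDPG. A one-line helper (hypothetical name) reproduces it:

```python
def percent_change(baseline, improved):
    """Relative change of `improved` with respect to `baseline`, in percent."""
    return (improved - baseline) / baseline * 100.0


rmse_change = percent_change(0.0198, 0.0143)      # about -27.78
recovery_change = percent_change(13.6, 6.26)      # about -53.97
```

Applied to the mean amplitude and maximum position deviation rows, the same formula gives about -38.24 and -50.67, so the tabulated -38.23 and -50.66 appear to be truncated rather than rounded.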
These results confirm that the trained policy from the improved DDPG is significantly more robust to both sensor noise and external wind gusts. The dynamic weight on attitude errors ensures tight attitude control even under noise, while the adaptive integral action aids quick recovery after pulse disturbances.
5. Conclusion
This study presents an enhanced DDPG-based adaptive control framework for quadrotor UAVs. By incorporating a tailored reward function with velocity-sensitive attitude penalties and dynamic integral weights, and by introducing the Hybrid Prioritized Replay mechanism that balances TD error and trajectory sparsity, the training efficiency and final control performance are improved over the standard DDPG. Simulation results across hovering, high-dynamic spiral tracking, and disturbance rejection tasks consistently demonstrate faster convergence, higher tracking accuracy (up to 27.8% reduction in attitude error), and superior robustness (up to 53.97% reduction in recovery time). These results support the use of real-time adaptive gain tuning in challenging operating conditions. Future work will focus on sim-to-real transfer to validate the approach on physical platforms.

