Adaptive Trajectory Control for Unmanned Drones Using an Enhanced Deep Deterministic Policy Gradient Strategy

The application of unmanned drones has rapidly expanded from military domains into commercial, industrial, and civilian sectors. Whether for aerial photography, agricultural monitoring, or logistics delivery, the control precision of an unmanned drone directly determines mission success. Especially in complex operational environments, the controller parameters of the unmanned drone must be adaptively adjusted in response to changing tasks and conditions. Therefore, achieving efficient, dynamic tuning of controller parameters is a critical challenge for autonomous unmanned drone control.

Researchers have proposed various adaptive parameter tuning strategies based on classical control methods. These include real-time PID parameter optimization using meta-heuristic algorithms like Particle Swarm Optimization and Differential Evolution, which reduce errors from manual tuning. Methods combining adaptive backstepping with sliding mode control introduce parameter estimation laws to mitigate model uncertainties. While these approaches improve tuning efficiency to some extent, they often suffer from high dependency on model accuracy, insufficient real-time capability, and weak adaptability to varying mission environments, making truly efficient adaptive control difficult.

Deep Reinforcement Learning (DRL), with its ability to learn optimal policies through environmental interaction, offers a promising solution to these challenges. However, standard DRL algorithms themselves face issues such as low sample efficiency, sensitivity to hyperparameters, slow convergence, and instability. To address these limitations while ensuring high control performance for the unmanned drone, this paper proposes an adaptive parameter optimization method based on an enhanced Deep Deterministic Policy Gradient (DDPG) strategy.

System Modeling and Controller Design

The unmanned drone considered is a quadrotor. Let the Earth-fixed frame be denoted by $\{X_e, Y_e, Z_e\}$ and the body-fixed frame by $\{X_b, Y_b, Z_b\}$. The roll ($\phi$), pitch ($\theta$), and yaw ($\psi$) angles are defined as rotations around the $X_b$, $Y_b$, and $Z_b$ axes, respectively. The dynamic model of the unmanned drone, derived using the Newton-Euler formalism under standard assumptions (rigid body, symmetric structure, center of mass coinciding with body frame origin, negligible aerodynamic effects at low speed), is given by:

$$
\begin{aligned}
\ddot{x} &= \frac{U_1}{m}(\cos\phi \sin\theta \cos\psi + \sin\phi \sin\psi) - \frac{k_1}{m}\dot{x} + d_x \\
\ddot{y} &= \frac{U_1}{m}(\cos\phi \sin\theta \sin\psi - \sin\phi \cos\psi) - \frac{k_2}{m}\dot{y} + d_y \\
\ddot{z} &= \frac{U_1}{m}(\cos\phi \cos\theta) - g - \frac{k_3}{m}\dot{z} + d_z \\
\ddot{\phi} &= \dot{\theta}\dot{\psi}\frac{J_y - J_z}{J_x} - \frac{J_r}{J_x}\dot{\theta}\omega_e + \frac{l}{J_x}U_2 - \frac{k_4}{J_x}\dot{\phi} + d_\phi \\
\ddot{\theta} &= \dot{\phi}\dot{\psi}\frac{J_z - J_x}{J_y} + \frac{J_r}{J_y}\dot{\phi}\omega_e + \frac{l}{J_y}U_3 - \frac{k_5}{J_y}\dot{\theta} + d_\theta \\
\ddot{\psi} &= \dot{\phi}\dot{\theta}\frac{J_x - J_y}{J_z} + \frac{l}{J_z}U_4 - \frac{k_6}{J_z}\dot{\psi} + d_\psi
\end{aligned}
$$

Here, $(x, y, z)$ is the position in the inertial frame; $(\phi, \theta, \psi)$ are the Euler angles; $m$ is the mass; $g$ is gravitational acceleration; $J_x, J_y, J_z$ are moments of inertia; $J_r$ is the rotor inertia; $\omega_e = \omega_1 - \omega_2 + \omega_3 - \omega_4$ is the overall rotor speed disturbance; $l$ is the arm length; $k_1$ to $k_6$ are damping coefficients; and $U_1$ to $U_4$ are the control inputs (total thrust and torques). The terms $d_x, d_y, d_z, d_\phi, d_\theta, d_\psi$ represent lumped disturbances encompassing both model uncertainties and external forces acting on the unmanned drone.
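The six model equations can be evaluated directly in code. The following is a minimal sketch, not the authors' simulator: the mass, arm length, and inertias are taken from the parameter table of this paper, while the value of $g$, the rotor inertia `Jr`, and the damping coefficients `k` are placeholder assumptions, since the text does not state them.

```python
import math

# m, l, Jx, Jy, Jz follow the paper's parameter table.
# g, Jr and the damping coefficients k1..k6 are ASSUMED placeholder values.
PARAMS = {
    "m": 1.5, "g": 9.81, "l": 0.225,
    "Jx": 0.03213, "Jy": 0.03213, "Jz": 0.06426,
    "Jr": 1.0e-4,
    "k": [0.01] * 6,
}

def quadrotor_accel(vel, angles, rates, U, omega_e=0.0, d=(0.0,) * 6, p=PARAMS):
    """Return (x'', y'', z'', phi'', theta'', psi'') from the six model equations."""
    xd, yd, zd = vel
    phi, theta, psi = angles
    dphi, dtheta, dpsi = rates
    U1, U2, U3, U4 = U
    m, g, l = p["m"], p["g"], p["l"]
    Jx, Jy, Jz, Jr = p["Jx"], p["Jy"], p["Jz"], p["Jr"]
    k1, k2, k3, k4, k5, k6 = p["k"]
    cphi, sphi = math.cos(phi), math.sin(phi)
    cth, sth = math.cos(theta), math.sin(theta)
    cpsi, spsi = math.cos(psi), math.sin(psi)

    # Translational dynamics in the Earth-fixed frame
    ax = U1 / m * (cphi * sth * cpsi + sphi * spsi) - k1 / m * xd + d[0]
    ay = U1 / m * (cphi * sth * spsi - sphi * cpsi) - k2 / m * yd + d[1]
    az = U1 / m * (cphi * cth) - g - k3 / m * zd + d[2]

    # Rotational dynamics (gyroscopic coupling, rotor effect, torque, damping)
    aphi = (dtheta * dpsi * (Jy - Jz) / Jx - Jr / Jx * dtheta * omega_e
            + l / Jx * U2 - k4 / Jx * dphi + d[3])
    atheta = (dphi * dpsi * (Jz - Jx) / Jy + Jr / Jy * dphi * omega_e
              + l / Jy * U3 - k5 / Jy * dtheta + d[4])
    apsi = dphi * dtheta * (Jx - Jy) / Jz + l / Jz * U4 - k6 / Jz * dpsi + d[5]
    return (ax, ay, az, aphi, atheta, apsi)
```

As a sanity check, level flight at rest with $U_1 = mg$ and zero torques should yield zero acceleration on every channel, which is the hover equilibrium of the model.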

The control system for the unmanned drone employs a hierarchical structure. The inner loop for attitude control is designed using a Backstepping Sliding Mode Control (BSMC) strategy to ensure robustness. Taking the roll channel as an example, the control law is derived as:

$$
\begin{aligned}
U_2 &= \frac{J_x}{l} \Big[ \ddot{\phi}_d + c_1 \dot{e}_\phi - \dot{\theta}\dot{\psi}\frac{J_y - J_z}{J_x} + \frac{J_r}{J_x}\dot{\theta}\omega_e \\
&\quad + \frac{k_4}{J_x}\dot{\phi} - \rho_\phi \cdot \text{sat}(s_\phi) - \epsilon_\phi \cdot e_\phi \Big] - \hat{d}_\phi
\end{aligned}
$$

where $e_\phi = \phi - \phi_d$ is the tracking error, $s_\phi = \dot{e}_\phi + c_1 e_\phi$ is the sliding surface, $\text{sat}(\cdot)$ is a saturation function replacing the sign function to reduce chattering, and $c_1, \rho_\phi, \epsilon_\phi$ are controller gains. Similar controllers $U_3$ and $U_4$ are derived for the pitch and yaw channels of the unmanned drone, respectively.
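To illustrate why the saturation function reduces chattering, the sketch below implements the sliding surface and a common form of $\text{sat}(\cdot)$. The boundary-layer parameter `width` is an assumption introduced here for illustration; the paper writes $\text{sat}(s_\phi)$ without specifying one.

```python
def sliding_surface(e, e_dot, c1):
    """s = e_dot + c1 * e, as defined for the roll channel."""
    return e_dot + c1 * e

def sat(s, width=1.0):
    """Saturation: linear for |s| <= width, clipped to +/-1 outside.

    `width` (boundary-layer thickness) is a hypothetical parameter.
    """
    return max(-1.0, min(1.0, s / width))

def sign(s):
    """Discontinuous switching term that sat() replaces."""
    return (s > 0) - (s < 0)
```

Near the surface, `sign` jumps between -1 and +1 for arbitrarily small changes in `s`, whereas `sat` varies continuously, so the switching term in the control law no longer toggles at high frequency.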

The outer loop for position control of the unmanned drone uses a cascade PID structure enhanced with feedforward compensation from the desired acceleration:

$$
\begin{aligned}
u_x &= K_{px}e_x + K_{ix}\int e_x \,d\tau + K_{dx}\dot{e}_x + \ddot{x}_d \\
u_y &= K_{py}e_y + K_{iy}\int e_y \,d\tau + K_{dy}\dot{e}_y + \ddot{y}_d \\
u_z &= K_{pz}e_z + K_{iz}\int e_z \,d\tau + K_{dz}\dot{e}_z + \ddot{z}_d
\end{aligned}
$$

where $\mathbf{e}_p = [e_x, e_y, e_z]^T = [x_d-x, y_d-y, z_d-z]^T$ is the position error vector for the unmanned drone. The desired attitude angles $\phi_d, \theta_d$ are computed from $u_x, u_y, u_z$, while $\psi_d$ is independently specified.
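The outer loop can be sketched as follows. The `AxisPID` class implements one line of the PID-plus-feedforward law above. The function `desired_attitude` is an assumption: the paper does not give the mapping from $(u_x, u_y, u_z)$ to $(\phi_d, \theta_d)$, so the sketch uses the standard inversion of the translational model with drag and disturbances neglected, and takes $g = 9.81$ as an assumed value.

```python
import math

class AxisPID:
    """PID with desired-acceleration feedforward for one position axis."""

    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0

    def update(self, e, e_dot, acc_d, dt):
        self.integral += e * dt  # rectangular integration of the error
        return self.kp * e + self.ki * self.integral + self.kd * e_dot + acc_d

def desired_attitude(ux, uy, uz, psi_d, m=1.5, g=9.81):
    """ASSUMED inversion of the drag-free model: returns (U1, phi_d, theta_d)."""
    U1 = m * math.sqrt(ux ** 2 + uy ** 2 + (uz + g) ** 2)
    s_phi = m * (ux * math.sin(psi_d) - uy * math.cos(psi_d)) / U1
    s_phi = max(-1.0, min(1.0, s_phi))  # guard asin against rounding
    phi_d = math.asin(s_phi)
    theta_d = math.atan2(ux * math.cos(psi_d) + uy * math.sin(psi_d), uz + g)
    return U1, phi_d, theta_d
```

The inversion follows from the first three model equations: $u_x \sin\psi - u_y \cos\psi = (U_1/m)\sin\phi$ and $u_x \cos\psi + u_y \sin\psi = (U_1/m)\cos\phi \sin\theta$, while $u_z + g = (U_1/m)\cos\phi\cos\theta$.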

Enhanced DDPG for Adaptive Parameter Tuning

The core innovation is using an enhanced DDPG algorithm to dynamically optimize the key parameters of the unmanned drone’s position PID and attitude BSMC controllers. The parameters tuned by the DRL agent form the action space $\mathcal{A}$:

$$
\mathcal{A} = [\mathbf{K}_p, \mathbf{K}_i, \mathbf{K}_d, \boldsymbol{\rho}, \boldsymbol{\epsilon}, \mathbf{c}]
$$

where $\mathbf{K}_p, \mathbf{K}_i, \mathbf{K}_d \in \mathbb{R}^3$ are the PID gains for the position loop, and $\boldsymbol{\rho}, \boldsymbol{\epsilon}, \mathbf{c} \in \mathbb{R}^3$ are the BSMC parameters (reaching rate, chattering suppression, and error gain) for the roll, pitch, and yaw channels of the unmanned drone.
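Since the action is a vector of 18 controller parameters, the actor output has to be mapped onto admissible gain ranges. The sketch below assumes a bounded actor output in $[-1, 1]$ (for example, from a tanh output layer) and linear scaling; the bounds in `BOUNDS` are hypothetical and are not taken from the paper.

```python
# HYPOTHETICAL (low, high) bounds for each parameter group; three values per group.
BOUNDS = {
    "Kp": (0.5, 10.0), "Ki": (0.0, 2.0), "Kd": (0.1, 5.0),
    "rho": (0.1, 5.0), "eps": (0.1, 5.0), "c": (0.5, 10.0),
}
ORDER = ["Kp", "Ki", "Kd", "rho", "eps", "c"]  # same order as the action vector

def scale_action(raw):
    """Map 18 raw actor outputs in [-1, 1] to named gain triples."""
    if len(raw) != 3 * len(ORDER):
        raise ValueError("expected 18 action components")
    gains = {}
    for idx, name in enumerate(ORDER):
        lo, hi = BOUNDS[name]
        chunk = raw[3 * idx: 3 * idx + 3]
        # Clip to [-1, 1] (exploration noise may push outputs outside), then rescale
        gains[name] = [lo + 0.5 * (max(-1.0, min(1.0, a)) + 1.0) * (hi - lo)
                       for a in chunk]
    return gains
```

Bounding the gains in this way keeps every action the agent can emit inside a range for which the underlying PID and BSMC controllers remain well defined, regardless of exploration noise.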

The state space $\mathcal{S}$ comprises the position and attitude tracking errors, their time derivatives, and the integral of the position error:

$$
\mathcal{S} = [\mathbf{e}_p, \dot{\mathbf{e}}_p, \mathbf{e}_a, \dot{\mathbf{e}}_a, \int \mathbf{e}_p]
$$
where $\mathbf{e}_a = [e_\phi, e_\theta, e_\psi]^T$ is the attitude error.
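As a small illustration, the 15-dimensional observation can be assembled from five 3-vectors, with the position-error integral accumulated at each control step. The function names and the rectangular integration rule are assumptions made for this sketch.

```python
def integrate_error(int_e, e, dt):
    """Accumulate the position-error integral with a rectangular rule."""
    return [ie + ei * dt for ie, ei in zip(int_e, e)]

def build_state(e_p, e_p_dot, e_a, e_a_dot, int_e_p):
    """Concatenate [e_p, e_p_dot, e_a, e_a_dot, int e_p] into one flat list."""
    parts = (e_p, e_p_dot, e_a, e_a_dot, int_e_p)
    if any(len(v) != 3 for v in parts):
        raise ValueError("every component must have three entries")
    return [float(x) for v in parts for x in v]
```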

Structured Reward Function

A novel reward function is designed to guide the learning of the unmanned drone’s controller. It combines multiple weighted components:

1. Position & Velocity Error Penalty: $R_p = -w_p \|\mathbf{e}_p\|^2, \quad R_v = -w_v \|\dot{\mathbf{e}}_p\|^2$

2. Velocity-Sensitive Attitude Penalty: To enforce stricter attitude stability during high-speed maneuvers of the unmanned drone, a dynamic weight is used:
$$ w_a = w_{a0} \cdot (1 + \beta \|\dot{\mathbf{e}}_p\|^2) $$
The attitude reward is then: $R_a = -w_a \|\mathbf{e}_a\|^2$.

3. Dynamically Weighted Integral Penalty: To manage steady-state error and prevent integral windup in the unmanned drone’s control, the weight on the integrated error is adjusted dynamically:
$$ w_i = \begin{cases}
w_{i,\text{min}} + (w_{i,\text{max}} - w_{i,\text{min}}) \cdot \frac{\|\int \mathbf{e}_p\|}{\Upsilon}, & \text{if } \|\int \mathbf{e}_p\| < \Upsilon \\
w_{i,\text{max}}, & \text{otherwise}
\end{cases} $$
The integral reward is: $R_i = -w_i \|\int \mathbf{e}_p\|^2$.

4. Control Effort Penalty: $R_u = -w_u \|\mathbf{U}\|^2$, where $\mathbf{U}$ is the vector of control inputs for the unmanned drone.

5. Terminal Reward: A sparse reward $R_s$ is given at the end of an episode: a bonus for task success (tracking error below a threshold $\delta$) or a penalty for failure (error exceeding the safety limits).

The total reward for the unmanned drone at time step $t$ is: $R_t = R_p + R_v + R_a + R_i + R_u + R_s$.
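The five components above can be combined in a single function. This is a sketch under stated assumptions: every weight, threshold, and terminal value in `W` is a hypothetical placeholder (the paper does not list them), and the success and failure conditions are checked at every step for simplicity.

```python
import math

# All numeric values here are HYPOTHETICAL placeholders, not the paper's settings.
W = dict(wp=1.0, wv=0.1, wa0=0.5, beta=0.2, wi_min=0.01, wi_max=0.1,
         upsilon=2.0, wu=1e-3, delta=0.05, e_max=5.0, bonus=10.0, crash=-50.0)

def sq_norm(v):
    return sum(x * x for x in v)

def integral_weight(int_norm, w=W):
    """Piecewise weight: grows linearly with ||int e_p||, capped at wi_max."""
    if int_norm < w["upsilon"]:
        return w["wi_min"] + (w["wi_max"] - w["wi_min"]) * int_norm / w["upsilon"]
    return w["wi_max"]

def reward(e_p, e_p_dot, e_a, int_e_p, U, w=W):
    """R_t = R_p + R_v + R_a + R_i + R_u + R_s."""
    R_p = -w["wp"] * sq_norm(e_p)
    R_v = -w["wv"] * sq_norm(e_p_dot)
    # Attitude weight increases with the squared velocity error
    w_a = w["wa0"] * (1.0 + w["beta"] * sq_norm(e_p_dot))
    R_a = -w_a * sq_norm(e_a)
    int_norm = math.sqrt(sq_norm(int_e_p))
    R_i = -integral_weight(int_norm, w) * int_norm ** 2
    R_u = -w["wu"] * sq_norm(U)
    err = math.sqrt(sq_norm(e_p))
    if err > w["e_max"]:
        R_s = w["crash"]   # failure: error beyond the assumed safety limit
    elif err < w["delta"]:
        R_s = w["bonus"]   # success: error below the threshold delta
    else:
        R_s = 0.0
    return R_p + R_v + R_a + R_i + R_u + R_s
```

Note how the attitude term couples to translational motion: the same attitude error costs more when the velocity error is large, which is the intended effect of the dynamic weight $w_a$.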

Hybrid Prioritized Experience Replay (HPR)

To improve sample efficiency and learning stability during the training of the unmanned drone’s control policy, a Hybrid Prioritized Experience Replay (HPR) mechanism is proposed. The priority $P_i$ for a transition $i$ in the replay buffer is a blend of its Temporal-Difference (TD) error and its trajectory sparsity score:

$$
P_i = \lambda \cdot \frac{|\delta_i|}{\sum_j |\delta_j|} + (1-\lambda) \cdot \frac{R_i}{\sum_j R_j}
$$

where $\delta_i$ is the TD error of transition $i$, $R_i$ is a measure of its sparsity (rarity) in the state-action space (in this expression $R_i$ is unrelated to the integral reward of the structured reward function), and $\lambda \in [0,1]$ is a weighting coefficient. Because both terms are normalized over the buffer, the priorities sum to one, so $P_i$ can be used directly as the sampling probability. Both highly instructive (large TD-error) and rare transitions are therefore replayed more frequently.
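The priority blend and the sampling step can be sketched as follows. The rarity scores are taken as given inputs, because the paper does not specify how sparsity is measured; the uniform fallback for an all-zero term and sampling with replacement are assumptions of this sketch.

```python
import random

def hybrid_priorities(td_errors, rarity, lam=0.2):
    """P_i = lam * |delta_i| / sum|delta| + (1 - lam) * R_i / sum R."""
    abs_td = [abs(d) for d in td_errors]
    n = len(abs_td)
    td_sum, r_sum = sum(abs_td), sum(rarity)
    priorities = []
    for d, r in zip(abs_td, rarity):
        # Fall back to a uniform share if a term has no mass (assumption)
        td_term = d / td_sum if td_sum > 0 else 1.0 / n
        r_term = r / r_sum if r_sum > 0 else 1.0 / n
        priorities.append(lam * td_term + (1.0 - lam) * r_term)
    return priorities

def sample_batch(buffer, priorities, batch_size, rng=random):
    """Draw indices with probability proportional to P_i (with replacement)."""
    idx = rng.choices(range(len(buffer)), weights=priorities, k=batch_size)
    return idx, [buffer[i] for i in idx]
```

With $\lambda = 0.2$, as in the training settings of this paper, the rarity term carries most of the weight, so rarely visited regions of the state-action space stay represented in each batch even when their TD errors are small.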

Algorithm Framework

The training follows an Actor-Critic framework. The Actor network $\mu(s|\theta^\mu)$ maps the state $s$ of the unmanned drone to a deterministic action $a$ (the controller parameters). The Critic network $Q(s,a|\theta^Q)$ estimates the value of taking action $a$ in state $s$. The Actor is updated with gradients given by the deterministic policy gradient theorem, and the Critic by minimizing the TD error; slowly updated target networks are used for stability. Exploration is achieved by adding temporally correlated noise $\mathcal{N}$ to the Actor’s output: $a_t = \mu(s_t|\theta^\mu) + \mathcal{N}_t$.
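Two ingredients of this framework are easy to show in isolation. The sketch below assumes that the temporally correlated noise is an Ornstein-Uhlenbeck process, which is the usual choice in DDPG but is not named in the text, and it represents network weights as flat lists of floats. The noise parameters `theta`, `sigma`, and `dt` are hypothetical.

```python
import random

class OUNoise:
    """Ornstein-Uhlenbeck process (ASSUMED noise model):
    x <- x + theta * (mu - x) * dt + sigma * sqrt(dt) * N(0, 1)."""

    def __init__(self, dim, mu=0.0, theta=0.15, sigma=0.2, dt=0.01, seed=None):
        self.mu, self.theta, self.sigma, self.dt = mu, theta, sigma, dt
        self.rng = random.Random(seed)
        self.x = [mu] * dim

    def sample(self):
        scale = self.sigma * self.dt ** 0.5
        self.x = [x + self.theta * (self.mu - x) * self.dt
                  + scale * self.rng.gauss(0.0, 1.0) for x in self.x]
        return list(self.x)

def soft_update(target, online, tau=0.001):
    """Target-network update: theta' <- tau * theta + (1 - tau) * theta'."""
    return [tau * w + (1.0 - tau) * wt for w, wt in zip(online, target)]
```

With the soft update rate $\tau = 0.001$ listed in the training parameters, each target weight moves only 0.1% of the way toward its online counterpart per update, which is what keeps the TD targets slowly varying.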

The key model and training parameters for the unmanned drone system are summarized below:

Parameter	Value	Unit
Mass ($m$)	1.5	kg
Arm Length ($l$)	0.225	m
$J_x$, $J_y$	0.03213	kg·m²
$J_z$	0.06426	kg·m²
Thrust Coef. ($C_T$)	5.238e-5	N/(rad/s)²
Torque Coef. ($C_M$)	3.51e-7	N·m/(rad/s)²

Training Parameter	Value
Discount Factor ($\gamma$)	0.99
Actor Learning Rate	0.001
Critic Learning Rate	0.0001
HPR Weight ($\lambda$)	0.2
Soft Update Rate ($\tau$)	0.001
Replay Buffer Size	100,000

Simulation Experiments and Analysis

Simulations were conducted to validate the proposed enhanced DDPG method for unmanned drone control, focusing on training efficiency, trajectory tracking accuracy, and disturbance rejection.

Point Hovering

The unmanned drone was commanded to hover at a fixed point $(0.5, 0.5, 0.8)$ m. The training convergence curves show that the enhanced DDPG converged approximately 21.4% faster and reached an 8.33% higher final average reward than the standard DDPG, with markedly lower reward volatility. These results support the effectiveness of the HPR mechanism and the structured reward function.

The tracking performance after training is summarized in the following table, which shows that the enhanced DDPG provides a smoother response with less overshoot and a shorter settling time for the unmanned drone.

Channel	Metric	Standard DDPG	Enhanced DDPG	Improvement
X-Position	Overshoot	1.67%	~0%	~100% reduction
X-Position	Settling Time	~3.5 s	~2.3 s	~34.3% faster
Y-Position	Overshoot	4.46%	< 1%	> 77% reduction
Y-Position	Settling Time	~4.2 s	~3.0 s	~28.6% faster
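Overshoot and settling time of this kind can be extracted from a logged step response. The sketch below assumes a response that starts at zero, a nonzero target, and a 2% settling band; the paper does not state which band or definition was used.

```python
def step_metrics(t, y, target, band=0.02):
    """Return (percent overshoot, settling time) for a response y(t).

    Assumes the response starts at 0 and `target` is nonzero.
    `band` is the settling tolerance as a fraction of |target| (assumed 2%).
    Settling time is None if the response is still outside the band at the end.
    """
    peak = max(y) if target > 0 else min(y)
    overshoot = max(0.0, (peak - target) / target * 100.0)

    tol = band * abs(target)
    last_out = -1  # index of the last sample outside the tolerance band
    for i, yi in enumerate(y):
        if abs(yi - target) > tol:
            last_out = i
    settling = None if last_out == len(y) - 1 else t[last_out + 1]
    return overshoot, settling
```

Applied to the X and Y position logs of both controllers, a function of this form would yield the four rows of the table, and the relative improvement follows as the difference divided by the standard DDPG value (for example, $(3.5 - 2.3)/3.5 \approx 34.3\%$).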

Accelerating Helical Trajectory Tracking

The unmanned drone was tasked to track a complex accelerating helical path defined by:
$$ x_d = (1+0.1t)\cos(0.5t), \quad y_d = (4+0.1t)\sin(0.5t), \quad z_d = 0.1+0.2t $$
The 3D tracking results showed the unmanned drone controlled by the enhanced DDPG adhered much more closely to the desired trajectory, especially during high-speed, large-radius turns. The integral of position error, which indicates accumulated deviation, was significantly lower with the enhanced method (e.g., reductions of 11.8% in X and 10.3% in Y channels), demonstrating its superior long-term accuracy for the unmanned drone.
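For reference, the desired trajectory and its first two derivatives can be generated analytically; the second derivative is what the feedforward term of the position loop requires. This sketch uses the coefficients exactly as written in the trajectory definition above; the function name is illustrative.

```python
import math

def helix_reference(t):
    """Return (position, velocity, acceleration) of the desired helix at time t."""
    rx = 1.0 + 0.1 * t          # growing radius on the x axis
    ry = 4.0 + 0.1 * t          # growing radius on the y axis
    c, s = math.cos(0.5 * t), math.sin(0.5 * t)
    pos = (rx * c, ry * s, 0.1 + 0.2 * t)
    # Product rule: d/dt[r(t) cos(0.5 t)] = 0.1 cos(0.5 t) - 0.5 r sin(0.5 t)
    vel = (0.1 * c - 0.5 * rx * s, 0.1 * s + 0.5 * ry * c, 0.2)
    # Differentiating once more gives the feedforward accelerations
    acc = (-0.1 * s - 0.25 * rx * c, 0.1 * c - 0.25 * ry * s, 0.0)
    return pos, vel, acc
```

Because the radius grows linearly while the angular rate is fixed, the commanded speed and centripetal acceleration both increase with time, which is why tracking becomes progressively harder along the path.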

Robustness to Disturbances

The robustness of the trained unmanned drone controller was tested under two conditions: 1) Strong gyroscope noise ($\sigma^2=0.15$), and 2) Impulse wind gusts applied to position channels. The enhanced DDPG showed markedly better disturbance rejection.

Disturbance Type	Performance Metric	Standard DDPG	Enhanced DDPG	Improvement
Gyro Noise	Attitude MSE (rad²)	0.0198	0.0143	27.8% lower
Gyro Noise	Avg. Oscillation Amp.	0.034 rad	0.021 rad	38.2% lower
Gyro Noise	Avg. Oscillation Freq.	5.53 Hz	3.35 Hz	39.4% lower
Wind Gust	Max. Deviation	0.75 m	0.37 m	50.7% lower
Wind Gust	Recovery Time	13.6 s	6.26 s	53.9% faster

Conclusion

This paper presented an adaptive trajectory control method for an unmanned drone based on an enhanced Deep Deterministic Policy Gradient algorithm. By designing a structured reward function that incorporates velocity-sensitive attitude penalties and dynamic integral weighting, along with a Hybrid Prioritized Experience Replay mechanism, the training efficiency and final control performance were significantly improved. Simulation results across point hovering, dynamic trajectory tracking, and disturbance rejection scenarios confirm that the unmanned drone controller tuned by the enhanced DDPG achieves superior tracking precision, dynamic response, and robustness compared to its standard counterpart. Future work will focus on the practical deployment of this DRL-based tuning strategy on a physical unmanned drone platform to bridge the simulation-to-reality gap.