An Improved TD3 Approach for Quadrotor Drone Obstacle Avoidance

Quadrotor drones have gained widespread favor due to their compact size, agility, and ability to perform tasks inconvenient or hazardous for humans. They excel in numerous fields such as industrial inspection, disaster response, and daily assistance. However, this rapid development has been accompanied by a yearly increase in incidents involving quadrotor drones causing injuries and property damage, even affecting airspace security. Consequently, ensuring that quadrotor drones possess autonomous obstacle avoidance capability is a fundamental and critical functional requirement, serving as a prerequisite for executing various complex operational tasks.

In recent years, reinforcement learning has developed rapidly and demonstrated outstanding performance in the field of artificial intelligence. Many researchers have used the reinforcement learning framework to study obstacle avoidance and path planning for agents. Compared with traditional methods such as artificial potential fields, visibility graphs, and particle swarm optimization, reinforcement learning methods offer greater advantages in complex and challenging environments. Relative to smart cars or mobile robots, obstacle avoidance for a quadrotor drone is more complex because of its additional degrees of freedom in motion. Scholars worldwide have conducted related research. Some applied Q-learning algorithms to obstacle avoidance and path planning for quadrotor drones in indoor simulated environments, with experiments showing that the trained Q-learning algorithm was superior to the A* algorithm in terms of time efficiency. Others focused on a quadrotor drone equipped with a single front-facing camera, proposing a deep reinforcement learning algorithm based on dataset fusion to achieve autonomous obstacle avoidance in dense, cluttered environments. Another approach introduced an uncertainty-aware deep reinforcement learning method that, by estimating collision probability, enabled the quadrotor drone to maintain “vigilance” in unfamiliar environments, reducing its speed and minimizing collision risk. Researchers from the Chinese Academy of Sciences proposed a reinforcement learning algorithm inspired by the prefrontal cortex-basal ganglia circuit to achieve obstacle avoidance control for drones. A team from the Hong Kong University of Science and Technology used the Deep Deterministic Policy Gradient (DDPG) algorithm to plan desired paths for a quadrotor drone, combining it with a PID controller in a hierarchical structure to accomplish collision-free target tracking tasks.

The DDPG algorithm, a classic method for continuous action control, has been widely applied to obstacle avoidance and path planning problems. However, it suffers from Q-value overestimation bias. When the accumulated error becomes large enough, it can lead to updates toward suboptimal policies and to divergent behavior. The Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm was proposed to address this issue, with studies indicating that its performance surpasses that of DDPG. However, the classical TD3 algorithm samples its replay buffer uniformly at random during training, so the quality of the sampled data varies, which affects training effectiveness.

This work approaches the problem from the perspective of improving training data quality. We propose an Improved Twin Delayed Deep Deterministic Policy Gradient (I-TD3) algorithm and apply it to the obstacle avoidance problem for quadrotor drones. Using the AirSim simulation platform, we compare its obstacle avoidance performance with that of the classical TD3 and DDPG algorithms. The experimental results indicate that I-TD3 outperforms both baselines.

1. Background and Theoretical Foundation

1.1 Reinforcement Learning and Markov Decision Process

The learning process of reinforcement learning can be modeled as a Markov Decision Process (MDP), defined by the tuple $$\{S, A, P, R, \gamma\}$$.

$$S$$: The set of environment states.
$$A$$: The set of actions available to the agent.
$$P$$: The state transition model. $$P_{ss'}^a = P[S_{t+1}=s' \mid S_t=s, A_t=a]$$ denotes the probability of transitioning to state $$s'$$ after taking action $$a$$ in state $$s$$.
$$R$$: The reward function. $$r_t = R[S_t=s, A_t=a]$$ represents the reward received after taking action $$a$$ in state $$s$$.
$$\gamma \in [0, 1]$$: The discount factor, which balances the weight between immediate and future rewards.

The agent’s behavior is governed by a policy $$\pi$$. The goal is to find the optimal policy $$\pi^*$$ that maximizes the expected cumulative return. This is often achieved by finding the optimal state-value function $$V^*(s)$$ or action-value function $$Q^*(s,a)$$.
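
To make the role of the discount factor concrete, the following minimal sketch computes the discounted return $$G_0 = \sum_k \gamma^k r_k$$ for one episode's reward sequence. The function name `discounted_return` is illustrative only and does not appear in the original work.

```python
def discounted_return(rewards, gamma=0.99):
    """Return G_0 = sum_k gamma^k * r_k for one episode's reward list."""
    g = 0.0
    # Accumulate backwards so that each step applies G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Three unit rewards with gamma = 0.5: 1 + 0.5 + 0.25 = 1.75
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))
```

A value of $$\gamma$$ close to 1 weights future rewards almost as heavily as immediate ones, while a small $$\gamma$$ makes the agent short-sighted.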

1.2 From DDPG to TD3

For continuous control tasks like quadrotor drone navigation, policy gradient methods are a natural fit because they can output continuous actions directly. The Deep Deterministic Policy Gradient (DDPG) algorithm is a seminal actor-critic method that combines insights from DQN and deterministic policy gradients. It maintains an actor network $$\mu(s|\theta^\mu)$$ that maps states to deterministic actions, and a critic network $$Q(s,a|\theta^Q)$$ that estimates the value of state-action pairs. Target networks are used for stability.

However, DDPG is known to suffer from overestimation bias in the critic. The Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm addresses this with three key modifications:

Twin Critics: Two separate critic networks $$Q_{\theta_1}$$ and $$Q_{\theta_2}$$ are trained independently. The smaller of their two target values is used to form the regression target, mitigating overestimation.
$$y = r + \gamma \min_{i=1,2} Q_{\theta_i'}(s', \tilde{a}')$$
Target Policy Smoothing: Noise is added to the target action to prevent the policy from exploiting sharp peaks in the Q-function.
$$\tilde{a}' = \mu_{\phi'}(s') + \epsilon, \quad \epsilon \sim \text{clip}(\mathcal{N}(0, \sigma), -c, c)$$
Delayed Policy Updates: The actor and target networks are updated less frequently than the critics, allowing the value estimate to stabilize first.
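
The first two modifications can be sketched together as a single target computation. The NumPy sketch below is illustrative rather than the authors' implementation: `target_actor`, `target_q1`, and `target_q2` are hypothetical callables standing in for the target networks, the values of `sigma` and `c` are placeholders, and both the `done` mask and the clipping of the smoothed action to `[a_low, a_high]` are common practice assumed here rather than taken from the equations above.

```python
import numpy as np


def td3_target(r, s_next, done, target_actor, target_q1, target_q2,
               gamma=0.99, sigma=0.2, c=0.5, a_low=-1.0, a_high=1.0, rng=None):
    """Clipped double-Q target with target policy smoothing (sketch)."""
    rng = np.random.default_rng() if rng is None else rng
    a_next = target_actor(s_next)
    # Target policy smoothing: clipped Gaussian noise on the target action.
    noise = np.clip(rng.normal(0.0, sigma, size=a_next.shape), -c, c)
    # Assumption: keep the smoothed action inside the valid action range.
    a_tilde = np.clip(a_next + noise, a_low, a_high)
    # Twin critics: take the smaller of the two target estimates.
    q_min = np.minimum(target_q1(s_next, a_tilde), target_q2(s_next, a_tilde))
    # Assumption: no bootstrapping from terminal states (done == 1).
    return r + gamma * (1.0 - done) * q_min


# Toy stand-ins for the target networks (hypothetical, for illustration only).
toy_actor = lambda s: np.full((s.shape[0], 2), 0.5)
toy_q1 = lambda s, a: a.sum(axis=1)
toy_q2 = lambda s, a: a.sum(axis=1) + 1.0

y = td3_target(np.ones(3), np.zeros((3, 4)), np.array([0.0, 0.0, 1.0]),
               toy_actor, toy_q1, toy_q2, gamma=0.9, sigma=0.0)
print(y)
```

Because the minimum is taken before bootstrapping, an overestimate in one critic cannot propagate into the target unless the other critic overestimates as well.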

1.3 Prioritized Experience Replay

Experience Replay is a crucial component that breaks temporal correlations by storing experiences $$(s_t, a_t, r_t, s_{t+1})$$ in a buffer and sampling mini-batches randomly for training. Prioritized Experience Replay (PER) enhances this by assigning a priority $$p_i$$ to each transition, typically based on the Temporal-Difference (TD) error $$\delta_i$$:
$$p_i = |\delta_i| + \epsilon$$
where $$\epsilon$$ is a small constant ensuring non-zero priority. Transitions are sampled with probability $$P(i) = p_i^\alpha / \sum_k p_k^\alpha$$, where $$\alpha$$ controls the degree of prioritization. This focuses learning on surprising or more informative experiences, potentially speeding up convergence for the quadrotor drone agent.
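
As a minimal sketch of the two formulas above, the snippet below converts TD errors into sampling probabilities and draws a mini-batch of indices. The value `alpha=0.6` and the function name `per_probabilities` are assumptions for illustration; the source does not state the exponent used.

```python
import numpy as np


def per_probabilities(td_errors, alpha=0.6, eps=1e-6):
    """Map TD errors to PER sampling probabilities P(i) = p_i^alpha / sum_k p_k^alpha."""
    p = np.abs(np.asarray(td_errors, dtype=float)) + eps  # p_i = |delta_i| + eps
    scaled = p ** alpha
    return scaled / scaled.sum()


probs = per_probabilities([0.1, 1.0, 2.0])
rng = np.random.default_rng(0)
batch_idx = rng.choice(len(probs), size=2, p=probs)  # sampled with replacement
print(probs, batch_idx)
```

Setting `alpha=0` recovers uniform sampling, while larger values concentrate sampling on transitions with large TD error.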

2. The Proposed I-TD3 Algorithm for Quadrotor Drones

The core innovation of the I-TD3 algorithm lies in its structured management of experience data and a refined reward function, specifically tailored for the quadrotor drone obstacle avoidance task.

2.1 Dual Experience Buffer Architecture

A key limitation of standard experience replay is the uniform random sampling of experiences, which may include many uninformative transitions (e.g., repetitive hovering, early crashes). For a learning quadrotor drone, experiences leading to success (reaching the goal) and those leading to failure (collision) have different pedagogical value.

I-TD3 addresses this by routing experiences through three buffers: a short temporary buffer and two permanent buffers that separate successful from failed experience:

Buffer Name	Purpose	Sampling Method	Size
$$M_{\text{temp}}$$	Temporary storage for the most recent $$\alpha$$ experiences of the current episode.	First-In-First-Out (FIFO)	Fixed ($$\alpha$$, e.g., 10)
$$M_{\text{success}}$$	Permanent storage for sequences of experiences from successful episodes.	Prioritized Experience Replay (PER)	Grows with training
$$M_{\text{failure}}$$	Permanent storage for sequences from failed episodes.	Uniform Random Sampling	Grows with training

Mechanism: During an episode, every transition is stored in $$M_{\text{temp}}$$. Once $$M_{\text{temp}}$$ is full, the oldest experience is moved to $$M_{\text{success}}$$, maintaining a rolling window of the $$\alpha$$ most recent steps. (Here $$\alpha$$ denotes the window length of $$M_{\text{temp}}$$, not the prioritization exponent of Section 1.3.) This assumes that if the quadrotor drone is still flying at time $$t$$, the experience from time $$t-\alpha$$ contributed positively to its survival. When an episode terminates, the entire content of $$M_{\text{temp}}$$ is transferred to either $$M_{\text{success}}$$ (if the episode was successful) or $$M_{\text{failure}}$$ (if it failed due to collision or timeout). This structure ensures that $$M_{\text{success}}$$ contains contiguous sequences of “good” flight behavior, while $$M_{\text{failure}}$$ contains the final steps that led to mistakes.
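
The routing mechanism can be sketched with a double-ended queue for the temporary buffer. This is an illustrative sketch under stated assumptions, not the authors' code: the class name `EpisodeRouter` is hypothetical, "full" is interpreted as exceeding the window length, and the success buffer is a plain list here, without the priority structure that PER would require.

```python
from collections import deque


class EpisodeRouter:
    """Routes transitions through a temporary window into success/failure buffers."""

    def __init__(self, window=10):
        self.window = window   # length of the temporary window
        self.temp = deque()    # M_temp: most recent transitions of this episode
        self.success = []      # M_success (priorities omitted in this sketch)
        self.failure = []      # M_failure

    def store(self, transition):
        self.temp.append(transition)
        # Once the window overflows, the oldest step is assumed to be "good".
        if len(self.temp) > self.window:
            self.success.append(self.temp.popleft())

    def end_episode(self, reached_goal):
        # The final window goes to success or failure depending on the outcome.
        target = self.success if reached_goal else self.failure
        target.extend(self.temp)
        self.temp.clear()


router = EpisodeRouter(window=3)
for t in range(5):
    router.store(t)                       # integers stand in for transitions
router.end_episode(reached_goal=False)
print(router.success, router.failure)     # [0, 1] [2, 3, 4]
```

In this toy run the episode ends in failure, so only the last three steps are attributed to the failure, while earlier steps are kept as successful experience.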

2.2 Hybrid Sampling Strategy

During training, a mini-batch of size $$m$$ is assembled from the two main buffers according to the rule:
$$ n_{\text{failure}} = \beta m $$
$$ n_{\text{success}} = m - n_{\text{failure}} $$
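where $$\beta$$ is the failure sample ratio (e.g., 0.05). Because $$\beta m$$ is generally not an integer (for example, $$0.05 \times 64 = 3.2$$ for the batch size listed in Section 3.1), $$n_{\text{failure}}$$ must be rounded to an integer in practice.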

Experiences from $$M_{\text{success}}$$ are sampled using Prioritized Experience Replay. The TD error from the twin critics is used to update their priority, focusing learning on the more challenging or informative successful maneuvers for the quadrotor drone.
Experiences from $$M_{\text{failure}}$$ are sampled uniformly. The goal here is not to prioritize specific failures but to provide a steady, random reminder of the negative outcomes the quadrotor drone must avoid, preventing catastrophic forgetting of failure modes.

This hybrid strategy increases the sampling efficiency of high-value successful experiences while maintaining exposure to failure patterns, accelerating and stabilizing the learning process for the quadrotor drone agent.
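
A minimal sketch of how such a mixed mini-batch might be assembled is given below. Several details are assumptions not specified in the text: rounding $$\beta m$$ to the nearest integer, sampling with replacement, the PER exponent `per_alpha=0.6`, and falling back to success-only batches while the failure buffer is still empty. The function name `hybrid_batch_indices` is hypothetical.

```python
import numpy as np


def hybrid_batch_indices(success_priorities, n_failure_stored, m=64, beta=0.05,
                         per_alpha=0.6, rng=None):
    """Return (success_idx, failure_idx) for one mixed mini-batch (sketch)."""
    rng = np.random.default_rng() if rng is None else rng
    # Assumption: round beta * m to the nearest integer (0.05 * 64 -> 3).
    n_failure = int(round(beta * m)) if n_failure_stored > 0 else 0
    n_success = m - n_failure
    # PER over the success buffer: P(i) proportional to p_i^alpha.
    p = np.asarray(success_priorities, dtype=float) ** per_alpha
    p /= p.sum()
    success_idx = rng.choice(len(p), size=n_success, p=p)
    # Uniform sampling over the failure buffer.
    if n_failure > 0:
        failure_idx = rng.integers(0, n_failure_stored, size=n_failure)
    else:
        failure_idx = np.empty(0, dtype=int)
    return success_idx, failure_idx


s_idx, f_idx = hybrid_batch_indices(np.ones(100), n_failure_stored=20,
                                    rng=np.random.default_rng(0))
print(len(s_idx), len(f_idx))  # 61 3
```

With $$m=64$$ and $$\beta=0.05$$, roughly three samples per batch come from failures, so failure experience acts as a light regularizer rather than the dominant training signal.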

2.3 Enhanced Reward Function for Quadrotor Drone Navigation

A well-shaped reward function is critical for guiding the quadrotor drone. The proposed reward function $$r_t$$ is defined as follows:

$$
r_t =
\begin{cases}
+10, & \text{if } s_{t+1} \text{ is the goal state (successful arrival).} \\
-2, & \text{if } s_{t+1} \text{ is a terminal collision state.} \\
-0.05, & \text{if forward velocity } v_y(s_{t+1}) \leq \text{speed\_limit} \text{ (penalty for stalling).} \\
+0.1 \cdot v_y(s_{t+1}), & \text{if forward velocity } v_y(s_{t+1}) > 0 \text{ (reward for forward progress).} \\
+0.5 \cdot v_y(s_{t+1}), & \text{if forward velocity } v_y(s_{t+1}) < 0 \text{ (stronger penalty for moving backward).}
\end{cases}
$$

This function provides clear, sparse signals for terminal states (large positive/negative reward). More importantly, the shaping rewards for velocity create a strong incentive for the quadrotor drone to move purposefully towards the goal. The asymmetry between the coefficients for positive and negative $$v_y$$ (0.1 vs. 0.5) creates a significant disincentive for the agent to oscillate or move backward, encouraging efficient, direct flight paths through the obstacle field.
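
The piecewise definition can be sketched as a small function. Since the velocity cases as printed overlap, the evaluation order used here (terminal states first, then backward motion, then stalling, then forward progress) is an assumption, as is the placeholder value `speed_limit=0.1`; neither is confirmed by the source. The function name `itd3_reward` is hypothetical.

```python
def itd3_reward(reached_goal, collided, v_y, speed_limit=0.1):
    """Piecewise reward sketch; the ordering of the cases is an assumption."""
    if reached_goal:
        return 10.0            # successful arrival
    if collided:
        return -2.0            # terminal collision
    if v_y < 0.0:
        return 0.5 * v_y       # negative value: stronger penalty for moving backward
    if v_y <= speed_limit:
        return -0.05           # small penalty for stalling
    return 0.1 * v_y           # reward proportional to forward speed


for v in (-1.0, 0.05, 2.0):
    print(v, itd3_reward(False, False, v))
```

Under this ordering, a backward speed of 1 costs 0.5 per step while a forward speed of 1 earns only 0.1, which reflects the asymmetry described above.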

2.4 Algorithm Summary and Pseudo-Code

The I-TD3 algorithm integrates the TD3 architecture with the dual-buffer system and the new reward function. The following table summarizes the key components:

Component	I-TD3 Implementation
Critic Networks	Two networks $$Q_{\theta_1}, Q_{\theta_2}$$. Target: $$y = r + \gamma \min_{i=1,2} Q_{\theta_i'}(s', \tilde{a}')$$.
Actor Network	Policy network $$\mu_\phi(s)$$. Updated less frequently (delayed).
Target Policy Smoothing	Yes. $$\tilde{a}' = \mu_{\phi'}(s') + \epsilon$$, $$\epsilon \sim \text{clip}(\mathcal{N}(0, \sigma), -c, c)$$.
Experience Buffers	Three buffers: $$M_{\text{temp}}$$, $$M_{\text{success}}$$ (PER), $$M_{\text{failure}}$$ (Uniform).
Sampling	Hybrid: $$n_{\text{failure}}$$ from $$M_{\text{failure}}$$, $$n_{\text{success}}$$ from $$M_{\text{success}}$$ via PER.
Reward Function	As defined in Section 2.3, tailored for quadrotor drone navigation.

Pseudo-Code Outline:

Initialize actor $$\mu_\phi$$, critics $$Q_{\theta_1}, Q_{\theta_2}$$, and their target networks.
Initialize empty buffers $$M_{\text{success}}$$ (PER structure), $$M_{\text{failure}}$$, and $$M_{\text{temp}}$$.
For episode = 1 to M do:
- Reset environment, get initial state $$s_0$$, clear $$M_{\text{temp}}$$.
- For t = 0 to T do:
  1. Select action $$a_t = \mu_\phi(s_t) + \mathcal{N}_t$$ (exploration noise).
  2. Execute $$a_t$$, observe $$r_t$$, $$s_{t+1}$$ (using the I-TD3 reward function).
  3. Store $$(s_t, a_t, r_t, s_{t+1})$$ in $$M_{\text{temp}}$$.
  4. If $$|M_{\text{temp}}| == \alpha$$: Move oldest transition from $$M_{\text{temp}}$$ to $$M_{\text{success}}$$.
  5. Sample mini-batch: Get $$n_{\text{failure}}$$ transitions from $$M_{\text{failure}}$$ uniformly, and $$n_{\text{success}}$$ from $$M_{\text{success}}$$ via PER.
  6. Update critics and actor networks using TD3 update rules with the sampled batch.
  7. Update priorities for the $$n_{\text{success}}$$ samples in the PER tree.
  8. Soft update target networks.
- End For (Episode ends).
- Transfer all experiences from $$M_{\text{temp}}$$ to $$M_{\text{success}}$$ (if goal reached) or $$M_{\text{failure}}$$ (if collision/timeout).
End For.
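
Step 8 of the outline (the soft update of the target networks) is commonly implemented as Polyak averaging, $$\theta' \leftarrow \tau\theta + (1-\tau)\theta'$$. The sketch below assumes this form and uses lists of NumPy arrays as stand-ins for network weights; the function name `soft_update` is illustrative and the snippet is independent of any particular deep learning framework.

```python
import numpy as np


def soft_update(target_params, online_params, tau=0.005):
    """Polyak averaging: theta_target <- tau * theta_online + (1 - tau) * theta_target."""
    return [(1.0 - tau) * t + tau * o
            for t, o in zip(target_params, online_params)]


target = [np.zeros(2)]   # toy target-network weights
online = [np.ones(2)]    # toy online-network weights
for _ in range(3):
    target = soft_update(target, online, tau=0.005)
print(target[0])  # slowly drifts toward the online weights
```

A small $$\tau$$ makes the target networks change slowly, which keeps the regression targets for the critics nearly stationary between updates.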

3. Experimental Evaluation

3.1 Simulation Environment and Setup

Experiments were conducted using the AirSim simulator, which provides high-fidelity physics and visual simulation for quadrotor drones. A custom, challenging “narrow multi-obstacle channel” map was created. The quadrotor drone must start from a designated point and navigate through a series of closely placed obstacles within a corridor to reach an exit point. This environment tests precision and robust obstacle avoidance.

Parameter Category	Settings
Simulation Platform	AirSim (Unreal Engine), Windows 10.
Hardware	Intel Xeon E5-2673v3, GeForce RTX 2080Ti, 32GB RAM.
Software Stack	TensorFlow 1.13.1, CUDA 10.0.
Quadrotor Drone Model	Default AirSim quadrotor model with state API (position, velocity, orientation).
State Representation ($$s_t$$)	Relative position to goal, current velocity vector, relative distance to nearest obstacle (simulated via depth or position).
Action Space ($$a_t$$)	Normalized continuous values for pitch, roll, yaw rate, and throttle/vertical speed.
Comparison Algorithms	I-TD3 (proposed), Standard TD3, DDPG.
Training Episodes	2000 episodes per algorithm.
Key Hyperparameters	$$\gamma=0.99$$, actor/critic learning rate=0.001, $$\tau$$(soft update)=0.005, batch size $$m=64$$, $$\alpha$$(buffer window)=10, $$\beta$$(failure ratio)=0.05.

3.2 Results and Analysis

The performance of the quadrotor drone under the three algorithms was evaluated using three metrics: (1) Steps taken in successful episodes, (2) Obstacle avoidance success rate over training, and (3) Average straight-line (crow-fly) distance traveled per episode, indicating how close the drone gets to the goal before failure.

1. Efficiency of Learned Policy (Steps per Success): The I-TD3 algorithm achieved its first successful episode around episode 260, significantly earlier than TD3 (around episode 1100) and DDPG (around episode 1150). Furthermore, the number of steps required for success rapidly decreased and stabilized around 60 steps for I-TD3. In contrast, TD3 and DDPG required more steps (more than 80, with higher variance). This indicates that I-TD3 learns an efficient, stable navigation policy for the quadrotor drone sooner than either baseline.

2. Success Rate: The success rate, calculated over a rolling window of 50 episodes, clearly shows the superiority of I-TD3.

Algorithm	First Major Rise	Stable Performance Plateau	Final Success Rate (Approx.)	Stability
I-TD3	~1300 episodes	High and stable after ~1500 episodes	>70%	High
TD3	~1600 episodes	Moderate, some fluctuations	~50-60%	Medium
DDPG	~1600 episodes	Low, with large oscillations	<50%	Low
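
For reference, a rolling success rate of this kind can be computed from a sequence of per-episode outcomes as in the sketch below. Encoding each episode as 1 (success) or 0 (failure) and the function name `rolling_success_rate` are assumptions for illustration.

```python
import numpy as np


def rolling_success_rate(outcomes, window=50):
    """Mean of 0/1 episode outcomes over each full sliding window."""
    x = np.asarray(outcomes, dtype=float)
    if len(x) < window:
        return np.empty(0)  # not enough episodes for a single full window
    kernel = np.ones(window) / window
    return np.convolve(x, kernel, mode="valid")


# Toy example with a window of 2 episodes.
print(rolling_success_rate([0, 0, 1, 1], window=2))  # [0.  0.5 1. ]
```

With a window of 50, the first value becomes available only after episode 50, and each subsequent value reflects the most recent 50 episodes.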

3. Average Crow-Fly Distance: This metric measures how far the quadrotor drone travels towards the goal before an episode ends (in both successful and failed episodes). A higher average distance indicates the agent is getting closer to the goal even when it fails. I-TD3 consistently maintained a higher average distance than both baselines after about 600 episodes. Crucially, between episodes 1400 and 2000, I-TD3’s high average distance coincided with its high success rate, implying that failures, when they occurred, happened very close to the goal. This contrasts with TD3 and DDPG, which showed lower and more variable distances.
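
One plausible way to compute this metric is the Euclidean distance between the start position and the position at episode end, averaged over episodes. The sketch below assumes that definition and 3D positions given as coordinate tuples; both function names are hypothetical.

```python
import math


def crow_fly_distance(start_xyz, end_xyz):
    """Straight-line distance between the start and final positions."""
    return math.dist(start_xyz, end_xyz)


def mean_crow_fly(episodes):
    """Average crow-fly distance over (start, end) position pairs."""
    distances = [crow_fly_distance(s, e) for s, e in episodes]
    return sum(distances) / len(distances)


episodes = [((0, 0, 0), (3, 4, 0)), ((0, 0, 0), (0, 0, 1))]
print(mean_crow_fly(episodes))  # (5 + 1) / 2 = 3.0
```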

3.3 Interpretation of Results

The experimental results validate the effectiveness of the I-TD3 improvements for quadrotor drone obstacle avoidance:

Faster and More Stable Learning: The early and stable rise in performance of I-TD3 can be attributed to the dual-buffer system. By densely packing $$M_{\text{success}}$$ with contiguous good trajectories and sampling them efficiently via PER, the quadrotor drone agent quickly learns effective flight strategies. The small, uniform sample from $$M_{\text{failure}}$$ prevents catastrophic forgetting without derailing learning with excessive negative examples.
More Efficient Policies: The lower step count per success and the shape of the reward function (penalizing backward motion) guide the I-TD3 quadrotor drone to find shorter, more direct paths through the obstacle course.
Superior Final Performance: The combination of a better data curriculum (via buffer management) and a well-shaped reward signal allows I-TD3 to achieve a higher asymptotic success rate than the classical TD3 and DDPG algorithms in this challenging quadrotor drone navigation task.

4. Conclusion and Future Work

This paper presented the I-TD3 algorithm, an enhanced version of the Twin Delayed Deep Deterministic Policy Gradient algorithm, specifically designed to improve autonomous obstacle avoidance for quadrotor drones. The core innovations involve the strategic separation of successful and failed flight experiences into dual replay buffers, coupled with a hybrid sampling strategy that combines Prioritized Experience Replay for successful maneuvers with uniform sampling for failures. This architecture significantly increases the sampling efficiency of informative data, alleviating issues related to low training efficiency caused by an abundance of uninformative experiences. Furthermore, an asymmetric reward function was designed to incentivize efficient, forward progress for the quadrotor drone while strongly discouraging hesitation or regression.

Extensive simulation experiments on the AirSim platform within a custom narrow-obstacle channel environment demonstrated the superiority of I-TD3. The proposed algorithm enabled the quadrotor drone to learn effective obstacle avoidance policies faster, achieve a higher final success rate, and produce more efficient flight paths compared to both the standard TD3 and DDPG algorithms.

However, this work focused on static obstacles. A critical direction for future research is to extend this framework to dynamic environments where obstacles are moving. This would require the quadrotor drone’s state representation to include velocity estimates of other objects and likely more sophisticated network architectures or training paradigms (e.g., incorporating recurrent layers for memory, or using centralized training with decentralized execution for multi-agent scenarios). Investigating the transfer of policies learned in simulation to real-world quadrotor drones, dealing with sensor noise and model discrepancy, remains another vital and challenging avenue for future work.