In recent years, unmanned aerial vehicles (UAVs) have gained significant attention due to their low cost, high mobility, and versatility in applications such as autonomous navigation, surveillance, disaster rescue, and logistics. As operational environments become increasingly complex, efficient path planning is critical to ensure safety, minimize energy consumption, and achieve mission objectives. Path planning involves finding an optimal or near-optimal trajectory from a start point to a goal while avoiding obstacles and satisfying constraints such as smoothness and time efficiency. Traditional methods, including the A* algorithm, Dijkstra’s algorithm, and the artificial potential field (APF) method, often struggle with high-dimensional spaces, dynamic environments, and real-time requirements. Similarly, intelligent optimization algorithms such as ant colony optimization, genetic algorithms, and particle swarm optimization may suffer from slow convergence, local optima, or poor scalability. To address these limitations, reinforcement learning (RL), particularly deep reinforcement learning (DRL), has emerged as a promising approach for UAV path planning, leveraging neural networks to approximate value functions and learn policies through interaction with the environment.
Among DRL algorithms, Deep Q-Network (DQN) has shown success in various domains by combining Q-learning with deep neural networks. However, standard DQN applied to UAV path planning faces challenges such as sparse rewards, inefficient exploration, paths with an excessive number of waypoints, and inadequate consideration of obstacle safety distances. These issues can lead to slow convergence, suboptimal paths, and poor generalization in complex 3D environments. In this work, we propose an improved DQN-based algorithm, termed RNDQN (Replay-Noise-based Deep Q-Network), which incorporates prioritized experience replay and noise-based exploration to enhance learning efficiency and robustness. Additionally, we design a composite reward function inspired by artificial potential field theory, integrating attractive and repulsive terms along with heading stability and time penalties to guide the UAV effectively. Our contributions include a systematic framework for 3D path planning that reduces path length and the number of waypoints while maintaining safety and efficiency. This research provides a theoretical foundation and a practical solution for rapid and reliable UAV navigation in cluttered environments.

The core of DQN lies in approximating the action-value function \(Q(s,a)\), which represents the expected cumulative reward when taking action \(a\) in state \(s\) and following an optimal policy thereafter. Traditional Q-learning uses a table to store \(Q\) values, but this becomes infeasible for large or continuous state spaces. DQN addresses this by employing a deep neural network parameterized by \(\theta\) to estimate \(Q(s,a;\theta)\). The objective is to maximize the discounted cumulative reward, expressed as:
$$ Q^*(s,a) = \max_{\pi} \mathbb{E} \left[ \sum_{t=0}^{\infty} \gamma^t r_t \mid s_0 = s, a_0 = a, \pi \right] $$
where \(\gamma \in [0,1]\) is the discount factor balancing immediate and future rewards, and \(\pi\) denotes the policy. To stabilize training, DQN utilizes two key techniques: experience replay and a target network. Experience replay stores transition tuples \((s, a, r, s')\) in a buffer \(D\), from which mini-batches are sampled to break temporal correlations. The target network, with parameters \(\theta^-\), is a delayed copy of the main network used to compute target Q-values, reducing instability. The loss function for updating the main network is the mean-squared error between the predicted and target Q-values:
$$ L(\theta) = \mathbb{E}_{(s,a,r,s') \sim D} \left[ \left( y - Q(s,a;\theta) \right)^2 \right] $$
where the target \(y\) is given by:
$$ y = r + \gamma \max_{a'} Q(s', a'; \theta^-) $$
During training, the UAV explores the environment by selecting actions with an \(\epsilon\)-greedy policy, which chooses a random action with probability \(\epsilon\) and the action with the highest Q-value otherwise. While effective in simple settings, this approach can lead to inefficient exploration and slow convergence in complex 3D spaces, motivating our improvements.
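As a minimal sketch of the target, loss, and \(\epsilon\)-greedy rule above, the following plain-Python functions operate on lists of Q-values rather than on a neural network. The function names and the `done` flag (which skips bootstrapping on terminal transitions) are illustrative assumptions, not part of the method description.

```python
import random


def td_target(reward, next_q_values, gamma, done=False):
    """y = r + gamma * max_a' Q(s', a'; theta^-).

    `done` is an assumed flag: on terminal transitions there is no next
    state, so only the immediate reward is returned.
    """
    if done:
        return reward
    return reward + gamma * max(next_q_values)


def mse_loss(predicted_q, targets):
    """Mean-squared error between predicted Q(s, a; theta) and targets y."""
    return sum((y - q) ** 2 for q, y in zip(predicted_q, targets)) / len(targets)


def epsilon_greedy(q_values, epsilon, rng=random):
    """Random action with probability epsilon, greedy action otherwise."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

In a full agent, `next_q_values` would come from the target network and `predicted_q` from the main network for the sampled mini-batch.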
To enhance the performance of DQN for UAV path planning, we introduce two modifications: prioritized experience replay and noise-based exploration. Prioritized experience replay assigns higher sampling probabilities to transitions with larger temporal-difference (TD) errors, as these are deemed more informative for learning. The TD error for a transition \(i\) is defined as:
$$ \delta_i = \left| r_i + \gamma \cdot \max_{a'} \mathbb{E}_{\tau'}[Z(s'_i, a')] - \mathbb{E}_{\tau}[Z(s_i, a)] \right| $$
where \(Z(s,a)\) represents the distributional Q-value, and \(\mathbb{E}_{\tau}[Z]\) denotes the expectation under a distribution \(\tau\). The priority \(p_i\) is computed as:
$$ p_i = (\delta_i + \varepsilon)^\alpha $$
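with \(\varepsilon\) being a small constant to prevent zero priorities, and \(\alpha\) an exponent controlling how strongly priorities skew sampling (this exponent is distinct from the learning rate, which Table 1 also denotes \(\alpha\)). The sampling probability \(P(i)\) for transition \(i\) is then: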
$$ P(i) = \frac{p_i}{\sum_j p_j} $$
This mechanism accelerates learning by focusing updates on surprising or significant experiences, such as near-collisions, which matters for UAVs navigating among obstacles.
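The priority and sampling formulas above can be sketched in a few lines of standard-library Python. The default values `eps=1e-6` and `alpha=0.6` are placeholder assumptions (the text does not state them), and the helper names are hypothetical.

```python
import random


def priorities(td_errors, eps=1e-6, alpha=0.6):
    """p_i = (delta_i + eps) ** alpha; eps and alpha defaults are assumed."""
    return [(abs(d) + eps) ** alpha for d in td_errors]


def sampling_probabilities(p):
    """P(i) = p_i / sum_j p_j."""
    total = sum(p)
    return [x / total for x in p]


def sample_indices(probs, batch_size, rng=random):
    """Draw a mini-batch of buffer indices, with replacement, according to P(i)."""
    return rng.choices(range(len(probs)), weights=probs, k=batch_size)
```

With `alpha = 0` every transition receives the same priority, so sampling reduces to the uniform replay of standard DQN. Larger values of `alpha` concentrate sampling on high-error transitions.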
For exploration, we replace the \(\epsilon\)-greedy strategy with noisy networks, which inject Gaussian noise into the neural network weights to encourage structured exploration. In a standard linear layer, the output \(y\) is computed as \(y = Wx + b\), where \(W\) and \(b\) are deterministic weights and biases. In noisy networks, these parameters are decomposed into learnable mean and noise components:
$$ W = \mu^W + \sigma^W \odot \varepsilon^W, \quad b = \mu^b + \sigma^b \odot \varepsilon^b $$
Here, \(\varepsilon^W\) and \(\varepsilon^b\) are sampled from independent Gaussian distributions, and \(\sigma^W, \sigma^b\) are noise parameters that modulate exploration intensity. The output becomes:
$$ y = (\mu^W + \sigma^W \odot \varepsilon^W) x + (\mu^b + \sigma^b \odot \varepsilon^b) $$
This approach enables more efficient exploration by perturbing the policy in a continuous, learnable manner, reducing the number of random, unguided actions that hamper UAV path planning in sparse-reward settings. The combination of prioritized replay and noisy exploration forms our RNDQN algorithm, which improves sample efficiency and convergence speed.
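A minimal sketch of the noisy linear layer follows, written with plain Python lists so that it stays self-contained. It draws independent standard Gaussian noise for every weight and bias, as in the formula above; the function name and the list-based layout are assumptions made for illustration, and a practical implementation would use a tensor library and learn \(\mu\) and \(\sigma\) by gradient descent.

```python
import random


def noisy_linear(x, mu_w, sigma_w, mu_b, sigma_b, rng=random):
    """y = (mu_W + sigma_W * eps_W) x + (mu_b + sigma_b * eps_b).

    mu_w and sigma_w are lists of rows (one row per output unit);
    each noise sample eps is drawn independently from N(0, 1).
    """
    out = []
    for i in range(len(mu_b)):
        # Perturbed bias for output unit i.
        acc = mu_b[i] + sigma_b[i] * rng.gauss(0.0, 1.0)
        for j, xj in enumerate(x):
            # Perturbed weight connecting input j to output i.
            w = mu_w[i][j] + sigma_w[i][j] * rng.gauss(0.0, 1.0)
            acc += w * xj
        out.append(acc)
    return out
```

Setting every \(\sigma\) entry to zero recovers the deterministic layer \(y = Wx + b\), which is a convenient sanity check.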
The action space of the UAV is discretized into 26 possible movements in a 3D grid environment, covering every combination of unit steps along the x, y, and z axes except remaining in place. Formally, the action set \(A\) is defined as:
$$ A = \{ (dx, dy, dz) \mid dx, dy, dz \in \{-1, 0, 1\}, (dx, dy, dz) \neq (0, 0, 0) \} $$
This representation balances expressiveness and computational tractability, enabling the UAV to move freely in three dimensions while keeping the number of actions manageable.
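The action set can be enumerated directly, which also confirms the count of \(3^3 - 1 = 26\). The `step` helper is a hypothetical name used only for illustration.

```python
from itertools import product

# All (dx, dy, dz) with components in {-1, 0, 1}, excluding the zero move.
ACTIONS = [a for a in product((-1, 0, 1), repeat=3) if a != (0, 0, 0)]


def step(position, action):
    """Apply one grid move to an (x, y, z) position."""
    return tuple(p + d for p, d in zip(position, action))
```

For example, `step((2, 2, 6), (1, 1, 0))` moves the UAV diagonally in the x-y plane to `(3, 3, 6)`.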
The state \(S_t\) at time \(t\) encapsulates the information needed for decision-making, including the UAV’s position, the target location, and obstacle data. It is structured as:
$$ S_t = (P_{\text{drone}}^t, P_{\text{target}}^t, O_t) $$
where \(P_{\text{drone}}^t\) and \(P_{\text{target}}^t\) are 3D coordinate vectors, and \(O_t\) encodes obstacle positions, typically as a set of bounding boxes or a grid occupancy map. This state representation provides a comprehensive view of the environment, aiding the UAV in planning collision-free paths.
A critical component of our approach is the reward function, which guides the UAV toward the desired behavior. Traditional reward functions often rely on sparse signals, such as a large positive reward for reaching the goal and a penalty for collisions. However, this can lead to slow learning and poor path quality. To address this, we design a composite reward function based on artificial potential field concepts, with additional terms for smoothness and efficiency. The total reward \(R(s,a)\) is the sum of six components:
$$ R(s,a) = R_{\text{goal}} + R_{\text{fail}} + R_{\text{att}} + R_{\text{rep}} + R_{\text{time}} + R_{\text{dir}} $$
Each term is detailed below.
The attraction reward \(R_{\text{att}}\) motivates the UAV to move toward the goal, with a scale factor that grows as the distance shrinks to increase sensitivity near the target. It is defined as:
$$ R_{\text{att}} =
\begin{cases}
15 \cdot \Delta d & \text{if } \rho(q, q_{\text{goal}}) < 5 \\
10 \cdot \Delta d & \text{if } 5 \leq \rho(q, q_{\text{goal}}) < 20 \\
5 \cdot \Delta d & \text{otherwise}
\end{cases} $$
where \(\rho(q, q_{\text{goal}})\) is the Euclidean distance between the UAV’s position \(q\) and the goal \(q_{\text{goal}}\), and \(\Delta d = \rho_{\text{prev}} - \rho_{\text{current}}\) is the reduction in distance after taking an action. Because \(\Delta d\) is nonzero at almost every step, this graded reward mitigates reward sparsity and provides a learning signal even when the UAV is far from the goal.
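The piecewise definition translates into a short function. This is a sketch under one stated assumption: the distance band is selected from the UAV's position after the action, which the text does not specify. `math.dist` requires Python 3.8 or later.

```python
import math


def attraction_reward(prev_pos, cur_pos, goal):
    """R_att = k * delta_d, where k depends on the distance to the goal.

    Assumption: the band (15 / 10 / 5) is chosen from the distance
    measured after the move.
    """
    rho_prev = math.dist(prev_pos, goal)
    rho_cur = math.dist(cur_pos, goal)
    delta_d = rho_prev - rho_cur  # positive when the UAV got closer
    if rho_cur < 5:
        scale = 15
    elif rho_cur < 20:
        scale = 10
    else:
        scale = 5
    return scale * delta_d
```

Moving away from the goal yields a negative \(\Delta d\), so the same term acts as a penalty for retreating.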
The repulsion penalty \(R_{\text{rep}}\) discourages the UAV from approaching obstacles and thereby enforces a safety margin. Each obstacle \(i\) within a perception range \(d_{\text{rep}}\) contributes a penalty that grows as the distance shrinks:
$$ R_{\text{rep}} = \sum_{i=1}^{N}
\begin{cases}
-\frac{10}{d_i^2} & \text{if } d_i \leq d_{\text{rep}} \\
0 & \text{otherwise}
\end{cases} $$
Here, \(d_i\) is the shortest distance from the UAV to obstacle \(i\), and \(N\) is the number of obstacles; obstacles farther away than \(d_{\text{rep}}\) contribute zero. This term encourages the UAV to keep a safe distance from obstacles, reducing collision risk.
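A sketch of the repulsion term for box-shaped obstacles is given below. The point-to-box distance, the helper names, and the `d_min` clamp that guards against division by zero at contact are assumptions added for illustration; the value of \(d_{\text{rep}}\) is left as a parameter because the text does not state it.

```python
import math


def distance_to_box(point, box_min, box_max):
    """Shortest Euclidean distance from a point to an axis-aligned box."""
    gaps = [max(lo - p, 0.0, p - hi) for p, lo, hi in zip(point, box_min, box_max)]
    return math.sqrt(sum(g * g for g in gaps))


def repulsion_penalty(point, boxes, d_rep, d_min=1e-3):
    """R_rep = sum over obstacles of -10 / d_i**2 when d_i <= d_rep.

    d_min is an assumed lower clamp on d_i to avoid dividing by zero.
    """
    total = 0.0
    for box_min, box_max in boxes:
        d = distance_to_box(point, box_min, box_max)
        if d <= d_rep:
            total -= 10.0 / max(d, d_min) ** 2
    return total
```

For a point two cells away from a single box, the penalty is \(-10 / 2^2 = -2.5\) when \(d_{\text{rep}} \geq 2\) and zero otherwise.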
To promote smooth trajectories, we include a heading stability reward \(R_{\text{dir}}\), which penalizes sharp turns. Let \(\theta\) be the angle between the current movement direction and the previous one (not to be confused with the network parameters \(\theta\)). Then:
$$ R_{\text{dir}} =
\begin{cases}
-5 \cdot |\cos \theta| & \text{if } \cos \theta < 0 \\
0 & \text{otherwise}
\end{cases} $$
This discourages turns exceeding 90 degrees, with the largest penalty for a full reversal, leading to more natural and energy-efficient paths.
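The heading term can be computed from two consecutive move vectors through their normalized dot product. In this sketch, returning zero when either vector has zero length (for instance on the first step of an episode) is an assumption, since the text does not cover that case.

```python
import math


def heading_reward(prev_move, cur_move):
    """R_dir = -5 * |cos(theta)| if cos(theta) < 0, else 0."""
    dot = sum(a * b for a, b in zip(prev_move, cur_move))
    norm = math.sqrt(sum(a * a for a in prev_move)) * math.sqrt(
        sum(b * b for b in cur_move)
    )
    if norm == 0:
        return 0.0  # assumed: no penalty when there is no previous move
    cos_theta = dot / norm
    return -5.0 * abs(cos_theta) if cos_theta < 0 else 0.0
```

A full reversal such as `(1, 0, 0)` followed by `(-1, 0, 0)` gives the maximum penalty of \(-5\), while a right-angle turn gives zero.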
The time penalty \(R_{\text{time}}\) encourages the UAV to complete its task quickly, with a magnitude that grows over the episode to avoid excessive penalties early on. It is given by:
$$ R_{\text{time}} = \beta_1 + \frac{\beta_2 \cdot t}{T} $$
where \(\beta_1\) and \(\beta_2\) are constants, \(t\) is the current time step, and \(T\) is the maximum number of steps allowed per episode. This term trades off path length against time efficiency.
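As a worked instance of this expression, the sketch below evaluates it with the values listed in Table 1 (\(\beta_1 = 0.3\), \(\beta_2 = 10\), \(T = 500\)). The function returns the value of the expression exactly as written; whether it enters the total reward with a negative sign is not stated in the text and is left to the caller.

```python
def time_penalty(t, max_steps=500, beta1=0.3, beta2=10.0):
    """Value of beta1 + beta2 * t / T at time step t.

    Defaults follow Table 1. The sign with which this value enters the
    total reward is an open assumption and is not applied here.
    """
    return beta1 + beta2 * t / max_steps
```

With these values the expression rises linearly from 0.3 at \(t = 0\) to 5.3 at \(t = 250\) and 10.3 at \(t = T\).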
Finally, the sparse terms \(R_{\text{goal}} = 1000\) for reaching the goal and \(R_{\text{fail}} = -50\) for collisions or boundary violations provide clear success and failure signals. Taken together, the composite reward function addresses the limitations of purely sparse designs by offering dense guidance at every step.
We evaluate our RNDQN algorithm through simulations in a 3D grid environment, with obstacles modeled as axis-aligned bounding boxes. The experimental setup uses Python 3.8 with the Gym library for environment construction. The key hyperparameters, which govern the learning dynamics, are summarized in Table 1.
| Parameter Symbol | Parameter Name | Value |
|---|---|---|
| \(\alpha\) | Learning rate | 0.001 |
| \(\beta_1\) | Minimum time penalty | 0.3 |
| \(\beta_2\) | Maximum time penalty | 10 |
| \(T\) | Max time steps per episode | 500 |
| \(\sigma\) | Noise standard deviation | 0.7 |
| \(W_1\) | Hidden layer 1 size | 128 |
| \(W_2\) | Hidden layer 2 size | 256 |
| \(W_3\) | Hidden layer 3 size | 256 |
| \(\gamma\) | Discount factor | 0.995 |
The obstacle configuration is detailed in Table 2 and represents a cluttered 3D space typical of real-world UAV operations. Each obstacle is defined by the start and end coordinates of its bounding box in the grid.
| Obstacle ID | Start Coordinates (x,y,z) | End Coordinates (x,y,z) |
|---|---|---|
| 1 | (4,8,0) | (10,14,15) |
| 2 | (12,4,0) | (16,8,7) |
| 3 | (0,14,0) | (4,20,8) |
| 4 | (5,15,0) | (9,19,20) |
| 5 | (11,12,0) | (15,15,9) |
| 6 | (12,16,0) | (16,20,8) |
| 7 | (18,10,0) | (23,14,10) |
| 8 | (22,24,0) | (26,30,8) |
| 9 | (18,25,0) | (22,28,6) |
| 10 | (3,22,0) | (10,28,10) |
We compare RNDQN against two baselines: standard DQN and the Rapidly-exploring Random Tree (RRT) algorithm, a popular sampling-based method for UAV path planning. Performance metrics include path length, number of waypoints (a proxy for smoothness), and computation time. The start and goal positions are set to (2,2,6) and (20,20,7), respectively, so that a direct route is blocked by several obstacles.
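The obstacle boxes of Table 2 and the start and goal positions can be encoded directly for a collision check. This sketch treats box bounds as inclusive, which is an assumption, and the names `OBSTACLES` and `in_collision` are hypothetical.

```python
# (start, end) corners of each axis-aligned box, copied from Table 2.
OBSTACLES = [
    ((4, 8, 0), (10, 14, 15)),
    ((12, 4, 0), (16, 8, 7)),
    ((0, 14, 0), (4, 20, 8)),
    ((5, 15, 0), (9, 19, 20)),
    ((11, 12, 0), (15, 15, 9)),
    ((12, 16, 0), (16, 20, 8)),
    ((18, 10, 0), (23, 14, 10)),
    ((22, 24, 0), (26, 30, 8)),
    ((18, 25, 0), (22, 28, 6)),
    ((3, 22, 0), (10, 28, 10)),
]

START = (2, 2, 6)
GOAL = (20, 20, 7)


def in_collision(point, obstacles=OBSTACLES):
    """True if the point lies inside any box (bounds assumed inclusive)."""
    return any(
        all(lo <= p <= hi for p, lo, hi in zip(point, box_min, box_max))
        for box_min, box_max in obstacles
    )
```

Under this encoding both `START` and `GOAL` lie in free space, while a cell such as `(5, 10, 3)` falls inside obstacle 1.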
The training curves for DQN and RNDQN are analyzed to assess convergence. With standard DQN, the reward oscillates significantly between episodes 200 and 400 due to the \(\epsilon\)-greedy exploration strategy, stabilizing around episode 400. In contrast, RNDQN exhibits faster and more stable convergence, with rewards plateauing near episode 200. This improvement stems from prioritized experience replay, which focuses learning on critical transitions, and noisy exploration, which reduces ineffective random actions. Together, these mechanisms improve learning efficiency, as evidenced by the smoother reward progression.
The planned paths illustrate the advantage of RNDQN. While all three algorithms find feasible paths, RNDQN produces shorter and smoother trajectories with fewer sharp turns. Quantitative comparisons are summarized in Table 3, based on 30 independent runs with randomized start positions and obstacle seeds to ensure statistical robustness.
| Algorithm | Path Length (mean ± std) | Number of Waypoints (mean ± std) | Time per Episode (s, mean ± std) |
|---|---|---|---|
| RRT | 44.16 ± 1.58 | 24 ± 1.3 | 0.42 ± 0.05 |
| DQN | 42.48 ± 1.46 | 22 ± 1.2 | 0.25 ± 0.03 |
| RNDQN | 37.72 ± 1.32 | 17 ± 1.2 | 0.22 ± 0.02 |
As shown, RNDQN reduces the average path length by 14.6% compared to RRT and 11.2% compared to DQN. Similarly, the number of waypoints decreases by 29.2% and 22.7%, respectively, indicating smoother paths with fewer abrupt direction changes. The time per episode is also lower (0.22 s versus 0.42 s for RRT and 0.25 s for DQN), which supports the suitability of RNDQN for real-time applications. These improvements align with the design goals of enhancing path quality and safety in complex environments.
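The relative reductions follow from the means in Table 3 as \(100 \cdot (\text{baseline} - \text{RNDQN}) / \text{baseline}\). The short check below recomputes them from the rounded table values, so the last digit may differ slightly from figures derived from unrounded data; the function name is a hypothetical helper.

```python
def reduction_pct(baseline, improved):
    """Relative reduction of `improved` with respect to `baseline`, in percent."""
    return 100.0 * (baseline - improved) / baseline


# Means taken from Table 3.
path_vs_rrt = reduction_pct(44.16, 37.72)
path_vs_dqn = reduction_pct(42.48, 37.72)
waypoints_vs_rrt = reduction_pct(24, 17)
waypoints_vs_dqn = reduction_pct(22, 17)
```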
To further validate the reward function, we analyze the contribution of each component. The attraction reward guides the UAV toward the goal early in training, while the repulsion penalty keeps it away from obstacles along the way. The heading stability reward reduces erratic movements, and the time penalty encourages faster completion. Together, these terms address the sparse reward problem by providing continuous feedback that accelerates policy learning. This is particularly important in 3D spaces, where the UAV must balance multiple objectives.
In conclusion, we have presented an improved DQN-based algorithm, RNDQN, for UAV path planning in 3D environments. By integrating prioritized experience replay and noisy exploration, we improve learning efficiency and exploration effectiveness. The composite reward function, inspired by artificial potential fields, offers dense guidance that reduces path length and the number of waypoints while penalizing proximity to obstacles. In our experiments, RNDQN shortens the average path by 14.6% relative to RRT and 11.2% relative to standard DQN, with fewer waypoints and lower time per episode. This work provides a framework for autonomous UAV navigation, with potential applications in surveillance, delivery, and search-and-rescue missions. Future research could extend this approach to dynamic environments, multi-UAV coordination, and real-world deployment with sensor uncertainties.
