Urban Low-Altitude UAV Path Planning Based on Reinforcement Learning

With the emergence of the low-altitude economy, low-altitude airspace has become increasingly pivotal to the transformation of urban transportation. Unmanned aerial vehicles (UAVs) and other novel transportation modes are being used increasingly in urban spaces; however, they face challenges such as path planning, dynamic obstacle avoidance, and multi-UAV coordination in complex low-altitude environments. To address these challenges, we optimize UAV path-planning algorithms to enhance the path-planning and obstacle-avoidance capabilities of low-altitude drones in complex environments while ensuring efficient and safe multi-UAV collaborative operation.

Literature Review

Existing path-planning methods face significant limitations in low-altitude urban environments. Traditional algorithms like Dijkstra and A* struggle with dynamic obstacles and high-dimensional spaces, while Artificial Potential Field (APF) methods suffer from local optima. Intelligent bionic algorithms such as neural networks and genetic algorithms require extensive training data and lack real-time adaptability. Although recent reinforcement learning approaches show promise in single-agent scenarios, they remain inadequate for multi-UAV coordination in dynamic low-altitude airspace. Our review identifies three critical gaps:

| Method Category | Representative Algorithms | Limitations in Low-Altitude UAV Context |
| --- | --- | --- |
| Traditional Methods | Dijkstra, A*, APF, RRT | Static environment assumptions; poor scalability; local optima |
| Intelligent Bionic Algorithms | Neural Networks, Ant Colony, Genetic Algorithms | High computational load; data dependency; slow convergence |
| Reinforcement Learning | DQN, DDQN, Dueling DQN | Single-agent focus; inadequate multi-UAV coordination |

Current research fails to adequately address the unique requirements of low-altitude UAV operations, where wind dynamics, 3D obstacle avoidance, and fleet coordination must be optimized simultaneously. This necessitates novel algorithms capable of handling these multidimensional constraints.

Methodology

Kinematic Model

We establish a 6-DOF kinematic model for low-altitude drones operating in urban wind fields. World-frame velocities ($\dot{x}, \dot{y}, \dot{z}$) and Euler-angle rates ($\dot{\psi}, \dot{\theta}, \dot{\phi}$) are obtained from body-frame linear velocities ($u, v, w$) and angular rates ($p, q, r$) through Euler angle (ZYX) transformations, where $c$ and $s$ abbreviate cosine and sine:

$$
\begin{bmatrix}
\dot{x} \\
\dot{y} \\
\dot{z} \\
\dot{\psi} \\
\dot{\theta} \\
\dot{\phi}
\end{bmatrix} =
\begin{bmatrix}
c\psi c\theta & c\psi s\theta s\phi - s\psi c\phi & c\psi s\theta c\phi + s\psi s\phi & 0 & 0 & 0 \\
s\psi c\theta & s\psi s\theta s\phi + c\psi c\phi & s\psi s\theta c\phi - c\psi s\phi & 0 & 0 & 0 \\
-s\theta & c\theta s\phi & c\theta c\phi & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & s\phi / c\theta & c\phi / c\theta \\
0 & 0 & 0 & 0 & c\phi & -s\phi \\
0 & 0 & 0 & 1 & s\phi \tan\theta & c\phi \tan\theta
\end{bmatrix}
\begin{bmatrix}
u \\
v \\
w \\
p \\
q \\
r
\end{bmatrix}
$$
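The linear-velocity block of this transform is the standard ZYX (yaw-pitch-roll) rotation matrix. A minimal sketch of that block in NumPy (the function name is illustrative, not from the paper):

```python
import numpy as np

def body_to_world_velocity(psi, theta, phi, uvw):
    """Rotate body-frame linear velocity (u, v, w) into the world frame
    using the standard ZYX (yaw psi, pitch theta, roll phi) rotation matrix."""
    c, s = np.cos, np.sin
    R = np.array([
        [c(psi)*c(theta), c(psi)*s(theta)*s(phi) - s(psi)*c(phi), c(psi)*s(theta)*c(phi) + s(psi)*s(phi)],
        [s(psi)*c(theta), s(psi)*s(theta)*s(phi) + c(psi)*c(phi), s(psi)*s(theta)*c(phi) - c(psi)*s(phi)],
        [-s(theta),       c(theta)*s(phi),                        c(theta)*c(phi)],
    ])
    return R @ np.asarray(uvw, dtype=float)
```

With all angles zero the rotation is the identity; a pure 90° yaw maps forward body velocity onto the world y-axis.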

The dynamics follow Newton-Euler formulations:

$$
\mathbf{M}(\mathbf{q})\ddot{\mathbf{q}} + \mathbf{C}(\mathbf{q},\dot{\mathbf{q}})\dot{\mathbf{q}} + \mathbf{g}(\mathbf{q}) = \boldsymbol{\tau}
$$

where $\mathbf{M}$ denotes the mass matrix, $\mathbf{C}$ captures Coriolis effects, $\mathbf{g}$ represents gravity forces, and $\boldsymbol{\tau}$ is the torque vector.
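Given $\mathbf{M}$, $\mathbf{C}$, and $\mathbf{g}$, the state can be advanced by solving the formulation for $\ddot{\mathbf{q}}$. A minimal Euler-integration sketch (the function name and step size are illustrative):

```python
import numpy as np

def forward_dynamics_step(M, C, g, q_dot, tau, dt):
    """Solve q_ddot = M^{-1} (tau - C q_dot - g) and take one explicit
    Euler step on the generalized velocity q_dot."""
    q_ddot = np.linalg.solve(M, tau - C @ q_dot - g)
    return q_dot + dt * q_ddot
```

For a unit mass matrix with no Coriolis or gravity terms, a unit torque on the first coordinate simply accelerates that coordinate by `dt` per step.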

NP-MTDDQN Algorithm

Our NP-MTDDQN algorithm integrates three innovations to address low-altitude UAV path-planning challenges:

1. N-step Updating: Expands the temporal horizon for value estimation to reduce myopic decisions:

$$
G_t^{(N)} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{N-1} R_{t+N} + \gamma^{N} V(s_{t+N})
$$
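The N-step return is a discounted sum of the next $N$ rewards plus a bootstrapped value estimate; a minimal sketch:

```python
def n_step_return(rewards, bootstrap_value, gamma=0.99):
    """Compute G_t^(N) = sum_{k=0}^{N-1} gamma^k * R_{t+k+1} + gamma^N * V(s_{t+N}).

    rewards: the next N rewards observed from time t.
    bootstrap_value: the critic's estimate V(s_{t+N}).
    """
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r       # discounted reward term
    g += (gamma ** len(rewards)) * bootstrap_value  # bootstrapped tail
    return g
```

For example, with two rewards of 1, a bootstrap value of 10, and $\gamma = 0.5$, the return is $1 + 0.5 + 0.25 \cdot 10 = 4$.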

2. Enhanced Prioritized Experience Replay (PER): Dynamically balances positive/negative experiences to accelerate convergence. Let $E_{\text{positive}}$ denote positive experiences (reaching targets) and $E_{\text{total}}$ represent total experiences. The enhanced mechanism maintains an optimal positive experience ratio:

$$
E'_{\text{positive}} = \alpha \left( \frac{E_{\text{positive}}}{E_{\text{total}} - E_{\text{positive}}} \right) \quad (\alpha = 2.0)
$$
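One way to realize this balancing is a two-list buffer that oversamples positive (target-reached) transitions when drawing a batch. The sketch below is illustrative only: the capacity, eviction policy, and the way the ratio is capped are assumptions, not details from the paper.

```python
import random

class BalancedReplayBuffer:
    """Replay buffer keeping positive and negative experiences separately,
    oversampling positives by roughly alpha * E_pos / (E_total - E_pos)."""

    def __init__(self, capacity=10000, alpha=2.0):
        self.positive, self.negative = [], []
        self.capacity, self.alpha = capacity, alpha

    def add(self, transition, reached_target):
        buf = self.positive if reached_target else self.negative
        buf.append(transition)
        if len(buf) > self.capacity:
            buf.pop(0)  # drop the oldest experience (assumed FIFO eviction)

    def sample(self, batch_size):
        e_pos, e_neg = len(self.positive), len(self.negative)
        if e_pos == 0:
            return random.sample(self.negative, min(batch_size, e_neg))
        # target share of positives, from alpha * E_pos / E_neg, capped at 1
        ratio = min(1.0, self.alpha * e_pos / max(e_neg, 1))
        n_pos = min(e_pos, max(1, int(batch_size * ratio)))
        n_neg = min(e_neg, batch_size - n_pos)
        return random.sample(self.positive, n_pos) + random.sample(self.negative, n_neg)
```

With 5 positive and 20 negative transitions and $\alpha = 2$, a batch of 8 draws 4 positives, so rare successes dominate the batch far beyond their 20% share of the buffer.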

3. Multi-Target Coordination: Optimizes fleet-level objectives through joint state-action formulation:

$$
\min_\pi \left[ \sum_{i=1}^N d_i(s_i, a_i) + \sum_{i=1}^N \mathbb{E}[I_i(\mathbf{s}, \mathbf{a})] \right]
$$

where $d_i$ measures task completion time and $I_i$ quantifies path interference penalties.
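The fleet-level objective can be evaluated as a sum of per-UAV target-distance terms plus pairwise interference penalties; in the sketch below the interference radius and penalty magnitude are assumed values for illustration:

```python
import numpy as np

def fleet_cost(positions, targets, interference_radius=10.0, penalty=50.0):
    """Fleet objective: sum of distance-to-target terms (the d_i) plus a
    fixed penalty (standing in for I_i) for each UAV pair flying too close."""
    positions = np.asarray(positions, dtype=float)
    targets = np.asarray(targets, dtype=float)
    dist_cost = np.linalg.norm(positions - targets, axis=1).sum()
    interference = 0.0
    n = len(positions)
    for i in range(n):
        for j in range(i + 1, n):  # each unordered UAV pair once
            if np.linalg.norm(positions[i] - positions[j]) < interference_radius:
                interference += penalty
    return dist_cost + interference
```

A planner minimizing this quantity trades off individual progress toward targets against keeping safe separation within the fleet.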

The network architecture (Figure 1) processes 3D positional data and wind vectors through convolutional layers, with dueling streams separating value and advantage estimation.

Figure 1: Visualization of low-altitude drone path planning in an urban wind field environment.

Experiments

Environment Setup

We constructed a 2km × 2km × 100m urban airspace using wind field data from Jinan, China. The environment features:

| Component | Specification |
| --- | --- |
| Buildings | 4-32 static obstacles with priority levels |
| Dynamic Obstacles | 4-8 mobile agents simulating other UAVs |
| Wind Vectors | Directional flow affecting movement efficiency |

The reward function combines multiple objectives:

$$
R = \begin{cases}
500 & \text{Target reached} \\
\beta_1 \Delta s + \beta_2 \|\mathbf{p}_a - \mathbf{p}_t\|^2 + \cos|\phi - \phi'| + \\
\beta_3 \sum \|\mathbf{p}_a - \mathbf{B}_i\|^2 + \beta_4 \text{collision}_{\text{building}} + \\
\beta_5 \sum \|\mathbf{p}_a - \mathbf{C}_j\|^2 + \beta_6 \text{collision}_{\text{UAV}} & \text{Otherwise}
\end{cases}
$$

with parameters $\beta_1 = -0.01$, $\beta_2 = -0.1$, $\beta_3 = \beta_5 = 0.001$, $\beta_4 = \beta_6 = 10$.
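A direct port of this piecewise reward might look as follows. Argument names are illustrative, and the collision terms are applied as penalties, since $\beta_4 = \beta_6 = 10$ are listed as positive magnitudes; that sign convention is an assumption.

```python
import numpy as np

# Coefficients as given in the paper
B1, B2, B3, B4, B5, B6 = -0.01, -0.1, 0.001, 10.0, 0.001, 10.0

def step_reward(reached, delta_s, pos, target, heading_err,
                building_dists, uav_dists, hit_building, hit_uav):
    """Piecewise shaped reward: +500 on reaching the target, otherwise a
    weighted sum of step-length, target-distance, heading, clearance, and
    collision terms."""
    if reached:
        return 500.0
    pos, target = np.asarray(pos, float), np.asarray(target, float)
    r = B1 * delta_s                                # step-length cost
    r += B2 * np.sum((pos - target) ** 2)           # squared distance to target
    r += np.cos(abs(heading_err))                   # heading alignment bonus
    r += B3 * np.sum(np.square(building_dists))     # building clearance
    r -= B4 * float(hit_building)                   # building collision (penalty sign assumed)
    r += B5 * np.sum(np.square(uav_dists))          # inter-UAV clearance
    r -= B6 * float(hit_uav)                        # UAV collision (penalty sign assumed)
    return float(r)
```

At the target the reward is the flat terminal bonus; a stationary UAV sitting on its target with zero heading error and no obstacles nearby collects only the heading term.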

Results

Obstacle Density Tests: Our NP-MTDDQN consistently outperformed baselines across building densities (Table 1). The convergence advantage stems from prioritized experience balancing and multi-step updates.

Table 1: Performance comparison in varying obstacle densities

| Algorithm | 4 Buildings (Reward) | 32 Buildings (Reward) | Convergence Speed |
| --- | --- | --- | --- |
| DQN | 182.3 ± 15.7 | 71.6 ± 22.4 | Slow with oscillations |
| DDQN | 210.5 ± 12.8 | 98.3 ± 18.2 | Moderate |
| NP-MTDDQN | 315.2 ± 8.3 | 283.7 ± 9.6 | Fastest (40% faster than DDQN) |

Priority Obstacle Avoidance: When navigating among 32 buildings (8 high-priority), our algorithm achieved 97% avoidance of critical structures versus 63% for DDQN. The joint reward formulation effectively penalized proximity to high-risk zones:

$$
R_{\text{priority}} = \begin{cases}
-100 & \text{High-priority collision} \\
-10 & \text{Low-priority collision}
\end{cases}
$$

Dynamic Obstacle Tests: With 8 moving obstacles, NP-MTDDQN maintained 92% mission success versus DDQN’s 67%. The N-step mechanism enabled anticipatory maneuvers critical for low-altitude drone safety in congested airspace.

Cross-City Validation

Generalizability was verified in Xi’an (NE wind, 1.8 m/s) and Kunming (SW wind, 3.6 m/s) environments. Despite divergent wind patterns, our algorithm achieved stable convergence (Figure 2), demonstrating adaptability to diverse low-altitude UAV operational scenarios.

Conclusion

Our NP-MTDDQN algorithm advances low-altitude UAV path planning through three key innovations: 1) N-step temporal extension for long-horizon decision-making, 2) dynamically balanced experience replay focusing on critical learning events, and 3) multi-UAV coordination optimizing fleet-level objectives. Validated in realistic urban wind fields with static and dynamic obstacles, the algorithm demonstrates superior efficiency, safety, and scalability compared to state-of-the-art methods. Future work will integrate real-time air traffic control data to enhance applicability in metropolitan low-altitude drone networks.
