Trajectory Planning for Firefighting UAV in Urban Fire Rescue Based on Improved Deep Q-Network

The rapid urbanization process has led to the formation of densely packed high-rise buildings and frequent traffic congestion, which are typical characteristics of modern cities. These factors significantly increase the difficulty of urban fire rescue operations. In particular, within older districts and commercial areas featuring dense construction, narrow roads, and haphazardly parked vehicles, traditional firefighting trucks are often unable to arrive promptly due to traffic jams and blocked fire lanes. Such delays in response can be catastrophic, especially in high-rise building fires. Urban fire rescue primarily faces three core challenges: firstly, fires in super-tall buildings exceed the operational height limits of ladder trucks; secondly, confined workspaces restrict the access of firefighting vehicles; and thirdly, complex traffic environments make it difficult to guarantee rescue timeliness. Firefighting Unmanned Aerial Vehicles (UAV drones), owing to their flexibility, rapid response capability, and ability to navigate around obstacles, are gradually becoming vital tools in high-rise fire rescue. They can circumvent traffic congestion and narrow streets to quickly reach the fire scene and execute fire suppression tasks.

Trajectory planning, as a key technology for the autonomous firefighting operations of UAV drones in urban fire rescue, is the core to enabling UAV drones to find safe and feasible flight paths quickly and accurately. The choice of algorithm directly determines flight efficiency, safety, and energy consumption. Commonly used algorithms in the field of path planning include Ant Colony Optimization (ACO), Rapidly-exploring Random Tree (RRT), Particle Swarm Optimization (PSO), Genetic Algorithm (GA), and Sparrow Search Algorithm (SSA). While these algorithms can effectively solve complex path planning problems to a certain extent, they still have limitations when dealing with the specific, urgent, and dynamically changing environment of urban fire rescue. For instance, traditional optimization algorithms like ACO and GA typically rely on global environmental information and have high computational complexity, making it difficult to meet the requirements for rapid real-time response in rescue scenarios. Although the RRT algorithm can adapt to complex obstacle environments, the paths it generates are often not smooth or optimal, making it challenging to ensure the stability and safety of UAV drone flight. Algorithms like PSO and SSA often require re-planning in the face of sudden dynamic obstacles, lacking sufficient timeliness. In contrast, deep reinforcement learning algorithms are particularly suitable due to their powerful online learning and adaptive decision-making capabilities. These algorithms can continuously optimize decision-making strategies through ongoing interaction with the environment, without relying on complete prior maps, and can effectively handle obstacles and unexpected situations that may arise during a rescue.

In this work, we propose an improved Deep Q-Network (IDQN)-based trajectory planning algorithm for firefighting UAV drones, specifically tailored for urban fire rescue scenarios. The UAV drone, carrying dry powder fire extinguishers or equipped with water-based fire extinguishing balls, can traverse dense high-rise buildings to achieve auxiliary fire suppression. Based on the environmental characteristics of urban high-rise building fire rescue, task requirements, and the performance parameters of firefighting UAV drones, we have made targeted improvements to the algorithm: a prioritized experience replay mechanism is introduced to enhance training efficiency; the state space and action space are redesigned to suit this specific scenario; and a composite reward function is constructed that primarily features an exponential time penalty, combined with obstacle avoidance penalties and goal-reaching rewards.

Modeling of the Firefighting UAV Drone Trajectory Planning Problem

Problem Description

Firefighting UAV drones primarily participate in high-rise building fire suppression in two modes: vertical takeoff from below the fire point, or taking off from the fire truck parking point and traversing building clusters. This study focuses on the latter mode. The goal is to plan an optimal flight trajectory from the parking point to the fire point under the premise of satisfying the UAV drone’s three-dimensional maneuverability constraints, while ensuring absolute safety and minimizing time. The UAV drone must navigate a complex urban canyon environment filled with static obstacles (buildings).

Constraints

The flight trajectory of a firefighting UAV drone in a rescue mission must consider multiple factors. The comprehensive constraints are as follows:

$$ \begin{cases}
\sqrt{\dot{x}(t)^2 + \dot{y}(t)^2} \leq V_{\text{max}} \\
0 \leq z(t) \leq h_{\text{max}} \\
t \leq T_{\text{endurance}} \\
\|\ddot{\mathbf{r}}(t)\| \leq a_{\text{max}} \\
\|\mathbf{r}(t) – \mathbf{r}_{\text{base}}\| \leq R_{\text{operation}} \\
\|\mathbf{r}(t) – \mathbf{o}_j\| \geq d_{\text{safe}}
\end{cases} $$

where $V_{\text{max}}$ is the maximum cruise speed, $h_{\text{max}}$ is the maximum cruise altitude, $T_{\text{endurance}}$ is the maximum endurance time, $a_{\text{max}}$ is the maximum acceleration, $R_{\text{operation}}$ is the maximum operational radius, and $d_{\text{safe}}$ is the minimum safe distance. The variable $t$ represents time, $T$ is the total mission time, $\mathbf{r}(t)$ is the position vector of the UAV drone at time $t$, $\mathbf{r}_{\text{base}}$ is the position vector of the starting point, and $\mathbf{o}_j$ is the position vector of the $j$-th obstacle.

Objective Function

The starting point of the firefighting UAV drone is the fire truck parking point, with coordinates $\mathbf{r}_{\text{start}} = (x_0, y_0, z_0)$. The endpoint is the fire point, with coordinates $\mathbf{r}_{\text{goal}} = (x_g, y_g, z_g)$. The UAV drone needs to sequentially pass through a set of predefined waypoints $(\mathbf{r}_1, \mathbf{r}_2, …, \mathbf{r}_n)$ and finally reach the fire point. The objective is to plan the shortest flight path length $L$ to minimize the total mission time $T$:

$$ \min L = \sum_{k=1}^{n} \|\mathbf{r}_{k+1} – \mathbf{r}_k\| $$

where $k$ is the index of the waypoint, and $\|\mathbf{r}_{k+1} – \mathbf{r}_k\|$ is the Euclidean distance between two consecutive waypoints.

Algorithm Design

Deep Reinforcement Learning Algorithm

Reinforcement learning (RL) is centered on an agent that actively interacts with its environment, learning through exploration and receiving rewards for its actions. Q-learning is a model-free, value-based method that learns an action-value function $Q(s, a)$, representing the expected cumulative reward for taking action $a$ in state $s$. The Q-learning update rule is:

$$ Q_{\text{new}}(s, a) = Q(s, a) + \alpha [r + \gamma \max_{a’} Q(s’, a’) – Q(s, a)] $$

where $s$ is the current state, $a$ is the current action, $s’$ is the next state, $a’$ is an action in the next state, $\alpha$ is the learning rate, $\gamma$ is the discount factor, and $r$ is the immediate reward.

Deep Reinforcement Learning combines the decision-making optimization capabilities of RL with the high-dimensional representation power of deep neural networks. The Deep Q-Network (DQN) algorithm enhances this by employing an experience replay mechanism to improve data efficiency and stabilize training, alongside a target network to reduce fluctuations in Q-value estimation, effectively overcoming the limitations of traditional Q-learning in high-dimensional state spaces.

Improved Deep Q-Network (IDQN) Algorithm

Our proposed improvements to the standard DQN focus on three key aspects tailored for the firefighting UAV drone rescue scenario.

1. Prioritized Experience Replay

Experience replay stores the agent’s interaction samples $(s, a, r, s’)$ in a replay buffer for later training. Prioritized Experience Replay improves upon uniform sampling by assigning each experience a priority based on its Temporal Difference (TD) error, $\delta_t$. Experiences with larger absolute TD error are more valuable for learning and are sampled more frequently. The TD-error is calculated as:

$$ \delta_t = r + \gamma \max_{a’} Q(s’, a’; \boldsymbol{\theta}_t^-) – Q(s, a; \boldsymbol{\theta}) $$

where $\boldsymbol{\theta}$ are the parameters of the current Q-network and $\boldsymbol{\theta}_t^-$ are the parameters of the target network. The sampling probability $P(i)$ for experience $i$ is proportional to its priority, $p_i^\sigma$, where $\sigma$ controls the prioritization strength. To correct for the bias introduced by non-uniform sampling, an importance-sampling weight $w_i$ is applied during the Q-update:

$$ P(i) = \frac{p_i^\sigma}{\sum_k p_k^\sigma}, \quad w_i = \left( \frac{1}{N \cdot P(i)} \right)^\beta $$

Here, $N$ is the replay buffer size and $\beta$ is a hyperparameter that anneals from 0 to 1.

2. State and Action Space Design

The state space $\mathcal{S}$ comprehensively describes the UAV drone’s situation at time $t$ and the relevant environmental information needed for decision-making. It is defined as:

$$ \mathcal{S}_t = \{ \mathbf{P}_t, \mathbf{G}_t, d_{t}^{o,0}, d_{t}^{o,1}, …, d_{t}^{o,k-1} \} $$

where $\mathbf{P}_t = (x_t, y_t, z_t)$ is the UAV drone’s current 3D position, $\mathbf{G}_t = (x_g, y_g, z_g)$ is the target (fire) position, and $d_{t}^{o,j} = (\Delta x^{o,j}, \Delta y^{o,j}, \Delta z^{o,j})$ represents the relative distance vector from the UAV drone to the $j$-th obstacle (for $j = 0, …, k-1$, with $k$ being the total number of perceived obstacles).

The action space is designed to allow full 3D mobility. The UAV drone can move one step (of a defined length) in any of 26 discrete directions: 9 directions in the plane above (inclined upward), 8 directions in the same horizontal plane, and 9 directions in the plane below (inclined downward). This provides the UAV drone with the flexibility to efficiently ascend, descend, and maneuver laterally around urban structures.

3. Reward Function Design

To meet the extreme timeliness requirements of urban fire rescue missions, we designed a composite reward function. This function combines an exponentially increasing time penalty with obstacle avoidance penalties and goal-reaching rewards to guide the UAV drone towards fast, safe, and goal-oriented behavior.

Flight Time Reward ($R_t$): A negative reward that grows exponentially with mission time encourages the UAV drone to find the fastest path. Based on comparative experiments, we set the parameters $\delta=0.01$ and $\mu=2$.

$$ R_t = -\delta \cdot t^\mu $$

Flight Safety (Collision) Penalty ($R_c$): A large negative penalty is given if the UAV drone’s next action would result in a collision with an obstacle, forcing it to stay in place or choose a safer action. A penalty of -50 was found effective.

$$ R_c = -50 \quad \text{(if collision)} $$

Goal-Directed Reward ($R_g$): This dense reward encourages movement towards the goal. It is based on the change in distance to the target before ($d_p$) and after ($d_n$) an action.

$$ R_g = d_p – d_n $$

Goal Achievement Reward ($R_{goal}$): A large positive reward is given upon successful arrival at the target fire point, incentivizing task completion. A reward of +1000 was selected.

$$ R_{goal} = +1000 \quad \text{(if goal reached)} $$

The total reward $R_{total}$ for a transition is the sum of the applicable components:

$$ R_{total} = R_t + R_c + R_g + R_{goal} $$

The effectiveness of key reward parameters was validated through ablation studies, as summarized in the tables below.

Table 1: Comparison of Results for Different Time Penalty Parameters $(\delta, \mu)$
Parameter Combination $(\delta, \mu)$	Average Mission Time (s)	Success Rate (%)
(0.100, 1)	5.7	83
(0.010, 2)	4.9	95
(0.001, 3)	5.5	91

Table 2: Comparison of Results for Different Collision Penalty Values
Collision Penalty Value	Collision Rate (%)	Average Mission Time (s)
-10	18.2	4.5
-50	5.3	4.6
-100	2.1	4.9

Table 3: Comparison of Results for Different Goal Achievement Reward Values
Goal Reward Value	Success Rate (%)	Average Mission Time (s)
+500	92	5.2
+1000	95	4.9
+2000	94	4.8

Simulation Experiments and Results Analysis

Simulation Setup and Platform

The simulation platform uses a computer with an R7-6800H 8-core processor, 16GB RAM, and an RTX 3050 GPU. Key algorithm parameters are set as: learning rate $\alpha = 10^{-3}$, discount factor $\gamma = 0.99$, initial exploration rate $\epsilon_0 = 0.9$ with a decay rate, hidden layer nodes of 256 and 128, replay buffer size of $10^6$, and priority exponent $\sigma = 0.8$.

The UAV drone model is based on the TH-MH-30 multi-rotor platform, selected for its stability, payload capacity (30 kg), endurance (30 min), speed (30 km/h), operational ceiling (200 m), and 5 km operational radius, making it suitable for demanding urban fire rescue tasks.

Simulation Environment Modeling

Two distinct 3D simulation environments were constructed based on real-world urban fire incidents to test the algorithm under different complexity levels.

Case 1 (Mingshang Xiyuan): Simulates a (180 × 180 × 150) m area with 8 buildings. The tallest building is 150 m high, alongside lower structures like an 18 m kindergarten. The UAV drone must navigate this cluster.

Case 2 (Jiaozhou Road Apartment): Simulates a more complex (150 × 180 × 200) m area containing 12 buildings, with the tallest reaching 200 m.

In both scenes, buildings are modeled as axis-aligned bounding boxes (AABB) based on real map data. The vertices of a building with origin $(x, y, z)$, length $l$, width $w$, and height $h$ are given by:

$$ \mathbf{V}_i = (x + l, \ y + w, \ z + h) $$

Simulation Results Analysis

1. Algorithm Training Performance

Both the standard DQN and our proposed IDQN algorithms were trained for 1000 episodes in each environment. The training curves for Case 2 are shown below (Case 1 exhibits similar trends). The IDQN algorithm demonstrates superior training characteristics: faster convergence (stable after ~200 episodes vs. ~400 for DQN), a more stable process without significant reward collapses, higher average episode rewards, and a smoother, more accurate increase in the predicted Q-value, indicating more effective learning.

2. Trajectory Planning Results

The planning performance of the IDQN algorithm is compared against the standard DQN and the sampling-based RRT algorithm. For each case, a start and goal point within the urban canyon were defined.

Case 1 Results: Start: (24, 20, 1) m, Goal: (140, 140, 90) m.

Table 4: Comparison of Algorithm Performance for Case 1
Algorithm	Path Length (m)	Number of Inflection Points	Planning/Execution Time (s)
RRT	321.34	18	10.041
DQN	259.30	10	4.803
IDQN (Proposed)	248.10	6	4.356

The IDQN algorithm generated the shortest and smoothest path. Compared to RRT and DQN, it reduced the path length by 22.78% and 4.32%, and the number of sharp turns by 66.67% and 40.00%, respectively.

Case 2 Results: Start: (30, 10, 1) m, Goal: (75, 150, 60) m.

Table 5: Comparison of Algorithm Performance for Case 2
Algorithm	Path Length (m)	Number of Inflection Points	Planning/Execution Time (s)
RRT	242.33	13	6.195
DQN	189.35	11	3.928
IDQN (Proposed)	180.79	4	3.756

In the more complex Case 2, the advantages of the IDQN algorithm are even more pronounced. It achieved a 25.40% and 4.52% reduction in path length, and a 69.23% and 63.64% reduction in inflection points compared to RRT and DQN, respectively.

Conclusion

This paper presents an improved Deep Q-Network (IDQN) algorithm for the trajectory planning of firefighting UAV drones in urban fire rescue scenarios. Building upon the traditional DQN framework, the proposed algorithm incorporates a prioritized experience replay mechanism to enhance learning efficiency. It features a redesigned state space that fuses positional and obstacle proximity information, an expansive 26-direction action space enabling full 3D mobility, and a critically designed composite reward function. This reward function uses an exponentially increasing time penalty as its core driver, combined with obstacle collision penalties and goal-oriented rewards, to explicitly optimize for the speed and safety required in emergency response.

Simulation experiments conducted in two distinct urban environments modeled after real fire incidents demonstrate the efficacy of the proposed approach. Compared to the standard DQN and the sampling-based RRT algorithm, the IDQN algorithm consistently plans shorter, smoother flight paths with significantly fewer sharp turns. This optimization directly translates to a reduction in the time required for the UAV drone to reach the fire point, thereby enhancing the timeliness and potential success rate of the critical initial response phase in urban fire rescue operations. The firefighting UAV drone, guided by this algorithm, can effectively navigate dense building clusters to deliver its payload. Future work will focus on extending this framework to handle dynamic environments and multi-UAV drone cooperative rescue missions.