In recent years, the rapid urbanization and construction of high-rise buildings have posed significant challenges for emergency response, particularly in firefighting scenarios. Urban high-rise fires are notoriously difficult to manage due to rapid spread, limited escape routes, and accessibility issues for traditional firefighting equipment. As a result, there is a growing interest in leveraging unmanned aerial vehicles (UAVs), or drones, to assist in fire detection and suppression. These fire UAVs offer a flexible and efficient solution for navigating complex urban environments, providing real-time data, and executing critical tasks. In this work, I address the problem of urban high-rise firefighting by formulating it as a partially observable Markov decision process (POMDP) and proposing a novel multi-agent reinforcement learning approach. Specifically, I develop a multi-agent proximal policy optimization (MAPPO) algorithm integrated with a β-variational autoencoder (β-VAE) to enable cooperative fire UAVs to autonomously navigate and complete firefighting missions in large-scale urban settings.
The core challenge lies in enabling multiple fire UAVs to collaborate effectively under partial observations, such as visual inputs from cameras, while avoiding obstacles and optimizing their paths to fire locations. Traditional methods often rely on predefined rules or centralized control, which may not scale well in dynamic and uncertain environments. Reinforcement learning (RL), particularly deep reinforcement learning (DRL), has shown promise in solving complex decision-making problems. However, multi-agent scenarios introduce additional complexities due to non-stationarity and coordination requirements. To tackle this, I adopt a centralized training with decentralized execution (CTDE) framework, where fire UAVs learn cooperative policies through shared experiences but act independently based on local observations. The integration of β-VAE helps process high-dimensional visual data, extracting latent features that enhance the learning efficiency of the MAPPO algorithm. Through simulations built on AirSim and UrbanScene3D, I demonstrate that the proposed approach outperforms baseline methods such as multi-agent deep deterministic policy gradient (MADDPG), achieving higher rewards and faster convergence in firefighting tasks.

To formalize the urban high-rise firefighting problem, I model it as a POMDP defined by the tuple $(N, S, O, A, R, P, \gamma)$. Here, $N$ represents the number of fire UAVs, $S$ is the set of global states, and each fire UAV $i \in \{1, 2, \dots, N\}$ receives a partial observation $o_i \in O(s, i)$ derived from the state $s \in S$. At each time step, fire UAV $i$ selects an action $a_i \in A$ according to its policy $\pi(a_i | \tau_i)$, where $\tau_i$ denotes its action-observation history. The joint action of all fire UAVs, $\mathbf{a} = \{a_1, \dots, a_N\}$, is executed in the environment, yielding a shared reward $R(s, \mathbf{a})$ and transitioning to a new state $s'$ via the state transition function $P(s' | s, \mathbf{a})$. The discount factor $\gamma \in [0, 1]$ balances immediate and future rewards. In this context, fire UAVs must navigate from starting points to a fire zone in a 3D urban environment filled with obstacles such as buildings. The fire zone is defined as a cuboid region with center $(x, y, z)$ and dimensions $(L, W, H)$, while fire UAVs operate at a fixed altitude $H_{\text{fix}}$ to simplify the dynamics. The motion of fire UAV $i$ is governed by accelerations along the $x$- and $y$-axes, as shown in the following equations:
$$v_{x,t+1}^i = v_{x,t}^i + \Delta v_{x,t}^i,$$
$$v_{y,t+1}^i = v_{y,t}^i + \Delta v_{y,t}^i,$$
$$x_{t+1}^i = x_t^i + v_{x,t+1}^i \times \Delta t,$$
$$y_{t+1}^i = y_t^i + v_{y,t+1}^i \times \Delta t,$$
subject to the magnitude constraints $|v_{x,t}^i| < v_{x,\text{max}}$, $|v_{y,t}^i| < v_{y,\text{max}}$, $|\Delta v_{x,t}^i| < \Delta v_{x,\text{max}}$, and $|\Delta v_{y,t}^i| < \Delta v_{y,\text{max}}$. Here, $v_{x,t}^i$ and $v_{y,t}^i$ denote the velocity components along the $x$- and $y$-axes, respectively, while $\Delta v_{x,t}^i$ and $\Delta v_{y,t}^i$ are the control actions (accelerations) that form the action space. Additionally, fire UAVs must maintain safe distances from obstacles and the fire zone. Let $D_t^i$ be the distance from fire UAV $i$ to the fire center at time $t$, and $l_i$ be the distance to the nearest obstacle. The constraints are:
$$D_t^i > D_{\text{min}}, \quad l_i > l_{\text{min}},$$
where $D_{\text{min}}$ ensures fire UAVs do not get too close to the fire for safety, and $l_{\text{min}}$ prevents collisions. Observations for each fire UAV consist of two parts: visual data from an onboard camera, encoded into a latent representation using β-VAE, and communication data from other fire UAVs, such as positions. These are combined into a feature vector that serves as input to the policy network. The state transition function $P(s' | s, \mathbf{a})$ is determined by the environment, which is unknown to the fire UAVs, making this a challenging POMDP.
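Under these constraints, the per-step motion update amounts to clipped Euler integration. The following is a minimal numpy sketch; the function name `step_uav` and the symmetric per-axis limits are my assumptions, with the limit values taken from the simulation setup described later (20 m/s, 2 m/s², $\Delta t = 0.1$ s):

```python
import numpy as np

# Limit values from the simulation setup; symmetric per-axis bounds assumed.
V_MAX = 20.0   # maximum speed per axis, m/s
DV_MAX = 2.0   # maximum acceleration (action magnitude), m/s^2
DT = 0.1       # time step, s

def step_uav(pos, vel, dvel):
    """Advance one fire UAV by one time step under the planar motion model.

    pos, vel, dvel are 2-element arrays holding the (x, y) components;
    altitude is held fixed at H_fix, so it is not integrated here.
    """
    dvel = np.clip(dvel, -DV_MAX, DV_MAX)     # enforce |Δv| ≤ Δv_max
    vel = np.clip(vel + dvel, -V_MAX, V_MAX)  # enforce |v| ≤ v_max
    pos = pos + vel * DT                      # Euler integration of position
    return pos, vel
```

The safety constraints $D_t^i > D_{\text{min}}$ and $l_i > l_{\text{min}}$ would be checked separately against the simulator's distance queries rather than inside this kinematic update.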
The proposed method centers on the MAPPO algorithm, which extends proximal policy optimization (PPO) to multi-agent settings. MAPPO is based on an actor-critic architecture, where each fire UAV has an actor network that outputs actions based on local observations, and a centralized critic network that evaluates joint actions using global information. This CTDE framework allows for stable training while enabling decentralized execution. The objective function for MAPPO is derived from the clipped surrogate objective of PPO, adapted for multiple agents. For a set of $N$ fire UAVs with shared parameters $\theta$, the policy loss $L(\theta)$ is defined as:
$$L(\theta) = \frac{1}{B n} \sum_{i=1}^{B} \sum_{k=1}^{n} \min\left( r_{k}^{\theta,i} A_{k}^{i}, \text{clip}(r_{k}^{\theta,i}, 1-\epsilon, 1+\epsilon) A_{k}^{i} \right),$$
where $B$ is the batch size, $n$ is the number of fire UAVs, $r_{k}^{\theta,i} = \frac{\pi_{\theta}(a_{k}^{i} | s_{k}^{i})}{\pi_{\theta_{\text{old}}}(a_{k}^{i} | s_{k}^{i})}$ is the probability ratio between the current and old policies, $A_{k}^{i}$ is the advantage estimate for fire UAV $k$ in batch sample $i$, computed using generalized advantage estimation (GAE), and $\epsilon$ is a clipping hyperparameter (typically 0.2). The critic network, parameterized by $\phi$, minimizes the value loss $L(\phi)$:
$$L(\phi) = \frac{1}{B n} \sum_{i=1}^{B} \sum_{k=1}^{n} \max\left[ \left(V_{\phi}(s_{k}^{i}) - \hat{R}_{i}\right)^{2}, \left( \text{clip}\left(V_{\phi}(s_{k}^{i}), V_{\phi_{\text{old}}}(s_{k}^{i}) - \epsilon, V_{\phi_{\text{old}}}(s_{k}^{i}) + \epsilon\right) - \hat{R}_{i} \right)^{2} \right],$$
where $\hat{R}_{i}$ is the discounted return. To handle high-dimensional visual observations, I incorporate a β-VAE, which is a variant of variational autoencoders designed to learn disentangled latent representations. The β-VAE is trained separately on RGB or depth images from the environment, compressing them into a latent code of length 30. This code captures essential features like obstacle edges or fire signatures, reducing the input dimensionality for the MAPPO algorithm and accelerating convergence. The β-VAE objective includes a KL divergence term weighted by $\beta$ to encourage disentanglement:
$$\mathcal{L}_{\beta\text{-VAE}} = \mathbb{E}_{q(z|x)}[\log p(x|z)] - \beta \, D_{\text{KL}}(q(z|x) \| p(z)),$$
where $x$ is the input image, $z$ is the latent variable, $q(z|x)$ is the encoder, $p(x|z)$ is the decoder, and $p(z)$ is a prior (usually Gaussian). After pre-training, the encoder extracts latent codes that are concatenated with other observations (e.g., positions) to form the full state for each fire UAV.
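To make the objective concrete, here is a minimal numpy sketch of the negated β-VAE objective (the loss to minimize) for a diagonal-Gaussian encoder and a standard-normal prior. The function name `beta_vae_loss` and the squared-error reconstruction term (a Gaussian log-likelihood up to a constant) are illustrative assumptions; $\beta = 5$ and the latent length of 30 are taken from the text:

```python
import numpy as np

BETA = 5.0       # KL weight β, from the hyperparameter table
LATENT_DIM = 30  # latent code length fed to the policy network

def beta_vae_loss(x, x_recon, mu, logvar, beta=BETA):
    """Negated β-VAE objective for one image, to be minimized.

    mu and logvar parameterize the diagonal-Gaussian encoder q(z|x);
    with a standard-normal prior p(z), the KL term has a closed form.
    """
    # -E_q[log p(x|z)] up to a constant: squared reconstruction error.
    recon = np.sum((x - x_recon) ** 2)
    # Closed-form KL( N(mu, diag(exp(logvar))) || N(0, I) ).
    kl = 0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar)
    return recon + beta * kl
```

In the full pipeline, `mu` and `logvar` would come from the convolutional encoder; after pre-training, only the encoder is kept, producing the 30-dimensional latent codes that are concatenated with position data.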
A critical aspect of reinforcement learning is reward design, especially in sparse-reward environments like firefighting. To guide fire UAVs effectively, I design a composite reward function $R_t^i$ for each fire UAV $i$ at time $t$, consisting of three components: mission completion, distance reduction, and obstacle avoidance. First, the mission reward $R_1^i$ incentivizes timely arrival at the fire zone, based on the “golden three minutes” concept in firefighting:
$$R_1^i =
\begin{cases}
\frac{240}{t}, & 0 < t \leq 30, \\
\frac{180}{t}, & 30 < t \leq 180, \\
-1, & t > 180,
\end{cases}$$
where $t$ is the time step since the mission started. This encourages fire UAVs to reach the fire quickly, with higher rewards for earlier arrival. Second, the distance reward $R_2^i$ promotes continuous progress toward the fire zone:
$$R_2^i = \frac{|D_{t-1}^i - D_t^i|}{D_0^i},$$
subject to constraints that the fire UAV remains outside the fire zone until completion: $x_t^i > x + \frac{1}{2}L$ or $x_t^i < x – \frac{1}{2}L$, and similarly for $y$ and $z$. Here, $D_t^i$ is the Euclidean distance to the fire center, and $D_0^i$ is the initial distance. This reward normalizes progress to account for varying starting points. Third, the safety reward $R_3^i$ prevents collisions with obstacles:
$$R_3^i =
\begin{cases}
m_t^i, & \text{no collision}, \\
-1, & \text{collision},
\end{cases}$$
where $m_t^i$ is the normalized mean depth value from the fire UAV’s depth map, indicating proximity to obstacles. The total reward for fire UAV $i$ is a weighted sum:
$$R_t^i = \alpha R_1^i + \sigma R_2^i + \delta R_3^i,$$
with weights $\alpha = 1$, $\sigma = 5$, and $\delta = 1$ to emphasize distance reduction while balancing mission and safety. This reward structure provides dense feedback, helping fire UAVs learn efficient policies despite the sparse nature of the ultimate goal (extinguishing fires).
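Putting the three components together, the composite reward can be sketched in plain Python with the weights from the text. The function names are mine, and details such as gating the mission term on actual arrival follow the prose above:

```python
ALPHA, SIGMA, DELTA = 1.0, 5.0, 1.0  # reward weights α, σ, δ

def mission_reward(t):
    """R1: time-based arrival reward ("golden three minutes")."""
    if 0 < t <= 30:
        return 240.0 / t
    if 30 < t <= 180:
        return 180.0 / t
    return -1.0

def distance_reward(d_prev, d_now, d_init):
    """R2: per-step change in distance to the fire center, normalized
    by the initial distance to account for varying starting points."""
    return abs(d_prev - d_now) / d_init

def safety_reward(mean_depth, collided):
    """R3: normalized mean depth as a clearance signal, -1 on collision."""
    return -1.0 if collided else mean_depth

def total_reward(t, d_prev, d_now, d_init, mean_depth, collided):
    """Weighted sum R_t^i = α R1 + σ R2 + δ R3."""
    return (ALPHA * mission_reward(t)
            + SIGMA * distance_reward(d_prev, d_now, d_init)
            + DELTA * safety_reward(mean_depth, collided))
```

For example, a fire UAV arriving at step 20, having closed 10 m of an initial 200 m gap this step with a clear depth map (mean 0.7), would receive $1 \cdot 12 + 5 \cdot 0.05 + 1 \cdot 0.7 = 12.95$.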
For evaluation, I developed a high-fidelity simulation environment using AirSim and UrbanScene3D, which models a large urban area of approximately 7.4 km² with realistic buildings, streets, and fire effects. The environment supports multiple fire UAVs with physics-based dynamics and sensor simulations (e.g., cameras for RGB and depth images). Fire zones are represented as volumetric regions with particle effects for flames and smoke, requiring fire UAVs to navigate through narrow passages and avoid dynamic obstacles. The simulation runs at a time step of $\Delta t = 0.1$ seconds, and fire UAVs have a maximum linear speed of 20 m/s and maximum linear acceleration of 2 m/s². I conducted experiments with $N = 2$ and $N = 3$ fire UAVs to assess scalability and cooperation. The MAPPO algorithm, integrated with β-VAE, was compared against two baselines: MAPPO without β-VAE and MADDPG with β-VAE. All algorithms used the same network architecture: three fully connected layers with 512 units each, ReLU activations, and a recurrent layer for handling partial observability. Key hyperparameters are summarized in the table below.
| Parameter | Value |
|---|---|
| Total time steps | 1,500,000 |
| Discount factor $\gamma$ | 0.99 |
| Episode length | 1800 |
| Time step $\Delta t$ | 0.1 s |
| Learning rate | 5 × 10⁻⁴ |
| Number of fully connected layers | 3 |
| Layer dimension | 512 |
| Reward weights $\alpha, \sigma$ | 1, 5 |
| β-VAE weight $\beta$ | 5 |
| Buffer size (MADDPG) | 5000 |
| Target network update rate (MADDPG) | 0.001 |
The training process involved pre-training the β-VAE for 20,000 steps on a dataset of environment images, after which the latent codes were fed into the MAPPO networks. For MAPPO, fire UAVs collected trajectories of 1800 steps per episode and updated their policies after each batch with the Adam optimizer. In the case of $N=2$ fire UAVs, the starting points were set at coordinates (0, 0) and the fire zone at (-224, 11), with a success condition requiring both fire UAVs to reach within 40 m of the fire center while maintaining a minimum separation of 10 m. The learned policies enabled the fire UAVs to navigate around buildings and converge on the target: fire UAV 1 reached (-187.20, 7.62) and fire UAV 2 reached (-188.75, 22.46), at distances of 37.07 m and 36.95 m from the fire, respectively, with an inter-UAV distance of 14.92 m. For $N=3$, the starting points were similar, and all three fire UAVs successfully reached the fire zone at (-115, -55), ending at coordinates (-110.35, -21.61), (-100.75, -25.98), and (-125.95, -17.19) while meeting the distance and separation constraints. These results demonstrate that MAPPO with β-VAE can solve complex multi-fire-UAV path planning in 3D environments.
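The per-batch policy and value updates use the clipped losses defined earlier. The numpy sketch below shows both loss computations on arrays of shape $(B, n)$, indexed by batch sample and agent; the function names and log-probability inputs are my assumptions, and a real implementation would use an autodiff framework rather than computing these by hand:

```python
import numpy as np

EPS = 0.2  # clipping hyperparameter ε

def mappo_policy_loss(logp_new, logp_old, adv):
    """Clipped surrogate L(θ), averaged over B samples and n agents.

    logp_new, logp_old, adv: arrays of shape (B, n).
    Returns the negated objective, suitable for gradient descent.
    """
    ratio = np.exp(logp_new - logp_old)              # r_k^{θ,i}
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1 - EPS, 1 + EPS) * adv
    return -np.mean(np.minimum(unclipped, clipped))

def mappo_value_loss(v_new, v_old, returns):
    """Clipped value loss L(φ): elementwise max of the unclipped and
    clipped squared errors against the discounted returns."""
    v_clip = v_old + np.clip(v_new - v_old, -EPS, EPS)
    return np.mean(np.maximum((v_new - returns) ** 2,
                              (v_clip - returns) ** 2))
```

With a ratio of 2 and a positive advantage of 1, the clipped branch caps the contribution at $1.2$, which is what stabilizes the on-policy updates over long 1800-step episodes.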
To quantify performance, I plotted average return curves over training steps for the three algorithms: MAPPO with β-VAE, MAPPO without β-VAE, and MADDPG with β-VAE. The curves reveal that MAPPO with β-VAE achieved the highest asymptotic reward, converging around 900,000 steps with an average return significantly above the others. In contrast, MAPPO without β-VAE failed to converge within the 1.5 million steps, highlighting the importance of visual feature extraction. MADDPG with β-VAE converged faster (around 700,000 steps) but plateaued at a lower reward, indicating suboptimal policies for firefighting tasks. The reward components were analyzed to understand fire UAV behavior: the distance reward $R_2^i$ increased steadily as fire UAVs approached the fire, while the safety reward $R_3^i$ remained positive due to effective obstacle avoidance. The mission reward $R_1^i$ spiked upon successful arrival, but its sparse nature required the dense incentives from $R_2^i$ and $R_3^i$ for stable learning. These findings underscore the efficacy of the reward design and the synergy between β-VAE and MAPPO in handling partial observability.
Several ablation studies were conducted to validate design choices. For instance, removing the β-VAE and using raw images directly led to unstable training and poor convergence, as the high-dimensional input overwhelmed the policy networks. Similarly, replacing MAPPO with independent PPO (where each fire UAV learns separately) resulted in uncoordinated behaviors and frequent collisions, emphasizing the need for centralized critics in multi-fire-UAV settings. The choice of $\beta = 5$ in β-VAE was optimized through grid search, balancing reconstruction accuracy and latent disentanglement; lower values caused blurry features, while higher values degraded reconstruction. Additionally, the reward weights were tuned via trial and error, with $\sigma = 5$ proving critical for encouraging rapid movement toward the fire zone. The simulation environment also allowed testing under varying conditions, such as dynamic obstacles (e.g., moving vehicles) or multiple fire zones, where the proposed approach maintained robust performance due to the generalization capability of deep reinforcement learning.
In conclusion, I have presented a comprehensive framework for urban high-rise firefighting using cooperative fire UAVs based on multi-agent reinforcement learning. By formulating the problem as a POMDP and developing a MAPPO algorithm enhanced with β-VAE, I enable fire UAVs to learn efficient navigation and collaboration strategies from visual and communication data. The reward function combines mission completion, distance reduction, and safety to provide dense guidance in a sparse-reward setting. Simulations in a large-scale urban environment demonstrate that the proposed method outperforms baseline algorithms such as MADDPG in terms of convergence speed and final performance, with fire UAVs successfully reaching fire zones while avoiding obstacles. Future work could explore integrating simultaneous localization and mapping (SLAM) techniques for better state estimation, extending to heterogeneous fire UAV teams (e.g., with different sensors or capabilities), and testing in real-world scenarios with physical drones. The proposed approach holds promise for enhancing emergency response in complex urban settings, leveraging the agility and autonomy of fire UAVs to save lives and property.
The potential applications of this technology extend beyond firefighting to other disaster response tasks, such as search and rescue or hazardous material handling. As fire UAVs become more advanced, their integration with AI and reinforcement learning will likely revolutionize how we manage urban emergencies. Continued research in multi-agent systems, computer vision, and simulation-to-real transfer will be key to deploying these systems safely and effectively. Through this work, I aim to contribute to the growing body of knowledge on intelligent fire UAVs, fostering innovation in autonomous systems for public safety.
