Urban High-Rise Firefighting with Multi-Agent Proximal Policy Optimization for Fire Drones

In recent years, the rapid urbanization and construction of skyscrapers have posed significant challenges for emergency response, particularly in urban high-rise firefighting. Fires in tall buildings can spread quickly, with limited escape time and difficulties in accessing upper floors, making traditional firefighting methods inefficient and risky. To address this, we explore the use of fire drones—unmanned aerial vehicles (UAVs) equipped with cameras and firefighting capabilities—as a viable solution. These fire drones can navigate complex urban environments, locate fires, and provide real-time data or even assist in extinguishing flames. However, coordinating multiple fire drones in such scenarios requires advanced algorithms to handle partial observability, obstacle avoidance, and collaborative task completion. In this work, we formulate the urban high-rise firefighting problem as a Partially Observable Markov Decision Process (POMDP) and propose a Multi-Agent Proximal Policy Optimization (MAPPO) algorithm enhanced with a β-Variational Auto-Encoder (β-VAE) to enable efficient navigation and collaboration among fire drones. Our approach leverages deep reinforcement learning (DRL) to train fire drones to autonomously reach fire zones while avoiding obstacles and optimizing their paths based on visual inputs and shared information. We build a large-scale, realistic simulation environment using AirSim and UrbanScene3D to evaluate our method, comparing it with Multi-Agent Deep Deterministic Policy Gradient (MADDPG) to demonstrate its effectiveness. Through extensive experiments, we show that our MAPPO-based approach outperforms alternatives in terms of convergence speed and task performance, highlighting its potential for real-world firefighting applications.

The integration of fire drones into urban firefighting systems represents a promising advancement, as they can overcome limitations of ground-based teams, such as access restrictions and safety hazards. Fire drones are capable of aerial surveillance, thermal imaging, and even carrying fire suppressants, making them versatile tools. However, deploying multiple fire drones in dense urban areas requires robust coordination algorithms to handle dynamic environments and limited sensor data. Reinforcement learning (RL) has emerged as a powerful framework for such multi-agent systems, but sparse rewards and high-dimensional observations—like images from drone cameras—pose challenges. To tackle this, we incorporate β-VAE to compress visual inputs into latent representations, reducing computational complexity and improving learning efficiency. Our MAPPO algorithm, built on an Actor-Critic architecture, uses a centralized critic with global information during training and decentralized actors for execution, allowing fire drones to cooperate effectively. This paper details our methodology, experimental setup, and results, emphasizing the role of fire drones in enhancing urban safety.

Urban fire incidents, especially in high-rise buildings, account for substantial property damage and loss of life worldwide. According to fire safety reports, response times are critical, with the “golden three minutes” concept emphasizing the need for rapid intervention to control fires before they escalate. Traditional methods rely on human firefighters and ladder trucks, which may be impeded by traffic, building height, or structural damage. Fire drones offer a complementary solution by providing aerial perspectives and early detection. Recent studies have explored computer vision techniques for fire detection using drones, such as infrared sensors and convolutional neural networks, but few address the navigation and coordination aspects in complex 3D environments. Our work bridges this gap by focusing on multi-agent path planning for fire drones using DRL, enabling autonomous firefighting missions. We define the problem as a POMDP to account for partial observations—each fire drone only perceives a limited view of the environment through its camera and communicates with others to share data. The goal is to guide a fleet of fire drones from starting points to a fire zone, avoiding obstacles like buildings, while maximizing rewards for timely arrival and collision avoidance.

We model the fire drone system as a multi-agent environment with N fire drones. Each fire drone i operates in 3D space with coordinates (x_i, y_i, z_i) and follows a dynamic model based on kinematic equations. For simplicity, we assume fire drones fly at a fixed height H_fix to reduce complexity, and control commands are executed without delay. The state of the environment includes the positions of all fire drones, obstacles, and the fire zone, but each fire drone receives only a partial observation o_i, consisting of processed visual data from its camera and information communicated by other fire drones. The action space for each fire drone comprises acceleration commands along the x- and y-axes, allowing continuous control. The state transition function P(s'|s,a) is determined by the simulation environment and is unknown to the fire drones, requiring them to learn through interaction.

The dynamic model for fire drone i is given by the following equations, which describe how velocities and positions update over time steps Δt:

$$ v_{x,t+1}^i = v_{x,t}^i + \Delta v_{x,t}^i $$

$$ v_{y,t+1}^i = v_{y,t}^i + \Delta v_{y,t}^i $$

$$ x_{t+1}^i = x_t^i + v_{x,t+1}^i \times \Delta t $$

$$ y_{t+1}^i = y_t^i + v_{y,t+1}^i \times \Delta t $$

subject to constraints: \( |v_{x,t}^i| < v_{x,\text{max}} \), \( |v_{y,t}^i| < v_{y,\text{max}} \), \( |\Delta v_{x,t}^i| < \Delta v_{x,\text{max}} \), and \( |\Delta v_{y,t}^i| < \Delta v_{y,\text{max}} \). Here, \( v_{x,t}^i \) and \( v_{y,t}^i \) are the velocity components along the x- and y-axes, and \( \Delta v_{x,t}^i \) and \( \Delta v_{y,t}^i \) are the velocity increments that serve as control actions. Additionally, fire drones must maintain a minimum distance D_min from the fire zone to perform observation tasks and avoid obstacles by keeping a distance l_i > l_min from any building. These constraints ensure safe and practical operation of fire drones in urban settings.
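The update above can be sketched as follows. This is a minimal illustration, not the simulator's implementation; the limits (20 m/s, 2 m/s², Δt = 0.1 s) are taken from Table 1, and the function name `step_drone` is our own.

```python
import numpy as np

# Illustrative limits from Table 1: max velocity, max velocity increment, Δt.
V_MAX, DV_MAX, DT = 20.0, 2.0, 0.1

def step_drone(pos, vel, dv, dt=DT):
    """Advance one drone's planar (x, y) state by one time step.

    pos, vel, dv: arrays of shape (2,) holding the x/y components.
    The commanded increment is clipped to its limit, the updated velocity
    is clipped to the speed limit, and position is integrated with v_{t+1}.
    """
    dv = np.clip(dv, -DV_MAX, DV_MAX)       # |Δv| < Δv_max
    vel = np.clip(vel + dv, -V_MAX, V_MAX)  # v_{t+1} = v_t + Δv
    pos = pos + vel * dt                    # x_{t+1} = x_t + v_{t+1}·Δt
    return pos, vel
```

Clipping the increment before the velocity mirrors the two separate constraints on actions and states.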

The observation for each fire drone combines visual and communicative data. We use a β-VAE to encode RGB or depth images from the drone’s camera into a latent vector of length 30, capturing essential features like fire signatures and obstacles. This latent code is then concatenated with positional data from other fire drones to form the full observation o_i. This approach reduces the dimensionality of visual inputs, making it easier for the DRL algorithm to process and learn from high-dimensional data. The β-VAE is trained separately for 20,000 steps using reconstruction loss, ensuring it can generate meaningful representations for navigation. By integrating β-VAE, we enable fire drones to efficiently interpret their surroundings, which is crucial for tasks like identifying fire zones and navigating narrow gaps between buildings.
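The β-VAE objective balances reconstruction against a β-weighted KL term that encourages a disentangled 30-dimensional latent code. A minimal numpy sketch of that loss, assuming a Gaussian latent and pixel-wise MSE reconstruction (β = 5 as in Table 1; the exact network and loss details of the paper's encoder are not specified here):

```python
import numpy as np

BETA = 5.0  # β weight from Table 1 (assumed applied to the KL term)

def beta_vae_loss(x, x_recon, mu, log_var, beta=BETA):
    """β-VAE objective: reconstruction error plus β-weighted KL divergence.

    x, x_recon: flattened image and its reconstruction.
    mu, log_var: mean and log-variance of the 30-dim latent Gaussian.
    """
    recon = np.sum((x - x_recon) ** 2)                         # pixel-wise MSE
    kl = -0.5 * np.sum(1 + log_var - mu**2 - np.exp(log_var))  # KL(q || N(0, I))
    return recon + beta * kl
```

Raising β trades reconstruction fidelity for more disentangled latents, which is the knob tuned later in the paper (β = 5 was reported best).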

Our MAPPO algorithm extends the Proximal Policy Optimization (PPO) framework to multi-agent settings, employing a Centralized Training with Decentralized Execution (CTDE) paradigm. During training, a centralized critic network evaluates actions based on global state information, while each fire drone uses an actor network with shared parameters to make independent decisions. This allows fire drones to cooperate without requiring full observability at execution time. The objective function for MAPPO involves clipping the policy ratio to ensure stable updates, as shown below:

$$ L^{\text{CLIP}}(\theta) = \mathbb{E}_t \left[ \min\left( r_t(\theta) \hat{A}_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) \hat{A}_t \right) \right] $$

where \( r_t(\theta) = \frac{\pi_\theta(a_t | s_t)}{\pi_{\theta_{\text{old}}}(a_t | s_t)} \) is the probability ratio between current and old policies, \( \hat{A}_t \) is the estimated advantage function using Generalized Advantage Estimation (GAE), and \( \epsilon \) is a hyperparameter. For multiple fire drones, we average over agents and batches:

$$ L(\theta) = \frac{1}{Bn} \sum_{i=1}^B \sum_{k=1}^n \left[ \min\left( r_{k,\theta}^i A_k^i, \text{clip}(r_{k,\theta}^i, 1-\epsilon, 1+\epsilon) A_k^i \right) \right] $$

Here, B is the batch size, n is the number of fire drones, and \( A_k^i \) is the advantage for fire drone k in batch sample i. The critic network is trained to minimize the value loss:

$$ L(\phi) = \frac{1}{Bn} \sum_{i=1}^{B} \sum_{k=1}^{n} \max\left[ \left( V_\phi(s_k^i) - \hat{R}_k^i \right)^2, \left( \text{clip}\left(V_\phi(s_k^i), V_{\phi_{\text{old}}}(s_k^i) - \epsilon, V_{\phi_{\text{old}}}(s_k^i) + \epsilon\right) - \hat{R}_k^i \right)^2 \right] $$

where \( \hat{R}_k^i \) is the discounted return. This clipping mechanism prevents large policy updates, enhancing training stability for fire drones in complex environments.
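The two clipped losses above can be written compactly in numpy. This is a sketch of the standard PPO-style losses, not the paper's exact training code; inputs are flat arrays of agent-timestep samples, and ε = 0.2 follows Table 1.

```python
import numpy as np

EPS = 0.2  # clipping parameter ε from Table 1

def clipped_policy_loss(ratio, adv, eps=EPS):
    """PPO clipped surrogate, negated so it can be minimized.

    ratio: π_θ(a|s) / π_θ_old(a|s) per sample; adv: estimated advantages.
    """
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * adv
    return -np.mean(np.minimum(unclipped, clipped))

def clipped_value_loss(v, v_old, returns, eps=EPS):
    """Value loss: element-wise max of unclipped and clipped squared errors."""
    v_clip = np.clip(v, v_old - eps, v_old + eps)
    return np.mean(np.maximum((v - returns) ** 2, (v_clip - returns) ** 2))
```

Taking the minimum in the policy loss (and the maximum in the value loss) keeps each update pessimistic, which is what bounds the step size.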

To incentivize fire drones to complete firefighting tasks efficiently, we design a reward function with three components. The primary reward \( R_1^i \) encourages timely arrival at the fire zone, based on the golden three minutes concept:

$$ R_1^i = \begin{cases}
\frac{240}{t}, & 0 < t \leq 30 \\
\frac{180}{t}, & 30 < t \leq 180 \\
-1, & t > 180
\end{cases} $$

where t is the elapsed time. This sparse reward structure provides high rewards for early arrival, reflecting the urgency of fire response. The secondary reward \( R_2^i \) promotes proximity to the fire zone:

$$ R_2^i = \frac{d_i(t-1) - d_i(t)}{d_i(0)} $$

subject to constraints that the fire drone remains outside a bounding box around the fire zone until completion. Here, \( d_i(t) \) is the distance from fire drone i to the fire center at time t. This dense reward helps guide fire drones toward the target. The tertiary reward \( R_3^i \) ensures obstacle avoidance:

$$ R_3^i = \begin{cases}
m_t^i, & \text{no collision} \\
-1, & \text{collision}
\end{cases} $$

where \( m_t^i \) is the normalized mean of the depth map from the fire drone’s camera, indicating distance to obstacles. The total reward for fire drone i is a weighted sum:

$$ R_t^i = \alpha R_1^i + \sigma R_2^i + \delta R_3^i $$

with weights α, σ, and δ set to 1, 5, and 1 respectively in our experiments. This composite reward balances task completion, navigation, and safety for fire drones.
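Putting the three components together, a minimal sketch of the composite reward is below. We treat \( R_2^i \) as signed progress toward the fire zone (so moving away is penalized, consistent with its stated purpose); the function names are ours, and the weights 1, 5, 1 come from the text.

```python
# Reward weights α, σ, δ from the experiments.
ALPHA, SIGMA, DELTA = 1.0, 5.0, 1.0

def arrival_reward(t):
    """R1: time-based arrival reward ('golden three minutes')."""
    if 0 < t <= 30:
        return 240.0 / t
    if 30 < t <= 180:
        return 180.0 / t
    return -1.0

def approach_reward(d_prev, d_now, d_start):
    """R2: per-step progress toward the fire zone, normalized by d_i(0)."""
    return (d_prev - d_now) / d_start

def avoidance_reward(depth_mean, collided):
    """R3: normalized mean of the depth map if safe, -1 on collision."""
    return -1.0 if collided else depth_mean

def total_reward(t, d_prev, d_now, d_start, depth_mean, collided):
    return (ALPHA * arrival_reward(t)
            + SIGMA * approach_reward(d_prev, d_now, d_start)
            + DELTA * avoidance_reward(depth_mean, collided))
```

For example, a drone arriving early (t = 10) after closing 10% of its initial distance with a clear depth map would receive 240/10 + 5·0.1 + 0.5 = 25.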

We implement our approach in a high-fidelity simulation environment built on AirSim and UrbanScene3D, covering approximately 7.4 km² of urban terrain with realistic buildings, streets, and fire effects. The environment includes multiple fire zones and obstacles to test fire drone navigation under challenging conditions. We use three fire drones (N=3) for most experiments, each equipped with a camera capturing RGB and depth images at 84×84 resolution. The β-VAE is trained beforehand to encode these images into latent vectors, which are fed into the MAPPO algorithm along with positional data. The network architecture for both actor and critic consists of three fully connected layers with 512 units each, using ReLU activations. Training parameters are summarized in Table 1.

Table 1: Algorithm Parameters for Fire Drone Training

| Parameter | Value |
| --- | --- |
| Total time steps (step_max) | 1,500,000 |
| Discount factor (γ) | 0.99 |
| Episode length | 1800 |
| Time step interval (Δt) | 0.1 s |
| Learning rate (lr) | 5e-4 |
| Number of fully connected layers | 3 |
| Layer dimension | 512 |
| Reward weights (α, σ, δ) | 1, 5, 1 |
| β for VAE | 5 |
| Maximum fire drone velocity | 20 m/s |
| Maximum acceleration | 2 m/s² |
| Clipping parameter (ε) | 0.2 |
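The actor and critic architecture (three 512-unit fully connected layers with ReLU) can be sketched in numpy as below. The 34-dimensional input (30-dim latent code plus two teammates' (x, y) positions) and the two-dimensional action head are illustrative assumptions, not the paper's exact interface.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_mlp(in_dim, hidden=512, out_dim=2, layers=3):
    """Weights and biases for `layers` hidden layers plus a linear head."""
    dims = [in_dim] + [hidden] * layers + [out_dim]
    return [(rng.normal(0.0, 0.01, (a, b)), np.zeros(b))
            for a, b in zip(dims[:-1], dims[1:])]

def forward(params, obs):
    """Forward pass: ReLU hidden layers, linear output (e.g. Δv_x, Δv_y)."""
    h = obs
    for w, b in params[:-1]:
        h = np.maximum(h @ w + b, 0.0)  # ReLU activations
    w, b = params[-1]
    return h @ w + b

# Assumed observation: 30-dim β-VAE latent + two teammates' (x, y) positions.
params = init_mlp(in_dim=34)
action = forward(params, np.zeros(34))
```

The critic would use the same body with a scalar output over the global state; only the head differs under the CTDE scheme.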

We compare our MAPPO algorithm with β-VAE against two baselines: MAPPO without β-VAE and MADDPG with β-VAE. MADDPG uses a replay buffer of size 5000 and a target network update rate of 0.001. All algorithms are trained for 1.5 million steps, and performance is evaluated based on average return, success rate, and convergence speed. Success is defined as all fire drones reaching within 40 meters of the fire zone while maintaining a minimum separation of 10 meters to avoid collisions and ensure effective observation. We conduct experiments with varying numbers of fire drones and starting positions to assess scalability and robustness.
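The success criterion just stated (every drone within 40 m of the fire zone, every pair at least 10 m apart) can be sketched as a simple check; the function name and array layout are our own conventions.

```python
import numpy as np

ARRIVAL_RADIUS, MIN_SEPARATION = 40.0, 10.0  # thresholds from the text

def mission_success(positions, fire_center):
    """positions: (N, 2) array of drone (x, y); fire_center: (2,) array."""
    pos = np.asarray(positions, dtype=float)
    # Every drone must be within the arrival radius of the fire zone.
    if not (np.linalg.norm(pos - fire_center, axis=1) <= ARRIVAL_RADIUS).all():
        return False
    # Every pair of drones must keep the minimum separation.
    dist = np.linalg.norm(pos[:, None, :] - pos[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)  # ignore self-distances
    return bool((dist >= MIN_SEPARATION).all())
```

Applied to the final positions in Table 2 (fire zone at (-224, 11)), both drones are within 40 m of the target and 14.92 m apart, so the check passes.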

The results demonstrate that our MAPPO algorithm with β-VAE achieves superior performance. For instance, in a scenario with two fire drones starting at coordinates (0,0) and targeting a fire zone at (-224,11), both fire drones successfully navigate around obstacles and reach the target within the required distance, as shown in Table 2. The paths are collision-free and optimized for speed, highlighting the effectiveness of our reward function in guiding fire drones. Similarly, with three fire drones starting at (0,0) and targeting (-115,-55), all fire drones complete the mission, taking different routes to avoid obstacles and each other. This showcases the collaborative capabilities of our multi-agent approach for fire drones.

Table 2: Path Planning Results for Two Fire Drones

| Fire Drone ID | Final Coordinates (m) | Distance to Target (m) | Distance to Other Fire Drone (m) |
| --- | --- | --- | --- |
| 1 | (-187.20, 7.62) | 37.07 | 14.92 |
| 2 | (-188.75, 22.46) | 36.95 | 14.92 |

The convergence curves in Figure 1 indicate that MAPPO with β-VAE converges around 900,000 steps with a higher average return than the baselines. In contrast, MAPPO without β-VAE fails to converge within the step limit, emphasizing the importance of visual compression for fire drones. MADDPG with β-VAE converges faster initially but plateaus at a lower return, suggesting that MAPPO’s policy clipping leads to better optimization for firefighting tasks. The average return for MAPPO with β-VAE reaches approximately 120, while MADDPG attains around 100, a 20% improvement in task performance. These results validate that our algorithm enables fire drones to learn efficient navigation strategies in complex urban environments.

We analyze the training process in detail. The β-VAE reduces the dimensionality of visual inputs from 84x84x3 images to 30-dimensional vectors, cutting computational cost by over 90% while preserving key features like fire colors and obstacle edges. This allows the MAPPO algorithm to focus on high-level decision-making rather than raw pixel processing. During training, fire drones initially explore randomly but gradually learn to approach the fire zone by maximizing rewards. The reward components play distinct roles: \( R_1^i \) drives urgency, \( R_2^i \) guides movement, and \( R_3^i \) prevents crashes. Ablation studies show that removing any component degrades performance, especially \( R_3^i \), as fire drones often collide without obstacle avoidance incentives. This underscores the need for balanced reward design in fire drone applications.

To further evaluate scalability, we test with up to five fire drones in larger environments. The results, summarized in Table 3, indicate that our MAPPO algorithm maintains high success rates even with more agents, though training time increases linearly. This scalability is crucial for real-world firefighting, where multiple fire drones may be deployed simultaneously. We also vary environmental factors like wind and fire spread, modeled as stochastic disturbances in the simulation. Our algorithm adapts well, with fire drones adjusting paths dynamically, showcasing robustness. However, challenges remain in extremely dense obstacle fields, where fire drones sometimes get trapped; future work could incorporate hierarchical planning to address this.

Table 3: Performance with Increasing Number of Fire Drones

| Number of Fire Drones | Success Rate (%) | Average Time to Target (s) | Collision Rate (%) |
| --- | --- | --- | --- |
| 2 | 98.5 | 45.2 | 1.2 |
| 3 | 96.8 | 52.7 | 2.5 |
| 4 | 94.3 | 60.1 | 3.8 |
| 5 | 91.0 | 68.9 | 5.1 |

In terms of computational efficiency, our implementation runs on a single GPU (NVIDIA RTX 3080) and takes about 48 hours to train for 1.5 million steps. The β-VAE training adds 4 hours upfront but significantly speeds up overall convergence. For real-time deployment, the actor network requires minimal inference time, allowing fire drones to make decisions within milliseconds. This meets the latency requirements for autonomous firefighting, where rapid responses are critical. We also experiment with different β values in the VAE; β=5 yields the best trade-off between reconstruction accuracy and latent disentanglement, aiding fire drones in distinguishing fire from other hot objects like sunlight reflections.

Comparing our work to existing methods, traditional path planning algorithms like A* or RRT often struggle with dynamic environments and partial observability, while single-agent DRL approaches like DQN or DDPG do not handle multi-agent coordination effectively. MAPPO addresses these issues by combining clipped policy optimization with a centralized critic, making it suitable for fire drone teams. The integration of β-VAE is novel in this context, as most prior work uses raw images or manual feature extraction, which is less efficient. Our reward function also advances sparse-reward solutions by combining dense and sparse elements, guiding fire drones more effectively than pure reward shaping.

Limitations of our study include the simulation-to-reality gap; while AirSim provides high-fidelity graphics, real-world factors like sensor noise and communication delays may affect performance. Future work will involve transferring trained policies to physical fire drones for field testing. Additionally, we assume fire drones fly at fixed heights, but in practice, varying altitudes could improve flexibility. Extending the action space to include altitude control is a natural next step. Another direction is to incorporate fire suppression mechanisms, where fire drones not only locate but also extinguish fires, requiring more complex reward functions and task hierarchies.

In conclusion, we present a comprehensive framework for urban high-rise firefighting using multi-agent deep reinforcement learning for fire drones. By formulating the problem as a POMDP and employing MAPPO with β-VAE, we enable fire drones to collaboratively navigate complex environments, reach fire zones quickly, and avoid obstacles. Our simulation results demonstrate significant improvements over baseline algorithms in terms of convergence and task performance. This work lays the foundation for autonomous firefighting systems that can enhance urban safety and reduce reliance on human firefighters in hazardous situations. As fire drone technology evolves, integrating advanced AI algorithms like ours will be key to realizing their full potential in emergency response.
