Inspection tasks, whether for large-scale infrastructure or for aerial entertainment, demand high efficiency and adaptability. Traditional manual inspection methods often fall short, plagued by inefficiency and scalability issues. The advent of Unmanned Aerial Vehicle (UAV) clusters offers a transformative solution. This technology is pivotal not only for industrial applications like aircraft skin inspection but also for creating large-scale, dynamic aerial displays known as drone light shows. Coordinating a fleet of drones to cover specific areas or form intricate, moving patterns in the sky presents a significant computational challenge: how to optimally assign tasks (inspection points or display positions) to each drone and plan their flight paths to minimize completion time while respecting physical constraints like battery life and collision avoidance.
This mission planning problem for multiple UAVs is intrinsically linked to classic combinatorial optimization problems in operations research, particularly the Vehicle Routing Problem (VRP). In a drone light show, each drone can be seen as a “vehicle” that must visit a sequence of assigned “customer” points (specific positions in the formation) within its operational endurance. The goal is to orchestrate the entire fleet’s movements to achieve the synchronized display in the shortest possible time, which translates to minimizing the maximum flight distance or time taken by any single drone. Efficiently solving this problem is crucial for the seamless execution of complex drone light show choreography and for the practical deployment of UAVs in time-sensitive inspection scenarios.
While VRP and its variants have been extensively studied, traditional solution methods like exact algorithms or hand-crafted heuristics often struggle with the scale and real-time requirements of modern UAV applications. Exact methods become computationally intractable for large fleets, while heuristics, though faster, may yield suboptimal solutions and require significant domain expertise to design. Recently, Deep Reinforcement Learning (DRL) has emerged as a powerful paradigm for learning to solve such problems directly from data, offering a promising balance between solution quality and computational speed.
However, existing DRL approaches for VRP-like problems often employ generic encoder-decoder architectures that may not fully capture the dynamic state of a multi-agent system. For instance, in a drone light show, the “context” of the planning problem changes as drones are assigned to positions; some positions are already allocated, while others remain available. A static representation of the problem graph fails to reflect this evolution, potentially leading to poor decision-making. Furthermore, effectively modeling each drone’s individual state—its remaining battery, its current assigned path—within a shared policy network remains challenging.
To address these limitations, this work proposes a novel Two-Stage Deep Reinforcement Learning (TSDRL) framework specifically designed for multi-UAV mission planning. The core of our approach is a sophisticated policy network that dynamically adapts to the planning state. In the first stage, the model determines a suitable number of drones based on the task’s scale. The second stage, which is our primary contribution, features an encoder-decoder architecture with key innovations: 1) An enhanced encoder utilizing dual Batch Normalization and a gated aggregation module to produce higher-quality node embeddings; 2) A decoder equipped with a dedicated Drone Selection Module and a novel Graph Aggregation Module. The Graph Aggregation Module dynamically constructs separate embeddings for visited and unvisited nodes, allowing the policy to be acutely aware of the changing context as assignments are made, much like tracking which positions in a drone light show formation have already been taken. This enables more informed decision-making when selecting the next task for a chosen drone.

We formulate the UAV mission planning problem as a Capacitated VRP (CVRP) in three-dimensional space, where the “capacity” is the drone’s maximum flight range. Our TSDRL model is trained end-to-end using a policy gradient method. Extensive simulations on task sets of varying scales demonstrate that our proposed method significantly outperforms existing DRL baselines and is competitive with state-of-the-art heuristic solvers in terms of solution quality, while being orders of magnitude faster. The model also exhibits strong generalization ability across different problem instances and scales. The principles developed here for optimal task allocation and path planning are directly applicable to the core choreography challenge in a drone light show, where hundreds or thousands of drones must be routed through their keyframe positions efficiently and safely.
Problem Formulation and Modeling
The Multi-UAV Cooperative Mission Planning (MCMP) problem is formalized as follows. Consider a set of $n+1$ points $X = \{x_i\}_{i=0}^{n}$, where $x_0$ represents the common starting depot (e.g., a launch pad), and $X' = X \setminus \{x_0\}$ represents the $n$ task points that must be visited. In a drone light show, these would be the key positions in the aerial formation. Each point $x_i \in \mathbb{R}^4$ is defined by its 3D coordinates and a demand $d_i$. Unlike standard CVRP, the demand $d_i$ is dynamic: it represents the collision-free flight distance from the previous point in a drone’s path to $x_i$, computed using a path planner like RRT*: $d_i = \text{RRT*}(x_{i-1}, x_i)$. This accounts for the real-world geometry and obstacles.
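The dynamic demand computation can be sketched as follows. Here `plan_path_length` is a hypothetical stand-in that returns the straight-line 3D distance; a real deployment would query RRT* for the collision-free path length instead:

```python
import numpy as np

def plan_path_length(a, b):
    """Stand-in for an RRT* query. Returns the straight-line 3D distance;
    a real planner would return the collision-free path length."""
    return float(np.linalg.norm(np.asarray(a, float) - np.asarray(b, float)))

def dynamic_demands(route):
    """Demand d_i of each point: planned flight distance from its
    predecessor in the route (the depot x_0 precedes the first task point)."""
    return [plan_path_length(route[i - 1], route[i]) for i in range(1, len(route))]

route = [(0, 0, 0), (3, 4, 0), (3, 4, 12)]   # depot followed by two task points
print(dynamic_demands(route))                # [5.0, 12.0]
```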
A fleet of $U_{\text{num}}$ drones is available, each with an identical maximum flight capacity $Q$. The state of drone $v$ is denoted by $(O^v, G^v)$, where $O^v$ is its remaining flight range, and $G^v$ is the sequence of points it has visited. A binary decision variable $y_{ij}^v$ equals 1 if drone $v$ travels directly from $x_i$ to $x_j$, and 0 otherwise. The objective is to minimize the maximum distance traveled by any drone in the fleet, ensuring a balanced workload and timely completion—a critical metric for synchronizing a drone light show.
The mathematical model is given by:
$$
\min \ \max_{v \in V} \left( \sum_{i \in X} \sum_{j \in X} \text{RRT*}(x_i, x_j) \cdot y_{ij}^v \right)
$$
Subject to:
$$
\begin{aligned}
&\sum_{v \in V} \sum_{j \in X} y_{ij}^v = 1, &&\forall i \in X' \quad &\text{(Each point served once)} \\
&\sum_{i \in X} y_{ij}^v - \sum_{k \in X} y_{jk}^v = 0, &&\forall v \in V, j \in X' \quad &\text{(Flow conservation)} \\
&d_j y_{ij}^v \leq L_{ij}^v \leq (Q - d_j) y_{ij}^v, &&\forall v \in V, i,j \in X \quad &\text{(Capacity constraint)} \\
&y_{ij}^v \in \{0,1\}, L_{ij}^v \geq 0, d_i \geq 0
\end{aligned}
$$
Here, $L_{ij}^v$ is an auxiliary variable representing the remaining capacity of drone $v$ before servicing $j$ from $i$.
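For concreteness, the min-max objective can be evaluated for a candidate fleet assignment as below. The sketch assumes each route starts and ends at the depot and substitutes Euclidean distance for the RRT* path length:

```python
import numpy as np

def tour_length(depot, tour, dist):
    """Length of a single drone's route: depot -> points -> depot."""
    stops = [depot] + list(tour) + [depot]
    return sum(dist(stops[i], stops[i + 1]) for i in range(len(stops) - 1))

def makespan(depot, tours, dist):
    """The min-max objective scores one candidate solution by the
    maximum route length over the fleet."""
    return max(tour_length(depot, t, dist) for t in tours)

euclid = lambda a, b: float(np.linalg.norm(np.asarray(a, float) - np.asarray(b, float)))
depot = (0, 0, 0)
tours = [[(3, 4, 0)],                # drone 1: one point, route length 10
         [(0, 0, 5), (0, 0, 9)]]    # drone 2: two points, route length 18
print(makespan(depot, tours, euclid))   # 18.0
```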
Two-Stage Deep Reinforcement Learning Framework
We propose a TSDRL framework to solve the MCMP. The first stage estimates the required number of drones, $U_{\text{num}}$, for a given set of points. The second stage, which is our main focus, performs the detailed task allocation and routing using the determined fleet size.
Stage 1: Fleet Size Estimation
The first stage is modeled as a Markov Decision Process (MDP). The state $S_1$ includes global graph features and the current partial solution (the number of drones used and their paths). The action $a_{1,t}$ selects the next point to add to the current drone’s route. The policy $\pi_{\theta_1}$ is a neural network that outputs a probability distribution over unvisited points. The reward is the negative of the maximum tour length of the completed allocation. This stage is trained with the REINFORCE algorithm, using a rollout baseline to reduce variance.
The network architecture is an encoder-decoder based on the Transformer. The encoder processes node coordinates into embeddings $h_i^{(N)}$. The decoder constructs a context embedding from the graph summary $\bar{h}^{(N)}_g$, the last node’s embedding, and the current drone’s remaining capacity, then computes selection logits.
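A minimal sketch of the masked point-selection step, assuming the clipped scaled-dot-product scoring commonly used in attention-based VRP decoders (the names `clip`, `W_q`, and `W_k` are illustrative, not the paper's exact parameterization):

```python
import numpy as np

def selection_probs(node_emb, context, W_q, W_k, visited_mask, clip=10.0):
    """Score each point by clipped scaled dot-product attention between the
    context embedding and the node embeddings, mask out already-visited
    points, and normalize with a softmax."""
    q = context @ W_q                          # query from the context, (d,)
    k = node_emb @ W_k                         # keys from the nodes, (n, d)
    logits = clip * np.tanh(k @ q / np.sqrt(q.shape[-1]))
    logits[visited_mask] = -np.inf             # forbid revisiting served points
    p = np.exp(logits - logits.max())          # numerically stable softmax
    return p / p.sum()

rng = np.random.default_rng(0)
n, d = 5, 8
node_emb = rng.normal(size=(n, d))
context = rng.normal(size=d)
W_q, W_k = rng.normal(size=(d, d)), rng.normal(size=(d, d))
visited = np.array([True, False, False, True, False])
p = selection_probs(node_emb, context, W_q, W_k, visited)
print(np.isclose(p.sum(), 1.0), p[visited])   # True [0. 0.]
```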
Stage 2: Detailed Task Allocation and Path Planning
The second stage handles the core planning. Its MDP is defined by a state $S_2 = (V_t, X_t)$, representing the status of all drones $V_t$ and all points $X_t$ at time $t$. An action $a_{2,t} = (v^i_t, x^j_t)$ selects a drone $v^i$ and a point $x^j$ for it to service next. The goal is to learn a policy $\pi_{\theta_2}(a_{2,t}|S_2)$ that maximizes the expected reward, which is the negative maximum tour length.
Encoder with Enhanced Feature Representation
Our encoder builds upon the standard Transformer but incorporates two key enhancements to improve the quality of node embeddings $e^{\text{node}}$, which is crucial for understanding complex spatial arrangements like those in a drone light show.
1. Dual Batch Normalization (BN): We apply BN not only after the residual addition but also before the Multi-Head Attention (MHA) and Feed-Forward (FF) layers. This stabilizes the input distribution to these layers, leading to more stable and effective training.
2. Gated Aggregation Module: Instead of a simple residual addition, we use a gating mechanism to combine the input and output of the MHA and FF layers. This allows the network to dynamically control the flow of information. The gated output $G(e_{\text{in}}, e_{\text{out}})$ is computed as:
$$ G(e_{\text{in}}, e_{\text{out}}) = e_{\text{in}} + \sigma(W e_{\text{in}} + b) \odot e_{\text{out}} $$
where $\sigma$ is the sigmoid function, $W$ and $b$ are learnable parameters, and $\odot$ denotes element-wise multiplication. This results in richer and more expressive node embeddings.
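A framework-agnostic sketch of one enhanced encoder sublayer, combining both ideas. A toy feed-forward map stands in for the MHA/FF sublayers, and the batch normalization omits the learned affine parameters for brevity:

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    """Normalize each feature over the batch of nodes (learned affine
    parameters omitted for brevity)."""
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

def gated_add(e_in, e_out, W, b):
    """Gated aggregation: G(e_in, e_out) = e_in + sigmoid(e_in W + b) * e_out
    (row-vector convention)."""
    gate = 1.0 / (1.0 + np.exp(-(e_in @ W + b)))
    return e_in + gate * e_out

def enhanced_sublayer(e, sublayer, W, b):
    """Dual BN: normalize before the sublayer, combine input and output via
    the gate instead of a plain residual, then normalize again after the add."""
    e_out = sublayer(batch_norm(e))
    return batch_norm(gated_add(e, e_out, W, b))

rng = np.random.default_rng(1)
n, d = 6, 4
e = rng.normal(size=(n, d))
W_gate, b_gate = 0.1 * rng.normal(size=(d, d)), np.zeros(d)
W_ff = 0.1 * rng.normal(size=(d, d))
ff = lambda x: np.maximum(x @ W_ff, 0.0)     # toy stand-in for the FF sublayer
h = enhanced_sublayer(e, ff, W_gate, b_gate)
print(h.shape)   # (6, 4)
```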
Decoder with Dynamic Context Awareness
The decoder’s role is to sequentially select (drone, point) pairs. It consists of two novel modules:
1. Drone Selection Module: This module decides which drone should act next. It constructs a feature vector for each drone $v^j$, including its remaining capacity and the embedding of the last point it visited. All drone paths’ features are aggregated via max-pooling to form a global path context. These features are processed by linear layers and a softmax to produce a probability distribution $p^V_t$ over drones.
2. Graph Aggregation Module: This is the cornerstone of our dynamic context modeling. Standard methods use a static graph embedding (the mean of all node embeddings). In a drone light show, as positions get assigned, the “unassigned” part of the formation shrinks. Our module dynamically builds two separate graph embeddings:
– Visited Graph Embedding ($\hat{H}^v_t$): An aggregation of embeddings for points already assigned to any drone.
– Unvisited Graph Embedding ($\hat{H}^{unv}_t$): An aggregation of embeddings for points still available.
Each is computed by applying an attention mask over the node embeddings, then concatenating the sum and max-pooled representations to preserve both global and salient local features.
The decoder context $e^{\text{context}}_t$ is then formed by concatenating these two dynamic graph embeddings with the embeddings of the first and last visited nodes in the current partial solution:
$$ e^{\text{context}}_t = \text{concat}(\hat{H}^v_t, \hat{H}^{unv}_t, e^{\text{node}}_{\text{first}}, e^{\text{node}}_{\text{last}}) $$
This context, rich with dynamic state information, is fed into an attention layer alongside all node embeddings to compute the final logits and probabilities for point selection, conditioned on the chosen drone.
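The dynamic graph aggregation and context construction can be sketched in pure NumPy, using the sum/max pooling described above (the index arguments `first_idx` and `last_idx` are illustrative):

```python
import numpy as np

def graph_embedding(node_emb, mask):
    """Aggregate the embeddings selected by `mask` with sum- and max-pooling,
    concatenated to keep both global and salient local features."""
    sel = node_emb[mask]
    if sel.shape[0] == 0:                    # empty set, e.g. nothing visited yet
        zeros = np.zeros(node_emb.shape[1])
        return np.concatenate([zeros, zeros])
    return np.concatenate([sel.sum(axis=0), sel.max(axis=0)])

def decoder_context(node_emb, visited_mask, first_idx, last_idx):
    """Concatenate the visited/unvisited graph embeddings with the embeddings
    of the first and last visited nodes of the current partial route."""
    h_visited = graph_embedding(node_emb, visited_mask)
    h_unvisited = graph_embedding(node_emb, ~visited_mask)
    return np.concatenate([h_visited, h_unvisited,
                           node_emb[first_idx], node_emb[last_idx]])

rng = np.random.default_rng(2)
node_emb = rng.normal(size=(5, 8))           # 5 points, embedding dim 8
visited = np.array([True, True, False, False, False])
ctx = decoder_context(node_emb, visited, first_idx=0, last_idx=1)
print(ctx.shape)   # (48,): 16 + 16 + 8 + 8
```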
Training Methodology
Both stages are trained using the REINFORCE policy gradient algorithm with a rollout baseline network. The gradient for the policy network parameters $\theta$ is estimated as:
$$ \nabla_\theta \mathcal{L}(\theta|B) = -\mathbb{E}_{p_\theta(B)}\left[ (R(\pi) - R(\pi^{bl})) \nabla_\theta \log p_\theta(\pi|B) \right] $$
where $R(\pi)$ is the reward (negative maximum tour length) of the sampled solution $\pi$, and $R(\pi^{bl})$ is the reward from the greedy solution of the baseline network. The baseline network parameters are updated to match the policy network periodically, which stabilizes training and accelerates convergence. This learned policy can later be used to rapidly plan efficient trajectories for a drone light show fleet.
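In practice the update minimizes a surrogate loss whose gradient equals the estimator above. The sketch below evaluates that scalar for one sampled solution; an autograd framework would differentiate it, and the reward values are illustrative:

```python
import numpy as np

def surrogate_loss(log_probs, reward, baseline_reward):
    """Scalar whose gradient w.r.t. the policy parameters matches the
    REINFORCE estimator: -(R(pi) - R(pi_bl)) * sum_t log p(a_t)."""
    advantage = reward - baseline_reward
    return -advantage * np.sum(log_probs)

# Rewards are negative makespans: a sampled tour of 352.6 m against a greedy
# baseline tour of 368.3 m yields a positive advantage (illustrative values).
log_probs = np.log([0.4, 0.3, 0.5])          # log-probs of the sampled actions
loss = surrogate_loss(log_probs, reward=-352.6, baseline_reward=-368.3)
print(loss > 0)   # True for this positive-advantage sample
```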
Experimental Results and Analysis
We evaluate our TSDRL model on task sets derived from a 3D aircraft model, with point counts of 30, 50, and 90 (plus the depot). Each set has 20,000 instances for training and 100 for testing. The drone capacity $Q$ is set to 400 m. We compare against several baselines:
• SISRs: A state-of-the-art heuristic for VRP.
• SA: A Simulated Annealing algorithm.
• AM: The Attention Model DRL approach.
• GAT: A Graph Attention Network model for VRP.
We report the objective value (maximum tour length, in meters), computation time (seconds), number of drones used (#UAVs), and the optimality gap relative to SISRs, i.e., $(\text{Obj} - \text{Obj}_{\text{SISRs}}) / \text{Obj}_{\text{SISRs}} \times 100\%$.
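The gap column follows this relative-difference definition directly; for example, the 90-point TSDRL-S entry:

```python
def optimality_gap(obj, ref_obj):
    """Gap (%) of an objective value relative to the SISRs reference."""
    return 100.0 * (obj - ref_obj) / ref_obj

# 90-point set: TSDRL-S (352.63 m) against the SISRs reference (329.16 m)
print(round(optimality_gap(352.63, 329.16), 2))   # 7.13
```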
Performance Comparison
The following tables summarize the results for different problem scales. TSDRL-G and TSDRL-S refer to our model using greedy decoding and sampling (1000 samples) with search, respectively.
Results for the 30-point task set:
| Algorithm | Obj. (m) | Time (s) | #UAVs | Gap (%) |
|---|---|---|---|---|
| SISRs | 209.45 | 334.68 | 1 | — |
| SA | 259.52 | 133.45 | 2 | 23.91 |
| AM (Greedy) | 362.83 | 0.59 | 1 | 73.22 |
| AM (Sample 1000) | 285.46 | 5.80 | 1 | 36.29 |
| GAT (Greedy) | 302.74 | 0.96 | 1 | 44.54 |
| GAT (Sample 1000) | 248.16 | 20.15 | 1 | 18.48 |
| TSDRL-G | 296.43 | 0.57 | 1 | 41.52 |
| TSDRL-S | 248.16 | 5.80 | 1 | 18.48 |
Results for the 50-point task set:
| Algorithm | Obj. (m) | Time (s) | #UAVs | Gap (%) |
|---|---|---|---|---|
| SISRs | 257.12 | 589.46 | 2 | — |
| SA | 348.62 | 317.23 | 3 | 35.58 |
| AM (Greedy) | 371.96 | 0.62 | 2 | 44.66 |
| AM (Sample 1000) | 350.20 | 6.60 | 2 | 36.20 |
| GAT (Greedy) | 367.87 | 0.88 | 2 | 43.07 |
| GAT (Sample 1000) | 344.94 | 24.48 | 2 | 34.15 |
| TSDRL-G | 354.21 | 0.59 | 2 | 37.76 |
| TSDRL-S | 334.59 | 6.40 | 2 | 30.12 |
Results for the 90-point task set:
| Algorithm | Obj. (m) | Time (s) | #UAVs | Gap (%) |
|---|---|---|---|---|
| SISRs | 329.16 | 1108.47 | 4 | — |
| SA | 366.55 | 784.62 | 5 | 11.30 |
| AM (Greedy) | 389.45 | 0.66 | 4 | 18.31 |
| AM (Sample 1000) | 367.25 | 8.80 | 4 | 11.57 |
| GAT (Greedy) | 370.86 | 1.33 | 4 | 12.66 |
| GAT (Sample 1000) | 364.83 | 32.15 | 4 | 10.83 |
| TSDRL-G | 368.32 | 0.61 | 4 | 11.89 |
| TSDRL-S | 352.63 | 8.60 | 4 | 7.13 |
Key Observations:
1. Solution Quality: TSDRL-S achieves objective values at least as good as every other learning-based method (AM, GAT, TSDRL-G) across all scales, and its optimality gap to the powerful heuristic SISRs is the smallest among the DRL models.
2. Computation Time: The inference time of TSDRL is orders of magnitude faster than heuristic solvers (SISRs, SA). While SISRs’ time grows rapidly with problem size, TSDRL’s time remains nearly constant and very low (under 9 seconds even with sampling). This makes it highly suitable for real-time or rapid planning scenarios, a necessity for dynamic drone light show choreography.
3. Model Superiority: TSDRL outperforms both AM and GAT, demonstrating the effectiveness of our architectural innovations, particularly the dynamic Graph Aggregation Module.
Ablation Study
We conduct an ablation study on the 90-point set to validate the contributions of our core modules:
• TSDRL-N1: Removes the Graph Aggregation Module (keeps Dual BN & Gating).
• TSDRL-N2: Removes the Dual BN & Gating (keeps Graph Aggregation).
• TSDRL-N3: Removes both (equivalent to a baseline similar to AM).
• Full TSDRL: Our complete model.
| Ablation Model | Obj. with TSDRL-S (m) | Gap (%) |
|---|---|---|
| TSDRL-N1 | 362.71 | 10.19 |
| TSDRL-N2 | 358.64 | 8.96 |
| TSDRL-N3 | 367.25 | 11.57 |
| Full TSDRL | 352.63 | 7.13 |
The results clearly show that each component contributes to the final performance. The Full TSDRL model achieves the lowest gap, indicating that the Dual BN/Gating and the Graph Aggregation Module are complementary and synergistic. The Graph Aggregation Module (N2 vs. N3) provides a substantial boost, underlining the importance of dynamic context awareness for effective planning in complex multi-agent tasks like a drone light show.
Generalization Test
To test generalization, we trained the TSDRL model on a 90-point dataset from one aircraft model and evaluated it on different-sized datasets (20, 40, 80 points) from another aircraft model without fine-tuning. TSDRL-S maintained superior performance compared to AM and GAT across these unseen scales and geometries, demonstrating its strong generalization capability. This robustness is essential for applying the planner to various drone light show formations or different inspection targets.
Conclusion and Future Work
This paper presented a Two-Stage Deep Reinforcement Learning framework for solving the Multi-UAV Cooperative Mission Planning problem, modeled as a three-dimensional Capacitated Vehicle Routing Problem. The core innovation lies in the second stage’s policy network, which features an encoder with dual Batch Normalization and gated aggregation for robust feature extraction, and a decoder with a Drone Selection Module and a novel Graph Aggregation Module. The Graph Aggregation Module dynamically models the evolving planning context by maintaining separate representations for visited and unvisited nodes, a mechanism that is intuitively powerful for managing the state of a large-scale drone light show as positions are assigned.
Extensive simulations demonstrated that TSDRL outperforms existing DRL methods in solution quality and is competitive with advanced heuristic solvers while being drastically faster in computation. The model also exhibits excellent generalization to different problem instances and scales. These attributes make it a highly promising approach for real-world applications requiring rapid and high-quality multi-agent coordination, from industrial inspection to the precise and efficient choreography of breathtaking drone light shows.
Future work will focus on extending the model to handle more complex constraints, such as multiple depots, heterogeneous drone fleets, and dynamic environments where tasks appear online. Investigating the integration of this planner with real-time trajectory generation and collision avoidance systems for live drone light show operations is also a critical next step.
