In recent years, drone technology has been widely applied in various fields such as data collection, mobile communication, and emergency response. Multi-UAV cooperative systems offer significant advantages in mission coverage, execution efficiency, and robustness compared to single-UAV systems. However, cooperative obstacle avoidance route planning in cluttered three-dimensional environments remains a challenging problem, especially when each UAV operates under partial observability and dynamic interactions among agents cause non-stationarity. Traditional multi-agent reinforcement learning (MARL) methods often suffer from slow convergence and insufficient coordination. To address these issues, we propose a prior-guided temporal fusion value decomposition algorithm, termed PGL-QMIX, which integrates global heuristic guidance with local temporal modeling to enhance learning efficiency and cooperative behavior.

Figure above illustrates a typical multi-UAV system operating in a bounded airspace with static obstacles. Each UAV must navigate from its start point to a designated target while avoiding collisions with obstacles and other UAVs, minimizing total flight distance. This problem can be formulated as a Partially Observable Markov Decision Process (POMDP).
Problem Formulation
We consider a set of N UAVs denoted by U = {U1, U2, …, UN} and a set of M static obstacles O = {o1, o2, …, oM}. The environment is discretized into a three-dimensional grid. At each discrete time step t, UAV i takes an action ai from a discrete action space A (i.e., +x, -x, +y, -y, +z, -z, stay) and transitions to a new position pi(t). The objective is to minimize the total path length of all UAVs while satisfying safety constraints:
$$
\min \sum_{i=1}^{N} d_{L_i}
$$
subject to:
$$
0 \le d_{L_i} \le d_{L_{\max}}, \quad \forall i
$$
$$
\|\mathbf{p}_i(t) – \mathbf{p}_j(t)\| \ge d_{\min}, \quad \forall i \neq j, \forall t
$$
$$
\|\mathbf{p}_i(t) – \mathbf{o}_k\| \ge d_{\min}, \quad \forall i, \forall k, \forall t
$$
$$
\mathbf{p}_i(0) = \mathbf{ST}_i, \quad \mathbf{p}_i(T_i) = \mathbf{TA}_i
$$
where dmin is the minimum safety distance and dLmax is the maximum allowable path length.
The POMDP is defined as a tuple $\langle \mathcal{S}, \mathcal{A}, P, R, \mathcal{O}, \Omega, \gamma \rangle$. The global state $\mathbf{s}_t$ is only accessible during centralized training. Each UAV i receives a local observation $\mathbf{o}_{i,t}$ which includes its current position, target position, distance to obstacles, distance to other UAVs, and distance to the nearest reference path node within a spherical sensing radius $r_o$. The observation is defined as:
$$
\mathbf{o}_{i,t} = \{ \mathbf{p}_{i,t}, \mathbf{p}_{i}^{\text{goal}}, d_{i,t}^{\text{obstacle}}, d_{i,t}^{\text{uav}}, d_{i,t}^{\text{ref}} \}
$$
The reward function combines path efficiency and safety constraints using potential-based shaping. The team reward at time step t is:
$$
r_t = -\mu – \rho_o(t) – \rho_u(t) + (\gamma \Phi(\mathbf{s}_{t+1}) – \Phi(\mathbf{s}_t)) + r_{\text{goal}} I_{\text{goal}} – r_{\text{coll}} I_{\text{coll}}
$$
where $\rho_o(t)$ and $\rho_u(t)$ are soft penalty terms for proximity to obstacles and other UAVs, respectively. $\Phi(\mathbf{s}_t) = \kappa_g \bar{d}_g(t) + \kappa_r \bar{d}_{\text{ref}}(t)$ is a potential function that encourages progress toward the target and alignment with the A* reference path. The discount factor $\gamma$ balances immediate and long-term rewards.
Prior-Guided Temporal Fusion Value Decomposition (PGL-QMIX)
Offline Prior Generation
In the offline stage, we compute a shortest reference path for each UAV using the A* algorithm on the static grid map. The A* evaluation function is:
$$
f(n) = g(n) + h(n)
$$
where $g(n)$ is the cost from the start node to node n, and $h(n)$ is a heuristic estimate of the remaining cost to the target. This reference path provides global geometric trend information.
Local Prior Extraction and Scoring
During online execution, each UAV only observes the portion of the reference path that lies within its sensing radius $r_o$. For UAV i, the local prior segment is:
$$
\Omega^{(i)}_t = \{ P^i_j \in P^i_{\text{ref}} \mid \|P^i_j – \mathbf{p}_i(t)\| \le r_o \}
$$
For each node in this segment, we compute a heuristic score that reflects its current importance:
$$
\omega^{(i)}_j(t) = \frac{d^{\text{obs}}_i(P^i_j) + \min_k d^{\text{uav}}_{ik}(P^i_j)}{d^{\text{ref}}_i(P^i_j) + d^{g}_i(P^i_j) + 1}
$$
where $d^{\text{obs}}_i$ is the distance from the node to the nearest obstacle, $d^{\text{uav}}_{ik}$ is the distance to the nearest other UAV, $d^{\text{ref}}_i$ is the distance to the current UAV position, and $d^{g}_i$ is the distance to the target. Higher scores indicate nodes with higher reference value at the current moment.
Individual-Level Temporal Fusion
The local observation $\mathbf{o}_{i,t}$ and the score vector $\omega^{(i)}_j(t)$ are concatenated and fed into an individual LSTM network. This LSTM captures temporal dependencies between local dynamics and prior guidance. The forward pass is:
$$
\begin{aligned}
\mathbf{x}_{i,t} &= [\mathbf{o}_{i,t}; \omega^{(i)}_j(t)] \\
\mathbf{f}_{i,t} &= \sigma(\mathbf{W}_f [\mathbf{h}_{i,t-1}, \mathbf{x}_{i,t}] + \mathbf{b}_f) \\
\mathbf{i}_{i,t} &= \sigma(\mathbf{W}_i [\mathbf{h}_{i,t-1}, \mathbf{x}_{i,t}] + \mathbf{b}_i) \\
\tilde{\mathbf{C}}_{i,t} &= \tanh(\mathbf{W}_c [\mathbf{h}_{i,t-1}, \mathbf{x}_{i,t}] + \mathbf{b}_c) \\
\mathbf{C}_{i,t} &= \mathbf{f}_{i,t} \odot \mathbf{C}_{i,t-1} + \mathbf{i}_{i,t} \odot \tilde{\mathbf{C}}_{i,t} \\
\mathbf{o}_{i,t}^{\text{LSTM}} &= \sigma(\mathbf{W}_o [\mathbf{h}_{i,t-1}, \mathbf{x}_{i,t}] + \mathbf{b}_o) \\
\mathbf{h}_{i,t} &= \mathbf{o}_{i,t}^{\text{LSTM}} \odot \tanh(\mathbf{C}_{i,t})
\end{aligned}
$$
The hidden state $\mathbf{h}_{i,t}$ serves as a temporal feature representing both the local environment and the prior-guided trend.
System-Level Temporal Mixing
To capture inter-agent dependencies and non-stationarity, we introduce a system-level LSTM that operates on the concatenation of all individual hidden states:
$$
\mathbf{m}_t = \text{Concat}\left( \mathbf{h}_{1,t}, \mathbf{h}_{2,t}, \dots, \mathbf{h}_{N,t} \right)
$$
$$
[\mathbf{z}_t, \mathbf{C}_t] = \text{LSTM}_{\text{mix}}(\mathbf{m}_t, [\mathbf{z}_{t-1}, \mathbf{C}_{t-1}]; \theta_m)
$$
From the system-level hidden state $\mathbf{z}_t$, we generate time-varying mixing weights and bias:
$$
W_i(t) = \text{SoftPlus}\left( \mathbf{W}^{(L)}_i \mathbf{z}_t + \mathbf{b}^{(L)}_i \right)
$$
$$
b(t) = \mathbf{W}^{(L)}_b \mathbf{z}_t + \mathbf{b}^{(L)}_b
$$
The individual action-value function for UAV i is defined as:
$$
Q_i(\mathbf{h}_{i,t}, a_{i,t}; \theta_i) = \mathbb{E}\left[ r_t + \gamma \max_{a’} Q_i(\mathbf{h}_{i,t+1}, a’; \theta_i^-) \right]
$$
The joint action-value function is then computed as a monotonic weighted sum:
$$
Q_{\text{tot}}(\mathbf{h}_t, \mathbf{a}_t; \Theta) = \sum_{i=1}^{N} W_i(t) Q_i(\mathbf{h}_{i,t}, a_{i,t}; \theta_i) + b(t)
$$
where $\Theta = \{\theta_1, \dots, \theta_N, \theta_m, \mathbf{W}^{(L)}, \mathbf{b}^{(L)}\}$. The network is trained to minimize the temporal-difference error:
$$
\mathcal{L}(\theta) = \sum_{t=0}^{T} \left[ Y_t – Q_{\text{tot}, t}(\mathbf{h}_t, \mathbf{a}_t; \theta) \right]^2
$$
where $Y_t = R(t) + \gamma \max_{\mathbf{a}’} Q_{\text{tot}, t+1}'(\mathbf{h}_{t+1}, \mathbf{a}’; \theta^-)$. Target network parameters $\theta^-$ are updated via soft updates.
Complexity Analysis
Let N be the number of UAVs, V be the number of grid nodes, A be the action space size, Din be the input dimension, and Hind be the LSTM hidden dimension. Online execution complexity per time step is $O(N \cdot (H_{\text{ind}}^2 + D_{\text{in}} H_{\text{ind}} + |A| + K))$, where K is the number of local prior nodes (constant given fixed sensing radius). Memory scales linearly with N, avoiding exponential explosion of joint action space.
Experimental Results
Simulation Setup
We evaluate our method in three voxelized 3D grid environments of sizes 203, 303, and 403 with unit voxel side length. Four UAVs are deployed with a spherical sensing radius of 3 units and minimum safety distance of 1 unit. Training runs for 150,000 steps, and performance is evaluated every 200 steps on 128 random maps. The table below summarizes hyperparameters.
| Parameter | Value |
|---|---|
| Discount factor $\gamma$ | 0.99 |
| Replay buffer size | 10,000 |
| Batch size | 128 |
| Total training steps | 150,000 |
| Max steps per episode | 200 |
| Initial learning rate | $5 \times 10^{-4}$ |
| Optimizer | Adam |
| LSTM hidden units | 128 |
| Fully connected hidden units | 128 |
| Minimum exploration rate | 0.01 |
Convergence Performance
Figure 2 (not shown here) compares the reward curves of seven algorithms: PGL-QMIX, No-Prior, No-iLSTM, No-sLSTM, MAPPO, QMIX, and VDN. Our method achieves faster convergence and higher steady-state rewards in all three scenarios. Quantitative results are provided in the table below.
| Scenario | Algorithm | AUC (×104) | Steady-state avg reward | Convergence steps (×104) |
|---|---|---|---|---|
| 203 | PGL-QMIX | 16.911 | 137.57 | 3.24 |
| No-Prior | 16.321 | 134.25 | 3.74 | |
| No-iLSTM | 15.705 | 130.87 | 4.06 | |
| No-sLSTM | 16.125 | 130.89 | 3.34 | |
| MAPPO | 16.028 | 131.16 | 3.58 | |
| QMIX | 15.512 | 127.67 | 4.27 | |
| VDN | 14.356 | 127.46 | 4.62 | |
| 303 | PGL-QMIX | 25.406 | 209.87 | 3.36 |
| No-Prior | 24.329 | 201.88 | 3.90 | |
| No-iLSTM | 23.887 | 197.42 | 3.96 | |
| No-sLSTM | 24.023 | 198.90 | 3.62 | |
| MAPPO | 23.922 | 199.37 | 3.92 | |
| QMIX | 23.032 | 198.46 | 4.64 | |
| VDN | 22.350 | 192.91 | 4.88 | |
| 403 | PGL-QMIX | 33.139 | 273.59 | 3.74 |
| No-Prior | 32.127 | 269.02 | 4.24 | |
| No-iLSTM | 29.474 | 262.21 | 4.32 | |
| No-sLSTM | 32.017 | 267.05 | 4.04 | |
| MAPPO | 30.461 | 265.40 | 4.22 | |
| QMIX | 28.281 | 261.32 | 4.82 | |
| VDN | 26.819 | 253.85 | 6.08 |
The area under the reward curve (AUC) of PGL-QMIX is the highest in all scenarios, indicating superior overall learning efficiency. The convergence steps to reach 80% success rate are also significantly lower.
Task Success Rate
Table 3 shows the success rate performance. Our method achieves the highest steady-state average success rates: 96.7%, 94.3%, and 92.9% in the three scenarios, respectively, and a maximum success rate of 100% in every scenario. The convergence to 80% success rate occurs at 41,000, 51,000, and 56,000 steps for 203, 303, and 403, respectively.
| Scenario | Algorithm | First 80% step (×104) | Steady avg success (%) | Max success (%) |
|---|---|---|---|---|
| 203 | PGL-QMIX | 4.14 | 95.73 | 100.0 |
| No-Prior | 5.44 | 94.18 | 98.4 | |
| No-iLSTM | 5.28 | 89.10 | 98.4 | |
| No-sLSTM | 4.90 | 94.47 | 99.2 | |
| MAPPO | 4.88 | 93.23 | 99.2 | |
| QMIX | 5.14 | 90.50 | 98.4 | |
| VDN | 6.08 | 79.93 | 96.9 | |
| 303 | PGL-QMIX | 5.06 | 94.03 | 100.0 |
| No-Prior | 8.50 | 87.92 | 96.9 | |
| No-iLSTM | 9.48 | 84.72 | 93.8 | |
| No-sLSTM | 5.72 | 89.62 | 96.9 | |
| MAPPO | 6.78 | 88.27 | 93.8 | |
| QMIX | 11.22 | 79.04 | 92.2 | |
| VDN | 8.98 | 66.47 | 84.4 | |
| 403 | PGL-QMIX | 5.64 | 92.92 | 100.0 |
| No-Prior | 10.94 | 82.30 | 89.8 | |
| No-iLSTM | 14.08 | 74.76 | 82.0 | |
| No-sLSTM | 6.24 | 84.80 | 91.2 | |
| MAPPO | 9.88 | 84.73 | 89.8 | |
| QMIX | – | 57.67 | 65.6 | |
| VDN | – | 55.94 | 78.1 |
Path Efficiency
We evaluate the average path length over 128 random test maps. Table 4 reports the minimum obstacle distance, average obstacle distance, inter-UAV nearest distance, average path length, and maximum path length. Our method produces the shortest average path length across all scenarios, closely approximating the A* reference baseline. Compared with QMIX, PGL-QMIX reduces the average path length by 8.8%, 12.3%, and 16.1% for 203, 303, and 403, respectively.
| Scenario | Algorithm | Min obst dist | Avg obst dist | Inter-UAV min dist | Avg path len | Max path len |
|---|---|---|---|---|---|---|
| 203 | PGL-QMIX | 2.73 | 6.03 | 14.04 | 58.08 | 73.25 |
| No-Prior | 2.41 | 5.93 | 12.71 | 61.95 | 94.75 | |
| No-iLSTM | 2.24 | 5.84 | 11.09 | 69.65 | 99.75 | |
| No-sLSTM | 2.24 | 5.51 | 11.46 | 68.84 | 97.25 | |
| MAPPO | 2.41 | 5.97 | 11.47 | 64.84 | 93.75 | |
| QMIX | 2.00 | 5.22 | 8.06 | 73.20 | 100.75 | |
| VDN | 1.71 | 5.11 | 6.20 | 76.73 | 108.25 | |
| 303 | PGL-QMIX | 2.41 | 6.54 | 17.26 | 92.41 | 119.25 |
| No-Prior | 2.00 | 6.32 | 15.81 | 100.97 | 137.00 | |
| No-iLSTM | 1.41 | 5.62 | 13.85 | 107.30 | 156.50 | |
| No-sLSTM | 1.73 | 6.11 | 12.17 | 108.79 | 148.50 | |
| MAPPO | 2.00 | 6.21 | 16.14 | 101.84 | 141.50 | |
| QMIX | 1.41 | 5.41 | 11.52 | 114.56 | 156.25 | |
| VDN | 1.41 | 5.22 | 10.47 | 118.79 | 163.00 | |
| 403 | PGL-QMIX | 2.24 | 5.86 | 29.27 | 119.65 | 139.25 |
| No-Prior | 1.73 | 5.08 | 20.35 | 134.65 | 147.25 | |
| No-iLSTM | 1.00 | 4.48 | 18.15 | 145.07 | 170.25 | |
| No-sLSTM | 1.41 | 4.88 | 17.46 | 139.07 | 159.25 | |
| MAPPO | 1.73 | 5.12 | 19.79 | 132.98 | 149.50 | |
| QMIX | 1.00 | 4.50 | 13.42 | 148.52 | 187.50 | |
| VDN | 1.00 | 3.80 | 12.17 | 153.71 | 200.00 |
The results demonstrate that the combination of prior guidance and temporal fusion leads to more efficient paths with fewer oscillations and shorter detours. The ablation studies show that removing prior knowledge (No-Prior) or individual LSTM (No-iLSTM) significantly degrades performance, while removing system-level LSTM (No-sLSTM) also reduces coordination and increases path length.
Conclusion
In this work, we presented a prior-guided temporal fusion value decomposition algorithm for multi-UAV cooperative obstacle avoidance route planning. By integrating offline A* reference paths as weak priors with a dual-LSTM architecture for temporal modeling, our method achieves faster convergence, higher success rates, and shorter path lengths compared to state-of-the-art baselines. Drone technology stands to benefit greatly from such intelligent cooperative planning in safety-critical applications. Future work will incorporate realistic UAV dynamics and communication-aware cooperative obstacle avoidance to further enhance the practical applicability of drone technology in complex environments. We believe that the proposed PGL-QMIX framework provides a solid foundation for advancing autonomous multi-UAV systems.
