Prior-Guided Temporal Fusion for Multi-UAV Cooperative Obstacle Avoidance Route Planning

In recent years, drone technology has been widely applied in various fields such as data collection, mobile communication, and emergency response. Multi-UAV cooperative systems offer significant advantages in mission coverage, execution efficiency, and robustness compared to single-UAV systems. However, cooperative obstacle avoidance route planning in cluttered three-dimensional environments remains a challenging problem, especially when each UAV operates under partial observability and dynamic interactions among agents cause non-stationarity. Traditional multi-agent reinforcement learning (MARL) methods often suffer from slow convergence and insufficient coordination. To address these issues, we propose a prior-guided temporal fusion value decomposition algorithm, termed PGL-QMIX, which integrates global heuristic guidance with local temporal modeling to enhance learning efficiency and cooperative behavior.

Figure above illustrates a typical multi-UAV system operating in a bounded airspace with static obstacles. Each UAV must navigate from its start point to a designated target while avoiding collisions with obstacles and other UAVs, minimizing total flight distance. This problem can be formulated as a Partially Observable Markov Decision Process (POMDP).

Problem Formulation

We consider a set of N UAVs denoted by U = {U₁, U₂, …, U_N} and a set of M static obstacles O = {o₁, o₂, …, o_M}. The environment is discretized into a three-dimensional grid. At each discrete time step t, UAV i takes an action a_i from a discrete action space A (i.e., +x, -x, +y, -y, +z, -z, stay) and transitions to a new position p_i(t). The objective is to minimize the total path length of all UAVs while satisfying safety constraints:

$$
\min \sum_{i=1}^{N} d_{L_i}
$$

subject to:

$$
0 \le d_{L_i} \le d_{L_{\max}}, \quad \forall i
$$

$$
\|\mathbf{p}_i(t) – \mathbf{p}_j(t)\| \ge d_{\min}, \quad \forall i \neq j, \forall t
$$

$$
\|\mathbf{p}_i(t) – \mathbf{o}_k\| \ge d_{\min}, \quad \forall i, \forall k, \forall t
$$

$$
\mathbf{p}_i(0) = \mathbf{ST}_i, \quad \mathbf{p}_i(T_i) = \mathbf{TA}_i
$$

where d_min is the minimum safety distance and d_Lmax is the maximum allowable path length.

The POMDP is defined as a tuple $\langle \mathcal{S}, \mathcal{A}, P, R, \mathcal{O}, \Omega, \gamma \rangle$. The global state $\mathbf{s}_t$ is only accessible during centralized training. Each UAV i receives a local observation $\mathbf{o}_{i,t}$ which includes its current position, target position, distance to obstacles, distance to other UAVs, and distance to the nearest reference path node within a spherical sensing radius $r_o$. The observation is defined as:

$$
\mathbf{o}_{i,t} = \{ \mathbf{p}_{i,t}, \mathbf{p}_{i}^{\text{goal}}, d_{i,t}^{\text{obstacle}}, d_{i,t}^{\text{uav}}, d_{i,t}^{\text{ref}} \}
$$

The reward function combines path efficiency and safety constraints using potential-based shaping. The team reward at time step t is:

$$
r_t = -\mu – \rho_o(t) – \rho_u(t) + (\gamma \Phi(\mathbf{s}_{t+1}) – \Phi(\mathbf{s}_t)) + r_{\text{goal}} I_{\text{goal}} – r_{\text{coll}} I_{\text{coll}}
$$

where $\rho_o(t)$ and $\rho_u(t)$ are soft penalty terms for proximity to obstacles and other UAVs, respectively. $\Phi(\mathbf{s}_t) = \kappa_g \bar{d}_g(t) + \kappa_r \bar{d}_{\text{ref}}(t)$ is a potential function that encourages progress toward the target and alignment with the A* reference path. The discount factor $\gamma$ balances immediate and long-term rewards.

Prior-Guided Temporal Fusion Value Decomposition (PGL-QMIX)

Offline Prior Generation

In the offline stage, we compute a shortest reference path for each UAV using the A* algorithm on the static grid map. The A* evaluation function is:

$$
f(n) = g(n) + h(n)
$$

where $g(n)$ is the cost from the start node to node n, and $h(n)$ is a heuristic estimate of the remaining cost to the target. This reference path provides global geometric trend information.

Local Prior Extraction and Scoring

During online execution, each UAV only observes the portion of the reference path that lies within its sensing radius $r_o$. For UAV i, the local prior segment is:

$$
\Omega^{(i)}_t = \{ P^i_j \in P^i_{\text{ref}} \mid \|P^i_j – \mathbf{p}_i(t)\| \le r_o \}
$$

For each node in this segment, we compute a heuristic score that reflects its current importance:

$$
\omega^{(i)}_j(t) = \frac{d^{\text{obs}}_i(P^i_j) + \min_k d^{\text{uav}}_{ik}(P^i_j)}{d^{\text{ref}}_i(P^i_j) + d^{g}_i(P^i_j) + 1}
$$

where $d^{\text{obs}}_i$ is the distance from the node to the nearest obstacle, $d^{\text{uav}}_{ik}$ is the distance to the nearest other UAV, $d^{\text{ref}}_i$ is the distance to the current UAV position, and $d^{g}_i$ is the distance to the target. Higher scores indicate nodes with higher reference value at the current moment.

Individual-Level Temporal Fusion

The local observation $\mathbf{o}_{i,t}$ and the score vector $\omega^{(i)}_j(t)$ are concatenated and fed into an individual LSTM network. This LSTM captures temporal dependencies between local dynamics and prior guidance. The forward pass is:

$$
\begin{aligned}
\mathbf{x}_{i,t} &= [\mathbf{o}_{i,t}; \omega^{(i)}_j(t)] \\
\mathbf{f}_{i,t} &= \sigma(\mathbf{W}_f [\mathbf{h}_{i,t-1}, \mathbf{x}_{i,t}] + \mathbf{b}_f) \\
\mathbf{i}_{i,t} &= \sigma(\mathbf{W}_i [\mathbf{h}_{i,t-1}, \mathbf{x}_{i,t}] + \mathbf{b}_i) \\
\tilde{\mathbf{C}}_{i,t} &= \tanh(\mathbf{W}_c [\mathbf{h}_{i,t-1}, \mathbf{x}_{i,t}] + \mathbf{b}_c) \\
\mathbf{C}_{i,t} &= \mathbf{f}_{i,t} \odot \mathbf{C}_{i,t-1} + \mathbf{i}_{i,t} \odot \tilde{\mathbf{C}}_{i,t} \\
\mathbf{o}_{i,t}^{\text{LSTM}} &= \sigma(\mathbf{W}_o [\mathbf{h}_{i,t-1}, \mathbf{x}_{i,t}] + \mathbf{b}_o) \\
\mathbf{h}_{i,t} &= \mathbf{o}_{i,t}^{\text{LSTM}} \odot \tanh(\mathbf{C}_{i,t})
\end{aligned}
$$

The hidden state $\mathbf{h}_{i,t}$ serves as a temporal feature representing both the local environment and the prior-guided trend.

System-Level Temporal Mixing

To capture inter-agent dependencies and non-stationarity, we introduce a system-level LSTM that operates on the concatenation of all individual hidden states:

$$
\mathbf{m}_t = \text{Concat}\left( \mathbf{h}_{1,t}, \mathbf{h}_{2,t}, \dots, \mathbf{h}_{N,t} \right)
$$

$$
[\mathbf{z}_t, \mathbf{C}_t] = \text{LSTM}_{\text{mix}}(\mathbf{m}_t, [\mathbf{z}_{t-1}, \mathbf{C}_{t-1}]; \theta_m)
$$

From the system-level hidden state $\mathbf{z}_t$, we generate time-varying mixing weights and bias:

$$
W_i(t) = \text{SoftPlus}\left( \mathbf{W}^{(L)}_i \mathbf{z}_t + \mathbf{b}^{(L)}_i \right)
$$

$$
b(t) = \mathbf{W}^{(L)}_b \mathbf{z}_t + \mathbf{b}^{(L)}_b
$$

The individual action-value function for UAV i is defined as:

$$
Q_i(\mathbf{h}_{i,t}, a_{i,t}; \theta_i) = \mathbb{E}\left[ r_t + \gamma \max_{a’} Q_i(\mathbf{h}_{i,t+1}, a’; \theta_i^-) \right]
$$

The joint action-value function is then computed as a monotonic weighted sum:

$$
Q_{\text{tot}}(\mathbf{h}_t, \mathbf{a}_t; \Theta) = \sum_{i=1}^{N} W_i(t) Q_i(\mathbf{h}_{i,t}, a_{i,t}; \theta_i) + b(t)
$$

where $\Theta = \{\theta_1, \dots, \theta_N, \theta_m, \mathbf{W}^{(L)}, \mathbf{b}^{(L)}\}$. The network is trained to minimize the temporal-difference error:

$$
\mathcal{L}(\theta) = \sum_{t=0}^{T} \left[ Y_t – Q_{\text{tot}, t}(\mathbf{h}_t, \mathbf{a}_t; \theta) \right]^2
$$

where $Y_t = R(t) + \gamma \max_{\mathbf{a}’} Q_{\text{tot}, t+1}'(\mathbf{h}_{t+1}, \mathbf{a}’; \theta^-)$. Target network parameters $\theta^-$ are updated via soft updates.

Complexity Analysis

Let N be the number of UAVs, V be the number of grid nodes, A be the action space size, D_in be the input dimension, and H_ind be the LSTM hidden dimension. Online execution complexity per time step is $O(N \cdot (H_{\text{ind}}^2 + D_{\text{in}} H_{\text{ind}} + |A| + K))$, where K is the number of local prior nodes (constant given fixed sensing radius). Memory scales linearly with N, avoiding exponential explosion of joint action space.

Experimental Results

Simulation Setup

We evaluate our method in three voxelized 3D grid environments of sizes 20³, 30³, and 40³ with unit voxel side length. Four UAVs are deployed with a spherical sensing radius of 3 units and minimum safety distance of 1 unit. Training runs for 150,000 steps, and performance is evaluated every 200 steps on 128 random maps. The table below summarizes hyperparameters.

Table 1: Hyperparameters used in training
Parameter	Value
Discount factor $\gamma$	0.99
Replay buffer size	10,000
Batch size	128
Total training steps	150,000
Max steps per episode	200
Initial learning rate	$5 \times 10^{-4}$
Optimizer	Adam
LSTM hidden units	128
Fully connected hidden units	128
Minimum exploration rate	0.01

Convergence Performance

Figure 2 (not shown here) compares the reward curves of seven algorithms: PGL-QMIX, No-Prior, No-iLSTM, No-sLSTM, MAPPO, QMIX, and VDN. Our method achieves faster convergence and higher steady-state rewards in all three scenarios. Quantitative results are provided in the table below.

Table 2: Convergence performance comparison
Scenario	Algorithm	AUC (×10⁴)	Steady-state avg reward	Convergence steps (×10⁴)
20³	PGL-QMIX	16.911	137.57	3.24
	No-Prior	16.321	134.25	3.74
	No-iLSTM	15.705	130.87	4.06
	No-sLSTM	16.125	130.89	3.34
	MAPPO	16.028	131.16	3.58
	QMIX	15.512	127.67	4.27
	VDN	14.356	127.46	4.62
30³	PGL-QMIX	25.406	209.87	3.36
	No-Prior	24.329	201.88	3.90
	No-iLSTM	23.887	197.42	3.96
	No-sLSTM	24.023	198.90	3.62
	MAPPO	23.922	199.37	3.92
	QMIX	23.032	198.46	4.64
	VDN	22.350	192.91	4.88
40³	PGL-QMIX	33.139	273.59	3.74
	No-Prior	32.127	269.02	4.24
	No-iLSTM	29.474	262.21	4.32
	No-sLSTM	32.017	267.05	4.04
	MAPPO	30.461	265.40	4.22
	QMIX	28.281	261.32	4.82
	VDN	26.819	253.85	6.08

The area under the reward curve (AUC) of PGL-QMIX is the highest in all scenarios, indicating superior overall learning efficiency. The convergence steps to reach 80% success rate are also significantly lower.

Task Success Rate

Table 3 shows the success rate performance. Our method achieves the highest steady-state average success rates: 96.7%, 94.3%, and 92.9% in the three scenarios, respectively, and a maximum success rate of 100% in every scenario. The convergence to 80% success rate occurs at 41,000, 51,000, and 56,000 steps for 20³, 30³, and 40³, respectively.

Table 3: Task success rate comparison
Scenario	Algorithm	First 80% step (×10⁴)	Steady avg success (%)	Max success (%)
20³	PGL-QMIX	4.14	95.73	100.0
	No-Prior	5.44	94.18	98.4
	No-iLSTM	5.28	89.10	98.4
	No-sLSTM	4.90	94.47	99.2
	MAPPO	4.88	93.23	99.2
	QMIX	5.14	90.50	98.4
	VDN	6.08	79.93	96.9
30³	PGL-QMIX	5.06	94.03	100.0
	No-Prior	8.50	87.92	96.9
	No-iLSTM	9.48	84.72	93.8
	No-sLSTM	5.72	89.62	96.9
	MAPPO	6.78	88.27	93.8
	QMIX	11.22	79.04	92.2
	VDN	8.98	66.47	84.4
40³	PGL-QMIX	5.64	92.92	100.0
	No-Prior	10.94	82.30	89.8
	No-iLSTM	14.08	74.76	82.0
	No-sLSTM	6.24	84.80	91.2
	MAPPO	9.88	84.73	89.8
	QMIX	–	57.67	65.6
	VDN	–	55.94	78.1

Path Efficiency

We evaluate the average path length over 128 random test maps. Table 4 reports the minimum obstacle distance, average obstacle distance, inter-UAV nearest distance, average path length, and maximum path length. Our method produces the shortest average path length across all scenarios, closely approximating the A* reference baseline. Compared with QMIX, PGL-QMIX reduces the average path length by 8.8%, 12.3%, and 16.1% for 20³, 30³, and 40³, respectively.

Table 4: Path length comparison
Scenario	Algorithm	Min obst dist	Avg obst dist	Inter-UAV min dist	Avg path len	Max path len
20³	PGL-QMIX	2.73	6.03	14.04	58.08	73.25
	No-Prior	2.41	5.93	12.71	61.95	94.75
	No-iLSTM	2.24	5.84	11.09	69.65	99.75
	No-sLSTM	2.24	5.51	11.46	68.84	97.25
	MAPPO	2.41	5.97	11.47	64.84	93.75
	QMIX	2.00	5.22	8.06	73.20	100.75
	VDN	1.71	5.11	6.20	76.73	108.25
30³	PGL-QMIX	2.41	6.54	17.26	92.41	119.25
	No-Prior	2.00	6.32	15.81	100.97	137.00
	No-iLSTM	1.41	5.62	13.85	107.30	156.50
	No-sLSTM	1.73	6.11	12.17	108.79	148.50
	MAPPO	2.00	6.21	16.14	101.84	141.50
	QMIX	1.41	5.41	11.52	114.56	156.25
	VDN	1.41	5.22	10.47	118.79	163.00
40³	PGL-QMIX	2.24	5.86	29.27	119.65	139.25
	No-Prior	1.73	5.08	20.35	134.65	147.25
	No-iLSTM	1.00	4.48	18.15	145.07	170.25
	No-sLSTM	1.41	4.88	17.46	139.07	159.25
	MAPPO	1.73	5.12	19.79	132.98	149.50
	QMIX	1.00	4.50	13.42	148.52	187.50
	VDN	1.00	3.80	12.17	153.71	200.00

The results demonstrate that the combination of prior guidance and temporal fusion leads to more efficient paths with fewer oscillations and shorter detours. The ablation studies show that removing prior knowledge (No-Prior) or individual LSTM (No-iLSTM) significantly degrades performance, while removing system-level LSTM (No-sLSTM) also reduces coordination and increases path length.

Conclusion

In this work, we presented a prior-guided temporal fusion value decomposition algorithm for multi-UAV cooperative obstacle avoidance route planning. By integrating offline A* reference paths as weak priors with a dual-LSTM architecture for temporal modeling, our method achieves faster convergence, higher success rates, and shorter path lengths compared to state-of-the-art baselines. Drone technology stands to benefit greatly from such intelligent cooperative planning in safety-critical applications. Future work will incorporate realistic UAV dynamics and communication-aware cooperative obstacle avoidance to further enhance the practical applicability of drone technology in complex environments. We believe that the proposed PGL-QMIX framework provides a solid foundation for advancing autonomous multi-UAV systems.