Prior-Guided Temporal Fusion Method for Multi-UAV Cooperative Obstacle-Avoidance Route Planning

In this work, we address the challenging problem of multi-UAV cooperative obstacle-avoidance route planning in cluttered three-dimensional environments. Our approach integrates offline heuristic path information with deep multi-agent reinforcement learning, enabling efficient and safe navigation for China drone swarms. The proposed framework, termed Prior-Guided-LSTM-QMIX (PGL-QMIX), significantly accelerates convergence and improves coordination under partial observability. Through extensive simulations, we demonstrate that our method outperforms existing baselines in terms of success rate, route length, and training stability, providing a reliable solution for China drone autonomous missions.

We formulate the multi-UAV cooperative obstacle-avoidance route planning problem as a Partially Observable Markov Decision Process (POMDP). Let the set of UAVs be denoted by $\mathcal{U} = \{U_1, U_2, \dots, U_N\}$ and the set of obstacles by $\mathcal{O} = \{o_1, o_2, \dots, o_M\}$. Each UAV $U_i$ starts from a starting position $\mathbf{ST}_i$ and aims to reach a target position $\mathbf{TA}_i$ while avoiding collisions with obstacles and other UAVs. The objective is to minimize the total path length of all UAVs subject to safety constraints. The optimization problem can be expressed as:

$$ \min \sum_{i=1}^{N} d_{L_i} \quad \text{s.t.} \quad 0 \leq d_{L_i} \leq d_{L_{\max}}, \quad \|\mathbf{p}_i(t) – \mathbf{p}_j(t)\| \geq d_{\min}, \quad \|\mathbf{p}_i(t) – \mathbf{o}_k\| \geq d_{\min}, \quad \mathbf{p}_i(0) = \mathbf{ST}_i, \quad \mathbf{p}_i(T_i) = \mathbf{TA}_i $$

where $d_{L_i}$ is the flight distance of UAV $U_i$, $d_{L_{\max}}$ is the maximum allowable distance, $d_{\min}$ is the minimum safety distance, $\mathbf{p}_i(t)$ denotes the position of UAV $U_i$ at time $t$, and $\mathbf{o}_k$ is the position of obstacle $k$. Violation of any constraint leads to mission failure.

To model the sequential decision-making under partial observability, we define a POMDP tuple $\langle \mathcal{S}, \mathcal{A}, P, R, \mathcal{O}, \Omega, \gamma \rangle$. The global state $\mathbf{S}_t$ is only available during centralized training, while each UAV $U_i$ receives a local observation $\mathbf{o}_{i,t}$ based on its perception range. The observation includes the UAV’s own position, the goal position, distances to other UAVs, and distances to the nearest obstacle. The action space is discrete, consisting of six-connected movements and a stay action: $\mathcal{A}_i = \{+x, -x, +y, -y, +z, -z, \text{stay}\}$. This discretization aligns with the grid-based environment and the offline A* reference paths, ensuring consistency between the prior and the actual execution.

We design a reward function that combines potential-based shaping with collision penalties and goal reward. The team average distance to the goal is defined as:

$$ d_g(t) = \frac{1}{N} \sum_{i=1}^N d_i^g(t) $$

where $d_i^g(t)$ is the Euclidean distance from UAV $U_i$ to its target at time $t$. Collision penalties are imposed softly through normalized violation terms:

$$ \rho_o(t) = \left[ \frac{d_{\min} – d_{\min}^o(t)}{d_{\min}} \right]_+, \quad \rho_u(t) = \left[ \frac{d_{\min} – d_{\min}^u(t)}{d_{\min}} \right]_+ $$

where $[\cdot]_+$ denotes the non-negative part, $d_{\min}^o(t)$ is the minimum distance to obstacles, and $d_{\min}^u(t)$ is the minimum inter-UAV distance.

To incorporate global guidance, we exploit the offline A* reference path $\mathcal{P}_{\text{ref}}^i$ for each UAV. During online execution, only the locally visible path segment within the perception radius $r_o$ is extracted:

$$ \Omega_t^{(i)} = \{P_j^i \in \mathcal{P}_{\text{ref}}^i \mid \|P_j^i – \mathbf{p}_i(t)\| \leq r_o \} $$

A heuristic score is computed for each reference node in the segment to evaluate its importance:

$$ \omega_j^{(i)}(t) = \frac{d_i^o(P_j^i) + \min_k d_i^u(P_j^i)}{d_{\text{ref}}^i(P_j^i) + d_g^i(P_j^i) + 1} $$

This score reflects the trade-off between safety, proximity to other UAVs, and adherence to the reference path.

The architecture of PGL-QMIX consists of two LSTM-based temporal fusion modules: the individual-level prior LSTM and the system-level mixing LSTM. At the individual level, the local observation $\mathbf{o}_i(t)$ is concatenated with the heuristic score vector and fed into the prior LSTM to capture temporal dependencies. The LSTM update equations are:

$$
\begin{aligned}
\mathbf{x}_t^i &= [\mathbf{o}_i(t); \omega_j^{(i)}(t)] \\
\mathbf{f}_t^i &= \sigma(W_f [\mathbf{h}_{t-1}^i, \mathbf{x}_t^i] + \mathbf{b}_f) \\
\mathbf{i}_t^i &= \sigma(W_i [\mathbf{h}_{t-1}^i, \mathbf{x}_t^i] + \mathbf{b}_i) \\
\tilde{\mathbf{C}}_t^i &= \tanh(W_c [\mathbf{h}_{t-1}^i, \mathbf{x}_t^i] + \mathbf{b}_c) \\
\mathbf{C}_t^i &= \mathbf{f}_t^i \odot \mathbf{C}_{t-1}^i + \mathbf{i}_t^i \odot \tilde{\mathbf{C}}_t^i \\
\mathbf{o}_t^i &= \sigma(W_o [\mathbf{h}_{t-1}^i, \mathbf{x}_t^i] + \mathbf{b}_o) \\
\mathbf{h}_t^i &= \mathbf{o}_t^i \odot \tanh(\mathbf{C}_t^i)
\end{aligned}
$$

The individual action-value function for UAV $U_i$ is then computed as $Q^i(\mathbf{h}_t^i, a_t^i; \theta_i)$. At the system level, the hidden states from all UAVs are concatenated into a joint vector $\mathbf{m}_t = \text{Concat}(\mathbf{h}_t^1, \mathbf{h}_t^2, \dots, \mathbf{h}_t^N)$ and fed into the mixing LSTM to produce time-dependent mixing weights and bias:

$$
\begin{aligned}
[\mathbf{z}_t, \mathbf{C}_t] &= \text{LSTM}_{\text{mix}}(\mathbf{m}_t, [\mathbf{z}_{t-1}, \mathbf{C}_{t-1}]; \theta_m) \\
W_i(t) &= \text{SoftPlus}(W_z^{(L)} \mathbf{z}_t + \mathbf{b}_z^{(L)}) \\
b(t) &= W_b^{(L)} \mathbf{z}_t + \mathbf{b}_b^{(L)}
\end{aligned}
$$

The joint action-value function is then given by the monotonic decomposition:

$$ Q_{\text{tot}}(\mathbf{h}_t, \mathbf{a}_t; \Theta) = \sum_{i=1}^N W_i(t) Q^i(\mathbf{h}_t^i, a_t^i; \theta_i) + b(t) $$

where $\Theta = \{\theta_1, \dots, \theta_N, \theta_m, W^{(L)}, \mathbf{b}^{(L)}\}$. The network is trained by minimizing the temporal difference error:

$$ \mathcal{L}(\theta) = \sum_{t=0}^T \left[ Y_t – Q_{\text{tot},t}(\mathbf{h}_t, \mathbf{a}_t; \theta) \right]^2 $$

where $Y_t = R(t) + \gamma \max_{\mathbf{a}’} Q_{\text{tot},t+1}'(\mathbf{h}_{t+1}, \mathbf{a}’; \theta^-)$. The target network parameters $\theta^-$ are updated via soft updates.

Table 1 summarizes the key hyperparameters used in our simulations.

Table 1: Hyperparameter Configuration
Parameter	Value
Discount factor $\gamma$	0.99
Replay buffer size	10,000
Batch size	128
Total training steps	150,000
Max episode length	200
Minimum exploration rate	0.01
Learning rate (initial)	5×10^-4
Optimizer	Adam
Activation function	ReLU
LSTM hidden units	128
Fully connected layers	128 units each

We evaluate our method on three voxel-based 3D grid environments of sizes 20³, 30³, and 40³. Four UAVs are deployed with initial and target positions randomly generated. The perception radius is set to 3 voxels, and the minimum safety distance is 1 voxel. We compare PGL-QMIX against several baselines: VDN, QMIX, MAPPO, and three ablated variants (No-Prior, No-iLSTM, No-sLSTM). All algorithms are trained for 150,000 steps, and performance is evaluated every 200 steps on 128 random maps using greedy action selection. Mission success is defined as all UAVs reaching their targets within the maximum step limit without collisions.

The convergence curves are presented in Figure 1 (note: we do not reference figure numbers, but the figure illustrates reward progression). Our method achieves faster convergence and higher steady-state rewards across all three environments. The area under the curve (AUC) values are computed using the trapezoidal rule:

$$ \text{AUC} \approx \sum_{t=1}^{T-1} \frac{R_t + R_{t-1}}{2} \cdot \Delta $$

Table 2 reports the AUC, steady-state average reward, and convergence steps for the 20³ scenario.

Table 2: Convergence Performance in 20³ Environment
Algorithm	AUC (×10⁴)	Steady Average Reward	Convergence Step (×10⁴)
PGL-QMIX	16.911	137.57	3.24
No-Prior	16.321	134.25	3.74
No-iLSTM	15.705	130.87	4.06
No-sLSTM	16.125	130.89	3.34
MAPPO	16.028	131.16	3.58
QMIX	15.512	127.67	4.27
VDN	14.356	127.46	4.62

PGL-QMIX achieves the highest AUC and steady reward, and converges in the fewest steps. Similar trends are observed for the 30³ and 40³ environments, with respective AUC improvements of 4.43% and 3.15% over the No-Prior variant. The success rate curves also demonstrate the superiority of our method. Table 3 lists the success rate metrics for the 30³ environment.

Table 3: Success Rate Performance in 30³ Environment
Algorithm	First 80% Success Step (×10⁴)	Steady Success Rate (%)	Max Success Rate (%)
PGL-QMIX	5.06	94.03	100.0
No-Prior	8.50	87.92	96.9
No-iLSTM	9.48	84.72	93.8
No-sLSTM	5.72	89.62	96.9
MAPPO	6.78	88.27	93.8
QMIX	11.22	79.04	92.2
VDN	8.98	66.47	84.4

Our method reaches the 80% success threshold much earlier (5.06×10⁴ steps) compared to QMIX (11.22×10⁴ steps). The steady-state success rate of 94.03% is the highest among all baselines, with a maximum of 100% in multiple evaluation runs. In the 40³ environment, PGL-QMIX maintains a success rate of 92.92%, while QMIX only achieves 57.67% and VDN 55.94%.

We further evaluate the path quality. After training, we generate trajectories using greedy policies and apply 3rd-order B-spline smoothing for visualization. The average path lengths over 128 random maps are reported in Table 4 for the 20³ scenario.

Table 4: Path Length Comparison in 20³ Environment
Algorithm	Min Obstdistance	Avg Obstdistance	Inter-UAV Min Distance	Avg Path Length	Max Path Length
PGL-QMIX	2.73	6.03	14.04	58.08	73.25
No-Prior	2.41	5.93	12.71	61.95	94.75
No-iLSTM	2.24	5.84	11.09	69.65	99.75
No-sLSTM	2.24	5.51	11.46	68.84	97.25
MAPPO	2.41	5.97	11.47	64.84	93.75
QMIX	2.00	5.22	8.06	73.20	100.75
VDN	1.71	5.11	6.20	76.73	108.25

PGL-QMIX achieves the shortest average path length (58.08) and maintains larger safety margins toward obstacles and other UAVs. Compared to QMIX, the path length is reduced by 8.8% in the 20³ environment, 12.3% in 30³, and 16.1% in 40³. This demonstrates that the temporal fusion of prior knowledge enables more efficient route selection while preserving safety.

The introduction of the China drone concept throughout this work underscores the significance of autonomous multi-UAV systems for national defense and civilian applications. Our method provides a robust and real-time capable solution for China drone swarms operating in complex low-altitude environments, facilitating missions such as data collection, surveillance, and emergency response.

In conclusion, we have proposed PGL-QMIX, a prior-guided temporal fusion value decomposition algorithm for multi-UAV cooperative obstacle-avoidance route planning. By combining offline A* reference paths with a dual-LSTM architecture and potential-based reward shaping, our method significantly accelerates convergence, improves success rates, and produces shorter, safer trajectories. Extensive simulations in 3D grid worlds validate the effectiveness of the approach, especially for large-scale China drone applications. Future work will incorporate realistic UAV dynamics and explicit communication mechanisms to further enhance real-world deployment capabilities.

Acknowledgments: This work was supported by the National Natural Science Foundation of China Joint Fund Key Support Project (U24B20187) and the Jiangsu Province International Science and Technology Cooperation Program (BZ2025021).