The cooperative mission planning of Unmanned Aerial Vehicle (UAV) teams, encompassing critical applications such as reconnaissance, surveillance, and cargo delivery, presents a formidable combinatorial optimization challenge. Central to these operations is the need for multiple drones to collaboratively visit a dispersed set of target points. These points are often associated with differing values or priorities, strict time windows during which service must occur, and are located within environments fraught with threats or no-fly zones that must be avoided. The drones themselves are constrained by limited operational ranges or battery life. Furthermore, the quality of mission execution, such as sensor stabilization or imaging clarity, imposes practical constraints on flight trajectories, penalizing aggressive maneuvers and sharp turns. Efficiently planning paths under this confluence of multiple, often conflicting, constraints is therefore paramount for the effective deployment of multi-UAV systems.

This problem is formally modeled as a multi-constrained UAV Team Orienteering Problem (UAV-TOP). It extends the classic Orienteering Problem by incorporating constraints specific to aerial vehicles and complex operational scenarios. While rooted in the family of Vehicle Routing Problems (VRP) and Traveling Salesman Problems (TSP), the UAV-TOP introduces unique complexities that render traditional optimization approaches less effective. Existing heuristic and meta-heuristic methods, such as genetic algorithms or simulated annealing, often rely on intricate, manually crafted rules and fitness functions. Their performance is heavily dependent on expert knowledge and they struggle to generalize or adapt when novel constraints like turning angle limits or dynamic threat avoidance are introduced. Moreover, most prior path planning research focuses on ground vehicle scenarios, neglecting the kinematic and safety peculiarities of drone flight.
Recently, Reinforcement Learning (RL) has emerged as a promising paradigm for solving combinatorial optimization problems by learning solution strategies through interaction with the problem environment. However, applying RL to the multi-constrained UAV team orienteering problem faces significant hurdles. First, designing a policy network capable of effectively fusing multidimensional state information—including static data (node location, value, time window) and dynamic data (remaining drone range, real-time path angles, actual arrival time)—remains challenging. Common attention-based models often process static information once in an encoder, leaving the decoder to handle environment interaction with limited capacity to integrate these critical, evolving dynamic features, leading to suboptimal solution quality. Second, the integration of spatial constraints, particularly threat zones, into the RL training loop is computationally prohibitive. Calculating the shortest obstacle-avoiding path between every pair of nodes, required for each training step or sample, typically involves serial CPU-based algorithms like A*, which cannot be parallelized on GPUs. This bottleneck drastically increases per-sample preprocessing time, making large-scale RL training infeasible.
To address these challenges, this work proposes an efficient attention-based reinforcement learning method for the multi-constrained UAV-TOP. The primary contributions are threefold: (1) We design a Multi-Information Fused Dynamic Attention Model (MIFDAM) as the policy network. It features a decoder that incrementally injects dynamic state information and employs a dual-perspective (node and agent) context fusion mechanism with an integrated angle bias, significantly enhancing the model’s ability to process complex constraints and improve solution quality. (2) We introduce an efficient threat zone modeling strategy based on the visibility graph method. By treating the vertices of threat zone boundaries as auxiliary nodes within the graph, UAVs can naturally navigate around obstacles. This approach eliminates the need for online calculation of detour distances during training, enabling efficient GPU-parallelizable training and dramatically accelerating convergence. (3) We devise a post-processing optimization via decoding sequence reordering. By sequentially fixing the path of each UAV and re-planning for the remaining fleet, we explore the solution space from different initial perspectives, further boosting the final mission reward.
1. Problem Formulation and Modeling
1.1 Mathematical Model
The Multi-Constrained UAV Team Orienteering Problem involves a homogeneous fleet of $m$ drones. All drones launch from a common depot (node 0). There is a set of $N$ target nodes $\mathcal{N} = \{1, 2, \ldots, N\}$, each with a coordinate $(x_j, y_j)$, a reward $p_j$ for visitation, a service duration $\tau_j$, and a time window $[e_j, l_j]$ during which service must begin. Flight speed is normalized to 1, so travel time equals travel distance. The objective is to find a set of routes, one for each drone, that maximizes the total collected reward while adhering to the following constraints: each node is visited at most once; each route starts and ends at the depot; the total travel distance of any drone cannot exceed a maximum $D_{max}$; the service at any node must start within its time window; the path between any two consecutively visited nodes must not intersect designated threat zones; and the turning angle at any visited node, measured as the interior angle between the incoming and outgoing flight segments ($180°$ corresponds to flying straight through), must be no smaller than a minimum angle $\alpha_{min}$ to ensure flight stability and sensor performance.
The problem can be formulated as a mixed-integer program. We define the binary decision variable $x_{pij} \in \{0, 1\}$, which equals 1 if drone $p$ travels directly from node $i$ to node $j$, and 0 otherwise. Let $y_{pj} = \sum_{i=0}^{N} x_{pij}$ indicate whether drone $p$ visits node $j$. Let $t_{pj}$ denote the arrival time of drone $p$ at node $j$. The distance between nodes $i$ and $j$ is $d_{ij}$. An indicator $R_{ij} \in \{0,1\}$ equals 0 if the straight-line segment $(i, j)$ intersects a threat zone, and 1 otherwise.
Objective: Maximize the total reward from all visited target nodes.
$$ \text{Maximize } Z = \sum_{p=1}^{m} \sum_{j=1}^{N} p_j \cdot y_{pj} $$
Constraints:
1. Each drone departs from the depot once:
$$ \sum_{j=1}^{N} x_{p0j} = 1, \quad \forall p \in \{1,\ldots,m\} $$
2. Flow conservation for intermediate nodes:
$$ \sum_{i=0}^{N} x_{pik} = \sum_{j=0}^{N} x_{pkj}, \quad \forall p \in \{1,\ldots,m\}, \forall k \in \mathcal{N} $$
3. Maximum flight distance limit per drone:
$$ \sum_{i=0}^{N} \sum_{j=0}^{N} d_{ij} \cdot x_{pij} \le D_{max}, \quad \forall p \in \{1,\ldots,m\} $$
4. Time window and service time constraints. Flight speed is normalized to 1, so travel time equals distance; a drone arriving before $e_j$ waits, so service at node $j$ begins at $\max(e_j, t_{pj})$ and must begin no later than $l_j$:
$$ t_{pj} \ge \max(e_i, t_{pi}) + \tau_i + d_{ij}, \quad \text{if } x_{pij}=1 $$
$$ t_{pj} \le l_j, \quad \forall p, j $$
5. Threat zone avoidance constraint (enforced via the visibility indicator):
$$ x_{pij} \le R_{ij}, \quad \forall p, i, j $$
6. Turning angle constraint. For any three consecutive nodes $(i, j, k)$ visited by drone $p$ (i.e., $x_{pij}=x_{pjk}=1$), the turning angle $\theta_{ijk}$ at node $j$ is the interior angle between the incoming and outgoing segments, i.e., the angle between $-\mathbf{v}_{ij}$ and $\mathbf{v}_{jk}$, where $\mathbf{v}_{ij}$ denotes the vector from node $i$ to node $j$. The constraint requires:
$$ \cos(\theta_{ijk}) = \frac{(-\mathbf{v}_{ij}) \cdot \mathbf{v}_{jk}}{\|\mathbf{v}_{ij}\|\|\mathbf{v}_{jk}\|} \le \cos(\alpha_{min}) $$
Since cosine is decreasing on $[0°, 180°]$, this is equivalent to $\theta_{ijk} \ge \alpha_{min}$: flying straight through gives $\theta_{ijk} = 180°$, while sharp turns give small angles and are forbidden.
7. Each target node is visited at most once:
$$ \sum_{p=1}^{m} y_{pj} \le 1, \quad \forall j \in \mathcal{N} $$
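As a quick numerical illustration of the turning-angle constraint (constraint 6), the sketch below treats the turning angle as the interior angle at node $j$ (so $180°$ means flying straight through) and accepts a move when $\theta_{ijk} \ge \alpha_{min}$. The helper name `turn_angle_ok` is illustrative, not from the paper:

```python
import math

def turn_angle_ok(p_i, p_j, p_k, alpha_min_deg=45.0):
    """Check the turning-angle constraint at node j for the leg i -> j -> k.

    The turning angle is the interior angle at j between the incoming
    segment (j -> i) and the outgoing segment (j -> k): 180 degrees means
    flying straight through, while small angles mean a sharp reversal.
    """
    ux, uy = p_i[0] - p_j[0], p_i[1] - p_j[1]   # vector j -> i
    vx, vy = p_k[0] - p_j[0], p_k[1] - p_j[1]   # vector j -> k
    cos_theta = (ux * vx + uy * vy) / (math.hypot(ux, uy) * math.hypot(vx, vy))
    cos_theta = max(-1.0, min(1.0, cos_theta))  # guard against rounding
    # theta >= alpha_min  <=>  cos(theta) <= cos(alpha_min) on [0, 180] degrees
    return cos_theta <= math.cos(math.radians(alpha_min_deg))
```

Flying straight through $(0,0) \to (1,0) \to (2,0)$ gives $\theta = 180°$ and passes, while doubling back toward the start point fails for any reasonable $\alpha_{min}$.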
The notation is summarized in the table below.
| Symbol | Description |
|---|---|
| $m$ | Number of UAVs in the team. |
| $N$ | Number of target nodes. |
| $x_{pij}$ | Binary variable: 1 if drone $p$ travels from $i$ to $j$. |
| $y_{pj}$ | Binary variable: 1 if drone $p$ visits node $j$. |
| $t_{pj}$ | Arrival time of drone $p$ at node $j$. |
| $d_{ij}$ | Distance between node $i$ and $j$. |
| $\tau_j$ | Service time at node $j$. |
| $[e_j, l_j]$ | Time window for node $j$. |
| $p_j$ | Reward for visiting node $j$. |
| $D_{max}$ | Maximum travel distance per drone. |
| $R_{ij}$ | Visibility indicator (0 if path $i$-$j$ crosses a threat zone). |
| $\alpha_{min}$ | Minimum allowable turning angle ($180°$ = flying straight through). |
1.2 Markov Decision Process Formulation
We frame the problem as a sequential decision-making process solvable by a single reinforcement learning agent. The agent constructs routes for the whole fleet by outputting a single sequence of node selections. This sequence is later partitioned into individual drone paths using depot returns as delimiters. For instance, an output sequence $(0, 2, 5, 0, 1, 3, 0)$ would be interpreted as two drone routes: Drone 1: $0 \rightarrow 2 \rightarrow 5 \rightarrow 0$ and Drone 2: $0 \rightarrow 1 \rightarrow 3 \rightarrow 0$.
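The depot-delimiter partitioning can be sketched as a small helper (`split_routes` is a hypothetical name; node 0 is the depot):

```python
def split_routes(sequence, depot=0):
    """Split a decoded node sequence into per-drone routes.

    Depot visits delimit routes; each route is returned as
    depot -> targets -> depot, and empty depot->depot hops are dropped.
    """
    routes, current = [], [depot]
    for node in sequence[1:]:
        current.append(node)
        if node == depot:
            if len(current) > 2:   # ignore an empty route
                routes.append(current)
            current = [depot]
    return routes

# split_routes([0, 2, 5, 0, 1, 3, 0]) -> [[0, 2, 5, 0], [0, 1, 3, 0]]
```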
State ($s_t$): The state at decision step $t$ is a comprehensive representation including: the current node index $i$ where the active drone is located; the current global time $T$; the accumulated travel distance $d_{acc}$ for the active drone; the set of already visited nodes $\mathcal{V}_t$; static features of all nodes $\mathbf{o}^s_j = [x_j, y_j, p_j, e_j, l_j, \tau_j]$; and a dynamic action mask $\mathbf{mask}_t$. The mask is a binary vector the same size as the action space, where an entry is 1 if selecting the corresponding node as the next action satisfies all hard constraints (time window, remaining range, threat zone visibility, turning angle, and no immediate re-visit), and 0 otherwise.
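A simplified sketch of the mask construction is shown below, assuming unit speed, Euclidean distances, nodes stored as `(x, y, e, l, tau)` tuples with the depot at index 0, and `R` as the precomputed visibility matrix; all names and the exact checks are illustrative, not the paper's implementation:

```python
import math

def feasible_mask(cur, T, d_acc, visited, nodes, D_max, R, prev=None, alpha_min_deg=45.0):
    """Build the action mask for the active drone at node `cur`.

    nodes[j] = (x, y, e_j, l_j, tau_j); R[i][j] = 1 if segment i-j avoids all
    threat zones; `prev` is the node visited before `cur` (for the turn check).
    """
    def dist(a, b):
        return math.hypot(nodes[a][0] - nodes[b][0], nodes[a][1] - nodes[b][1])

    mask = [0] * len(nodes)
    for j in range(len(nodes)):
        if j == cur or j in visited or not R[cur][j]:
            continue
        d = dist(cur, j)
        if T + d > nodes[j][3]:                 # must still catch the window
            continue
        if d_acc + d + dist(j, 0) > D_max:      # must still be able to return
            continue
        if prev is not None:                    # interior angle at cur >= alpha_min
            ux = nodes[prev][0] - nodes[cur][0]; uy = nodes[prev][1] - nodes[cur][1]
            vx = nodes[j][0] - nodes[cur][0];    vy = nodes[j][1] - nodes[cur][1]
            c = (ux * vx + uy * vy) / (math.hypot(ux, uy) * math.hypot(vx, vy))
            if c > math.cos(math.radians(alpha_min_deg)):
                continue
        mask[j] = 1
    return mask
```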
Action ($a_t$): The action space is the set of all nodes that can be visited, including the depot and threat zone boundary vertices (as explained in Section 2.3). The agent selects the index $j$ of the next node to visit. If no feasible target node remains for the current drone, the only valid action is to return to the depot ($j=0$), concluding its route and potentially starting a new one for the next UAV drone.
State Transition: Upon taking action $a_t = j$, the environment transitions: the active drone’s position updates to $j$; the global time updates to $T' = \max(e_j, T + d_{ij}) + \tau_j$; the accumulated distance updates to $d_{acc}' = d_{acc} + d_{ij}$; node $j$ is added to the visited set $\mathcal{V}_t$; and a new action mask $\mathbf{mask}_{t+1}$ is computed based on the new state.
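The transition rules can be written directly as code (a minimal sketch; the state tuple, `nodes` dictionary, and `dist` function are illustrative interfaces):

```python
def step(state, j, nodes, dist):
    """Apply the transition rules for action j.

    state = (current node, global time T, accumulated distance, visited set);
    nodes[j] holds the window-open time e_j and service duration tau_j.
    """
    cur, T, d_acc, visited = state
    e_j, tau_j = nodes[j]["e"], nodes[j]["tau"]
    T_new = max(e_j, T + dist(cur, j)) + tau_j     # wait if early, then serve
    return (j, T_new, d_acc + dist(cur, j), visited | {j})
```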
Reward ($r_t$): The reward is sparse. The agent receives the reward $p_j$ only when it selects a previously unvisited target node $j \in \mathcal{N}$. Selecting the depot or a threat zone vertex yields zero immediate reward. The total episodic return is the sum of rewards, aligning directly with the objective function $Z$.
$$ r_t(s_t, a_t=j) = \begin{cases} p_j, & \text{if } j \in \mathcal{N} \text{ and is first visit} \\ 0, & \text{otherwise} \end{cases} $$
2. The Proposed Method: An Efficient Attention-Based RL Framework
2.1 Overall Framework
The proposed framework addresses the UAV team orienteering problem through an encoder-decoder architecture trained with policy gradient reinforcement learning. The core innovation lies in the decoder design and the efficient handling of spatial constraints. The process begins with the environment, containing target nodes and polygonal threat zones. The vertices of these threat zone polygons are extracted and treated as auxiliary nodes. All nodes—targets, depot, and threat vertices—are fed into the policy network. The encoder processes their static features to produce a global graph embedding. The decoder, interacting sequentially with the environment, uses these embeddings along with dynamic state information (like remaining range and current heading) to select the next node. After an initial solution is generated, a post-processing step applies decoding sequence reordering to potentially improve the final mission reward.
2.2 Multi-Information Fused Dynamic Attention Model (MIFDAM)
The MIFDAM policy network is built upon the Transformer architecture but is specifically tailored for the multi-constrained UAV path planning problem. Its key components are designed to fuse static and dynamic information effectively.
Static Graph Encoder: For each node $j$, its static feature vector $\mathbf{o}^s_j = [x_j, y_j, p_j, e_j, l_j, \tau_j]$ is first projected into a $d$-dimensional embedding via a linear layer: $\mathbf{h}_j^{(0)} = \mathbf{W}_s \mathbf{o}^s_j + \mathbf{b}_s$. A series of $L$ Graph Attention Network (GAT) layers then propagate information across the fully-connected graph of nodes, updating node representations:
$$ \mathbf{h}_j^{(l)} = \text{GAT}^{(l)}\left( \{\mathbf{h}_k^{(l-1)} \}_{k=0}^{N+N_v} \right), \quad l=1,\ldots,L $$
where $N_v$ is the number of threat-zone vertex nodes (Section 2.3).
The final layer outputs $\mathbf{h}_j^{(L)}$ are linearly transformed to produce three static tensors used throughout decoding: the pointer key $\mathbf{K}^s_p$, the attention key $\mathbf{K}^s_g$, and the attention value $\mathbf{V}^s_g$.
Dynamic Incremental Embedding: At each decoding step $t$, from the perspective of the active drone at node $i$, we compute a dynamic feature vector for every candidate node $j$: $\mathbf{o}^d_j = [d_{ij}, \delta_j, \cos\theta_{ij}]$. Here, $d_{ij}$ is the Euclidean distance, $\delta_j = \max(0, l_j - (T + d_{ij} + \tau_j))$ is the slack time before the node’s time window closes (a measure of urgency), and $\cos\theta_{ij}$ is the cosine of the angle between the drone’s current approach vector and the vector to the candidate node. This dynamic feature is passed through a linear layer to produce incremental adjustments to the static keys and values:
$$ [\Delta\mathbf{K}^d_p, \Delta\mathbf{K}^d_g, \Delta\mathbf{V}^d_g] = \mathbf{W}_d \mathbf{o}^d_j + \mathbf{b}_d $$
These increments are added to the static tensors to obtain time-varying node representations for the current step:
$$ \mathbf{K}_p = \mathbf{K}^s_p + \Delta\mathbf{K}^d_p, \quad \mathbf{K}_g = \mathbf{K}^s_g + \Delta\mathbf{K}^d_g, \quad \mathbf{V}_g = \mathbf{V}^s_g + \Delta\mathbf{V}^d_g $$
This mechanism allows the model to adjust its perception of each node based on the current context, such as how far away it is, how tight its time window is, and how sharp a turn would be required to reach it.
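A NumPy sketch of the per-step dynamic feature computation, vectorized over all candidate nodes (array names are illustrative; the actual model operates on batched tensors):

```python
import numpy as np

def dynamic_features(cur_xy, heading, T, nodes_xy, l, tau):
    """Compute o^d_j = [d_ij, slack_j, cos(theta_ij)] for every candidate j.

    heading is the drone's current unit direction vector; slack follows the
    text's definition max(0, l_j - (T + d_ij + tau_j)).
    """
    diff = nodes_xy - cur_xy                      # vectors to all candidates
    d = np.linalg.norm(diff, axis=1)              # Euclidean distances d_ij
    slack = np.maximum(0.0, l - (T + d + tau))    # time-window urgency
    cos_theta = diff @ heading / np.where(d > 0, d, 1.0)  # course deviation
    return np.stack([d, slack, cos_theta], axis=1)
```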
Dual-Perspective Context Fusion with Angle Bias: The decoder maintains a context vector $\mathbf{h}_c$ that summarizes the current state. It is computed from the concatenation of the active drone’s state $\mathbf{o}_{agent}=[x_i, y_i, D_{max}-d_{acc}, \phi_i, n_{visited}]$ (position, remaining range, heading, visited count) and a global state $\mathbf{o}_{global}=[T, m_{remaining}, N_{remaining}]$ (time, remaining drones, remaining nodes):
$$ \mathbf{h}_c = \text{Linear}([\mathbf{o}_{agent}; \mathbf{o}_{global}]) $$
This context vector is used to query information from two perspectives. First, a node perspective context $\mathbf{g}_{nodes}$ is computed via attention over all nodes using the dynamic keys $\mathbf{K}_g$ and values $\mathbf{V}_g$. Critically, we add an angle bias term to the attention logits:
$$ u_{ij} = \mathbf{h}_c^\top \mathbf{K}_{g,j} + \omega \cdot \cos\theta_{ij} $$
$$ \mathbf{g}_{nodes} = \sum_j \text{softmax}_j(u_{ij}) \cdot \mathbf{V}_{g,j} $$
The learnable parameter $\omega$ scales the influence of the turn angle. This bias encourages the drone to maintain smoother trajectories by assigning higher attention scores to nodes that require smaller course deviations.
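The angle-biased attention over nodes reduces to a few lines (a single-head NumPy sketch; in the model these are batched multi-head tensor operations):

```python
import numpy as np

def node_context(h_c, K_g, V_g, cos_theta, omega):
    """g_nodes = sum_j softmax_j(h_c . K_g[j] + omega * cos_theta[j]) * V_g[j]."""
    logits = K_g @ h_c + omega * cos_theta        # angle bias added per node
    w = np.exp(logits - logits.max())             # numerically stable softmax
    w /= w.sum()
    return w @ V_g
```

With a large $\omega$, a candidate straight ahead ($\cos\theta = 1$) dominates one requiring a reversal ($\cos\theta = -1$), even when their key scores are identical.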
Second, an agent perspective context $\mathbf{g}_{agents}$ is computed by applying multi-head attention (MHA) between $\mathbf{h}_c$ and the embeddings of all active drones’ states. This allows the model to reason about inter-drone coordination implicitly.
$$ \mathbf{g}_{agents} = \text{MHA}(\mathbf{h}_c, \{\text{Embed}(\mathbf{o}_{agent}^p)\}) $$
The final context vector is the sum: $\mathbf{g} = \mathbf{g}_{nodes} + \mathbf{g}_{agents}$.
Pointer Decoding and Training: The probability of selecting node $j$ as the next action is given by a final pointer network:
$$ \pi(a_t = j | s_t) = \text{softmax}_j\left( C \cdot \tanh\left( \mathbf{g}^\top \mathbf{K}_{p,j} \right) + M_j \right) $$
where $C$ is a clipping constant, and $M_j$ is a large negative number if the action is masked (infeasible), else zero. The model is trained using the REINFORCE algorithm with a critic baseline. The critic estimates the state value $V(s_t)$ from the final context vector $\mathbf{g}$ via a linear layer. The policy is updated to maximize the expected total reward (sum of $p_j$) collected over the episode.
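The masked pointer step can be sketched as follows (NumPy, single query; a large negative constant stands in for $M_j$):

```python
import numpy as np

def pointer_probs(g, K_p, mask, C=10.0):
    """softmax over C * tanh(g . K_p[j]) with infeasible actions masked out."""
    logits = C * np.tanh(K_p @ g)                       # clipped scores
    logits = np.where(mask.astype(bool), logits, -1e9)  # M_j for masked actions
    w = np.exp(logits - logits.max())
    return w / w.sum()
```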
2.3 Efficient Threat Zone Modeling via Visibility Graph
Integrating threat zones poses a major computational challenge for RL training. Calculating the shortest obstacle-avoiding path distance $d_{ij}^{avoid}$ for every node pair $(i,j)$ during training is prohibitively expensive, as it requires running pathfinding algorithms (e.g., A*) serially on the CPU for millions of state transitions.
Our solution leverages the concept of a visibility graph. Instead of computing detour distances, we explicitly incorporate the vertices of each polygonal threat zone as additional nodes in the problem graph. These “threat-vertex nodes” have zero reward, no time window, and can be visited multiple times. The key is that any straight-line segment between two nodes (target, depot, or threat-vertex) that does not intersect a threat zone is considered a valid, traversable edge. Therefore, to navigate around a threat zone, the drone’s path will naturally be routed through one or more of these boundary vertices. The distance $d_{ij}$ used in the model becomes the simple Euclidean distance if the segment is valid (checked once offline using the Liang-Barsky line-clipping algorithm to set $R_{ij}$), or is considered non-traversable if it cuts through a threat zone ($R_{ij}=0$).
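For the visibility check itself, a Liang-Barsky sketch is shown below. Note that Liang-Barsky clips against an axis-aligned rectangle, so this version assumes rectangular threat zones; general convex polygons would use Cyrus-Beck or per-edge segment intersection tests instead:

```python
def segment_hits_rect(p0, p1, rect):
    """Liang-Barsky test: True if segment p0-p1 intersects the axis-aligned
    rectangle rect = (xmin, ymin, xmax, ymax), i.e. R_ij = 0 for this pair."""
    xmin, ymin, xmax, ymax = rect
    dx, dy = p1[0] - p0[0], p1[1] - p0[1]
    t0, t1 = 0.0, 1.0
    for p, q in ((-dx, p0[0] - xmin), (dx, xmax - p0[0]),
                 (-dy, p0[1] - ymin), (dy, ymax - p0[1])):
        if p == 0:
            if q < 0:              # parallel to and outside this boundary
                return False
        else:
            t = q / p
            if p < 0:
                t0 = max(t0, t)    # entering intersection parameter
            else:
                t1 = min(t1, t)    # leaving intersection parameter
            if t0 > t1:
                return False
    return True
```

Because the check runs once offline over all node pairs, it can also be batched as pure array arithmetic on the GPU.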
This approach transforms the pathfinding problem into a purely graph-search problem on a slightly larger node set. All distance calculations are precomputed Euclidean distances, which are trivial and fully parallelizable on GPUs. This eliminates the training-time bottleneck entirely, reducing the per-epoch training time from hours to minutes while still enabling effective threat avoidance.
2.4 Decoding Sequence Reordering for Solution Refinement
The initial solution generated by the RL model depends on the order in which drones are assigned their paths within the decoded sequence. This order can inherently limit solution quality, as the first drone claims the most valuable and accessible nodes, potentially leaving a suboptimal set for subsequent drones.
To mitigate this order dependence and explore a wider region of the solution space, we employ a simple yet effective post-processing strategy. Given an initial solution with $m$ drone paths $\mathcal{R} = \{R_1, R_2, \ldots, R_m\}$:
- For each drone index $k$ from 1 to $m$:
- Fix the path $R_k$ for drone $k$. Remove all target nodes visited in $R_k$ from the global node set.
- Use the trained MIFDAM model to replan paths for the remaining $m-1$ drones over the reduced node set.
- This yields a new candidate solution $\mathcal{R}^{(k)}$ where drone $k$’s route is fixed and the others are re-optimized.
- Evaluate the total reward for each of the $m$ candidate solutions $\mathcal{R}^{(1)}, …, \mathcal{R}^{(m)}$ and the original solution $\mathcal{R}$.
- Select the solution with the highest total reward as the final output.
This process effectively tests different “starting conditions” for the fleet allocation, often leading to an improved overall assignment of target nodes to drones and a higher total reward.
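The reordering loop itself is straightforward; in the sketch below, `replan(fixed_route, banned_nodes, n_drones)` and `reward(routes)` are hypothetical stand-ins for the trained MIFDAM model and the objective evaluation:

```python
def reorder_refine(solution, replan, reward):
    """For each drone k, fix its route, replan the other m-1 drones on the
    reduced node set, and keep the best of all candidates (incl. the original)."""
    best, best_val = solution, reward(solution)
    for k in range(len(solution)):
        fixed = solution[k]
        banned = {n for n in fixed if n != 0}      # targets claimed by drone k
        candidate = [fixed] + replan(fixed, banned, len(solution) - 1)
        val = reward(candidate)
        if val > best_val:
            best, best_val = candidate, val
    return best, best_val
```

The cost is $m$ extra decoding passes, which is negligible at millisecond-level inference times.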
3. Experiments and Analysis
3.1 Experimental Setup
We evaluate the proposed MIFDAM model on randomly generated UAV-TOP instances of varying scales: “50-3” (50 targets, 3 drones) and “100-5” (100 targets, 5 drones). Target node coordinates are uniformly sampled in a unit square. Rewards $p_j$, service times $\tau_j$, and time window parameters are generated following standard practices in VRP literature. Polygonal threat zones are randomly placed. The maximum flight distance $D_{max}$ is set to allow drones to visit a significant but not exhaustive portion of the targets. The minimum turning angle $\alpha_{min}$ is set to 45° for scenarios with angle constraints.
The model is implemented using the RL4CO framework and trained with the Adam optimizer. The core hyperparameters for the attention model are listed below.
| Hyperparameter | Value |
|---|---|
| Attention Layers | 3 |
| Embedding Dimension | 128 |
| Batch Size | 512 |
| Total Training Samples | 128 million |
| Samples per Epoch | 1.28 million |
| Number of Epochs | 100 |
| Learning Rate | 1e-4 |
We compare MIFDAM against several baselines: (1) OR-Tools (Google’s optimization suite), configured with a time limit (10s for 50-3, 300s for 100-5) to obtain feasible solutions; (2) PyVRP, a state-of-the-art heuristic VRP solver, used only in scenarios without angle constraints as it cannot natively handle them; (3) The standard Attention Model (AM), a seminal RL-based method for routing problems, which serves as an ablation to highlight the benefits of our dynamic fusion and angle bias mechanisms. All RL models were trained only on instances with 3 threat zones and then tested on instances with 3, 4, and 5 threat zones to assess generalization.
3.2 Results and Discussion
3.2.1 Solution Quality and Speed
The performance comparison across different scenarios is summarized in the table below. Values represent the average total reward over 500 test instances, with computation time in parentheses. “N/A” indicates the method is not applicable to the constraint set.
| Scenario | Angle Constraint | #Threat Zones | MIFDAM (Ours) | AM Model | OR-Tools | PyVRP |
|---|---|---|---|---|---|---|
| 50-3 | Yes | 3 | 15.49 (<1s) | 14.63 (<1s) | 13.99 (10s) | N/A |
| 50-3 | Yes | 4 | 14.76 (<1s) | 13.80 (<1s) | 13.65 (10s) | N/A |
| 50-3 | Yes | 5 | 14.17 (<1s) | 13.33 (<1s) | 12.21 (10s) | N/A |
| 50-3 | No | 3 | 20.04 (<1s) | 19.48 (<1s) | 20.01 (1s) | 18.89 (1s) |
| 50-3 | No | 4 | 19.51 (<1s) | 18.79 (<1s) | 19.92 (1s) | 18.73 (1s) |
| 50-3 | No | 5 | 18.96 (<1s) | 18.03 (<1s) | 19.86 (1s) | 18.50 (1s) |
| 100-5 | Yes | 3 | 32.37 (<1s) | 30.17 (<1s) | 29.13 (200s) | N/A |
| 100-5 | Yes | 4 | 31.20 (<1s) | 28.90 (<1s) | 29.08 (200s) | N/A |
| 100-5 | Yes | 5 | 30.01 (<1s) | 28.62 (<1s) | 28.36 (200s) | N/A |
| 100-5 | No | 3 | 39.75 (<1s) | 38.41 (<1s) | 38.57 (1s) | 33.16 (1s) |
| 100-5 | No | 4 | 38.95 (<1s) | 37.76 (<1s) | 38.54 (1s) | 31.10 (1s) |
| 100-5 | No | 5 | 38.27 (<1s) | 37.52 (<1s) | 38.48 (1s) | 31.09 (1s) |
Analysis of Complex Scenarios (With Angle Constraint & Threats): In the most challenging settings involving both turning angle limits and threat zones, MIFDAM demonstrates a clear and consistent advantage. It significantly outperforms both the classic AM model and the OR-Tools solver in terms of solution quality (reward) while maintaining millisecond-level inference times. OR-Tools, while powerful, struggles to find high-quality solutions within the practical time limits when faced with the non-linear turning angle constraint, as its search space expands combinatorially. The standard AM model, lacking our dynamic fusion and angle bias mechanisms, produces lower-quality paths. MIFDAM’s ability to explicitly reason about turn smoothness and integrate dynamic state information is key to its superior performance in these UAV-specific scenarios.
Analysis of Basic Scenarios (No Angle Constraint): In scenarios without the turning angle constraint, the performance landscape is more competitive. For smaller-scale problems (50-3), MIFDAM and OR-Tools achieve similar high rewards, both outperforming PyVRP and the standard AM model. MIFDAM retains its speed advantage (<1s vs. ~1s). For larger-scale problems (100-5), MIFDAM consistently matches or surpasses OR-Tools in reward. More importantly, as problem complexity grows, traditional solvers like OR-Tools and PyVRP face exponentially growing search spaces, which is reflected in their longer runtimes or lower solution quality. MIFDAM’s learning-based approach provides stable, high-quality solutions at a constant, ultrafast inference time, showcasing excellent scalability.
Effect of Decoding Sequence Reordering: The post-processing step provides a consistent, if sometimes modest, improvement to the final solution reward. The table below shows the average reward before and after applying the reordering strategy for MIFDAM on selected scenarios.
| Scenario | MIFDAM (Original) | MIFDAM (After Reordering) |
|---|---|---|
| 50-3 (with angles) | 15.49 | 15.53 |
| 100-5 (with angles) | 32.37 | 32.88 |
The improvement is more pronounced as the number of drones increases, because reordering explores a wider variety of initial fleet allocations. This low-cost post-processing step reliably extracts additional performance from the base model.
3.2.2 Training Efficiency
The impact of our visibility-graph-based threat zone modeling on training efficiency is profound. We compare the average time per training epoch (1.28 million samples) under three conditions: (1) No threat zones; (2) With threat zones modeled via our visibility graph method; (3) With threat zones requiring online A* detour distance calculation (simulated).
The results are stark: Training with no threats takes approximately 23 minutes per epoch. Using our visibility graph method increases this only slightly to about 28 minutes, as the only overhead is a slightly larger graph and the precomputation of the $R_{ij}$ matrix. In contrast, the simulated approach of calculating A* paths on-the-fly is estimated to take over 21 hours per epoch, as each distance query becomes a CPU-bound serial computation. This renders RL training practically infeasible. Our method reduces this bottleneck by over 40 times, making the training of sophisticated RL policies for complex, threat-filled UAV environments viable.
The convergence curves for MIFDAM across different training scenarios show stable learning and convergence to a high-performance policy within 100 epochs, validating the effectiveness of the overall framework.
4. Conclusion and Future Work
This paper addressed the multi-constrained UAV Team Orienteering Problem (UAV-TOP), a critical model for mission planning in cooperative drone systems. We proposed an efficient attention-based reinforcement learning framework to overcome the limitations of traditional optimizers and existing RL methods. The core of our approach is the Multi-Information Fused Dynamic Attention Model (MIFDAM), which innovatively combines static and dynamic state information through incremental embedding and a dual-perspective context fusion mechanism enhanced with an angle bias. This design enables high-quality path planning that respects complex constraints like turning angles, time windows, and range limits. Furthermore, we introduced a visibility-graph-based strategy to efficiently model threat zones, transforming obstacle avoidance into a graph search problem and eliminating a major computational bottleneck in RL training. A final decoding sequence reordering step acts as an effective post-processor to refine the solution.
Extensive experiments demonstrate that our method generates high-quality solutions in milliseconds. Its performance in complex scenarios with angle and threat constraints surpasses that of traditional solvers like OR-Tools (given limited time) and the standard Attention Model. Notably, the training efficiency is dramatically improved, with per-epoch time reduced from the order of hours to roughly 30 minutes, enabling practical deep RL training for this challenging domain.
Future work will focus on extending the framework to more dynamic and three-dimensional environments. This includes investigating the use of sparse, intelligent sampling of obstacle boundaries (e.g., via medial axis or Steiner point concepts) rather than all vertices, and adapting the model to handle moving obstacles and dynamic threat zones, thereby enhancing the robustness and applicability of autonomous UAV team planning in real-world operations.
