Intelligent Path Optimization for Cooperative Unmanned Drone Fleets via Gated Attention Networks

The rapid development of smart cities and the low-altitude economy has positioned Unmanned Aerial Vehicles (UAVs), or drones, as transformative agents in modern logistics. Multi-unmanned drone systems, leveraging their three-dimensional mobility, are pivotal for constructing efficient urban aerial delivery networks. In time-sensitive scenarios such as emergency medical supply delivery, these systems can circumvent ground traffic congestion and infrastructure limitations, significantly boosting efficiency and reducing operational costs. However, a fundamental constraint on unmanned drone applications in these integrated low-altitude transport systems is their limited onboard energy, which directly impacts operational range, safety, and overall mission efficacy.

Cooperative path planning for multiple unmanned drones, a key component of this ecosystem, is crucial for optimizing flight costs. In urban logistics scenarios featuring charging stations, an unmanned drone fleet must intelligently coordinate task execution with energy replenishment to address the dynamic decision-making challenges imposed by range limitations. The core optimization problem involves determining the optimal sequence for visiting customer nodes and scheduling recharging stops to minimize the total cost of the cooperative unmanned drone fleet, which is essentially a variant of the NP-hard Vehicle Routing Problem (VRP).

While traditional methods like exact algorithms, genetic algorithms, or ant colony optimization have been applied, they often struggle with computational complexity or sensitivity to parameter tuning. Recently, Deep Reinforcement Learning (DRL) has emerged as a promising data-driven approach, transforming NP-hard problems into sequential decision-making processes guided by reward signals. Building upon the success of DRL and attention mechanisms in solving VRP variants, our research aims to extend these capabilities to the logistics delivery problem with stringent energy constraints for multiple unmanned drones.

Existing research has made significant strides, employing mixed-integer programming, metaheuristics, and layered optimization strategies. However, for the specific scenario involving fixed charging stations, limitations persist: 1) Many methods use a two-stage optimization (path planning then charging scheduling), fragmenting the solution space; 2) Traditional algorithms rely on precise, static energy models; 3) There is room for improvement in energy management robustness and computational efficiency for large-scale dynamic scenes. To address these gaps, we propose a novel Gated Aggregation Network based Deep Reinforcement Learning method (GNDRL) for cooperative unmanned drone fleet path optimization. Our contributions are as follows: First, we design a dual-stage state-enhancement and decision-generation architecture for a multi-unmanned drone协同mode. Second, within the state-enhancement module, we design a feature augmentation method based on a gated attention mechanism. This mechanism dynamically filters and fuses multi-branch attention outputs, actively reinforcing key node information. Gated connections within each sub-layer enable adaptive feature weighting. Third, the decision-generation module incorporates real-time energy states to guide sequence generation, enabling adaptive policy adjustment. Finally, we conduct extensive experiments, including challenging scalability tests, demonstrating our method’s stable performance under varying task and charger densities.

Problem Description and Mathematical Modeling

The core of the multi-unmanned drone cooperative logistics path optimization problem is to achieve global coverage of a delivery network by optimizing the task planning strategy under constraints like battery capacity, based on the spatial distribution of task nodes. This process can be modeled as an extension of the VRP with energy constraints. We formulate a Mixed-Integer Programming (MIP) model to integrate key elements like energy consumption and path optimization.

The system is defined on a directed graph $$G=(V, E)$$. The node set $$V$$ includes a single depot (node 0), a set of $$n$$ delivery task nodes $$N$$, and a set of $$z$$ charging station nodes $$C$$. Each node $$v \in V$$ has attributes (type, visitation flag, 2D coordinates). The edge set $$E$$ connects all node pairs, with each edge $$(i,j) \in E$$ having a weight $$d_{ij}$$ representing the Euclidean distance. A feasible path for an unmanned drone starts at the depot, visits a sequence of task nodes, may visit charging station nodes multiple times based on energy state, and ends at the depot (denoted as node $$n+1$$).

We model the unmanned drone’s physical operation on a 2D plane. It starts from the depot with a maximum flight range $$L_{max}$$. Its position at time $$t$$ is $$p(t)=(x(t), y(t))$$, with remaining range $$Q_t$$. The travel update is: $$p(t+1) = p(t) + d_{t+1}$$, where $$d_{t+1}$$ is the distance traveled. We assume constant cruise speed $$V_{mr}$$. Flight energy consumption $$e_p$$ is linear with distance: proportional to unit distance consumption $$r$$ and traveled distance $$d_{ij}$$. Separate constant energy costs are defined for vertical take-off ($$e_u$$), landing ($$e_d$$), and hovering ($$e_h$$).

To distinguish multiple visits to the same charging station, we create $$s_n – 1$$ virtual nodes co-located with each physical charging station $$s$$. The total number of nodes is:
$$n_{tol} = n + 2 + \sum_{s=0}^{z} n_s$$
Let $$K$$ be the set of unmanned drones. The objective is to minimize the total travel distance for the unmanned drone fleet:
$$\min \sum_{k \in K} \sum_{i \in V_0} \sum_{j \in V_{n+1}, j \neq i} d_{ij} x_{ij}^k$$
where $$x_{ij}^k$$ is a binary decision variable equal to 1 if unmanned drone $$k$$ travels from node $$i$$ to node $$j$$.

The model is subject to the following key constraints:

1. Task Coverage: Each task node must be visited exactly once.
$$\sum_{k \in K} \sum_{j \in V_{n+1}, j \neq i} x_{ij}^k = 1, \quad \forall i \in N$$

2. Charging Station Visits: Each charging station (including virtual nodes) can be visited at most once.
$$\sum_{k \in K} \sum_{j \in V_{n+1}, j \neq i} x_{ij}^k \leq 1, \quad \forall i \in C’$$

3. Flow Conservation: For each task/charging node and each unmanned drone, flow in equals flow out.
$$\sum_{j \in V_{n+1}, j \neq i} x_{ji}^k – \sum_{j \in V_0, j \neq i} x_{ij}^k = 0, \quad \forall i \in V_a, k \in K$$

4. Energy Feasibility (Flight to Task): When flying from task $$i$$ to task $$j$$, the remaining energy $$y_j^k$$ at $$j$$ must be positive and follow the consumption rule.
$$0 < y_j^k \leq y_i^k – (r d_{ij} + e_h) + Q(1 – x_{ij}^k), \quad \forall i \in N, j \in N, i \neq j, k \in K$$

5. Energy Feasibility (Flight to Charger): When flying from task $$i$$ to charger $$j$$.
$$0 < y_j^k \leq y_i^k – (r d_{ij} + e_d) + Q(1 – x_{ij}^k), \quad \forall i \in N, j \in C’, i \neq j, k \in K$$

6. Recharge & Restart: After charging at station $$i$$, the unmanned drone restarts with near-full capacity $$Q$$ minus the energy to reach the next task $$j$$.
$$0 < y_i^k < Q – (r d_{ij} + e_h + e_u), \quad \forall i \in C’ \cup \{0\}, j \in N, i \neq j, k \in K$$

7. Return to Depot (possibly via charger): Similar logic applies for the final return leg.
$$0 < y_i^k < Q – (r d_{ij} + e_d + e_u), \quad \forall i \in C’ \cup \{0\}, j \in C’ \cup \{n+1\}, i \neq j, k \in K$$

8. Subtour Elimination: Ensures a single continuous path per unmanned drone.
$$u_i^k – u_j^k + n_{tol} x_{ij}^k \leq n_{tol} – 1, \quad \forall i,j \in V_a \cup V_{n+1}, i \neq j, k \in K$$

9. Load Update & Capacity: Tracks payload $$l_i^k$$ and ensures it never exceeds capacity $$Q_{cap}$$.
$$l_j^k \geq l_i^k + q_j x_{ij}^k – Q_{cap}(1 – x_{ij}^k), \quad \forall i,j \in V_a, k \in K$$
$$0 \leq l_i^k \leq Q_{cap}, \quad \forall i \in V’, k \in K$$

10. Depot Start and End: Each unmanned drone route must start and end at the depot.
$$\sum_{j \in N \cup C’} x_{0j}^k = 1, \quad \sum_{i \in N \cup C’} x_{i,n+1}^k = 1, \quad \forall k \in K$$

Table 1: Notation Definitions
Variable	Meaning
$$K$$	Set of unmanned drones
$$N$$	Set of task nodes
$$C$$	Set of charging stations
$$C’$$	Set of all charging station visits (incl. virtual nodes)
$$d_{ij}$$	Distance from node $$i$$ to $$j$$
$$x_{ij}^k$$	Binary, 1 if unmanned drone $$k$$ travels $$i \to j$$
$$y_i^k$$	Remaining energy of unmanned drone $$k$$ upon reaching $$i$$
$$Q$$	Battery capacity (max energy)
$$L_{max}$$	Maximum flight distance (range)
$$r$$	Energy consumption per unit distance flown
$$e_u, e_d, e_h$$	Energy cost for take-off, landing, hover

The GNDRL Methodology

MDP Formulation

We frame the sequential solution construction as a Markov Decision Process (MDP).

State Space ($$S$$): The state $$s_t$$ at time $$t$$ is a triplet $$(N_t, Q_t, l_t)$$. $$N_t$$ is the sequence of visited nodes. $$Q_t$$ is the vector of remaining energy for each unmanned drone. $$l_t$$ records the current payload of the active unmanned drone.

Action Space ($$A$$): The set of task nodes that are unvisited and reachable by the active unmanned drone given its current energy and payload.

State Transition ($$P$$): Deterministic based on the chosen action. The state updates by appending the selected node to $$N_t$$, updating $$Q_t$$ and $$l_t$$ accordingly.

Reward ($$R$$): Upon completion, a full solution $$(D_1, D_2, …, D_K)$$ is obtained, where $$D_n$$ is the total travel distance for unmanned drone $$n$$. The reward is the negative total distance, encouraging minimization:
$$R = -\sum_{n=1}^{K} D_n$$

Solution Construction Policy with Gated Aggregation Network

Our policy network is a dual-stage encoder-decoder architecture based on multi-head attention, improved with gating mechanisms for the heterogeneous multi-unmanned drone routing problem.

1. State Enhancement Module (Encoder): This module processes the raw node features (coordinates, type) to generate rich contextual embeddings. It uses a linear projection $$f(X) = XW^T + b$$ to obtain initial node embeddings $$h_i^0$$. The core of the module is $$L$$ layers of Gated Attention Blocks. Each block consists of:

Gated Multi-Head Attention Sub-layer: We employ two parallel attention mechanisms: Self-Attention (SA) over all nodes and a Cross-Attention (CA) focused on task nodes. A learnable gating unit dynamically blends their outputs:
$$\lambda_i = \sigma(W_g \cdot \text{Concat}(SA(h_i), CA(h_i)) + b_g)$$
$$\tilde{h}_i = \lambda_i \cdot SA(h_i) + (1 – \lambda_i) \cdot CA(h_i)$$
where $$\sigma$$ is the sigmoid function. This allows the model to adaptively balance global and task-focused relational information.
Gated Connection & Normalization: The attention output is combined with the layer’s input via another gating mechanism $$g$$ and then batch-normalized (BN).
$$h_i^{‘(l)} = BN(g \cdot \tilde{h}_i^{(l)} + (1-g) \cdot h_i^{(l-1)})$$
Feed-Forward Network (FFN) & Gated Connection: A standard FFN with GELU activation follows, again with a gated residual connection and BN.
$$FFN(h) = W_1 \cdot \text{GELU}(W_0 h + b_0) + b_1$$
$$h_i^{(l)} = BN(h_i^{‘(l)} + FFN(h_i^{‘(l)}))$$

The final graph-level embedding $$h_{graph}$$ is obtained by averaging the last layer’s node embeddings:
$$h_{graph} = \frac{1}{n+z+2} \sum_{i=0}^{n+z+1} h_i^{(L)}$$

2. Decision Generation Module (Decoder): This autoregressive module generates the solution sequence step-by-step. At decoding step $$t$$, the context vector $$h_t^{con}$$ is computed by concatenating the graph embedding, the embedding of the last visited node $$h_{\pi_{t-1}}^{(L)}$$, and the current energy state $$Q_t$$ of the active unmanned drone, followed by a linear projection:
$$h_t^{con} = W_{graph} h_{graph} + W_{con} \text{Concat}(h_{\pi_{t-1}}^{(L)}, Q_t)$$
A multi-head attention layer uses $$h_t^{con}$$ as the query to attend to the encoder’s key-value memories, producing a refined context $$c_t$$. A final single-head attention layer computes compatibility scores for all candidate nodes $$i$$:
$$u_i^t = \begin{cases}
C \cdot \tanh\left(\frac{q_t \cdot k_i}{\sqrt{d_k}}\right) & \text{if node } i \text{ is eligible (unvisited, reachable)} \\
-\infty & \text{otherwise}
\end{cases}$$
where $$q_t = W_Q c_t$$ and $$k_i = W_K h_i^{(L)}$$. The probability of selecting node $$i$$ is given by the softmax over these scores:
$$p_i = \text{softmax}(u_i^t)$$
The next node $$\pi_t$$ is selected either greedily (argmax) or by sampling from this distribution during training.

Training Procedure

We train the GNDRL model using the REINFORCE policy gradient algorithm with a baseline network to reduce variance. The training algorithm follows these steps:

Initialize policy network parameters $$\theta_{\pi}$$ and baseline network parameters $$\theta_b$$.
For each epoch, generate a batch of problem instances.
For each instance, use the current policy $$\pi_{\theta}$$ to construct a complete solution (sequence of visited nodes).
Compute the total reward (negative tour length) $$R(\lambda)$$ for the instance $$\lambda$$.
The baseline network (a copy of the policy network trained using greedy rollouts) provides a baseline reward $$b(\lambda)$$.
The policy gradient is estimated as:
$$\nabla_{\theta} J(\theta) \approx \frac{1}{B} \sum_{\lambda=1}^{B} (R(\lambda) – b(\lambda)) \nabla_{\theta} \log P_{\theta}(\pi | \lambda)$$
where $$B$$ is the batch size.
Update the policy network parameters using the Adam optimizer. Periodically, if the policy network performance significantly surpasses the baseline network, the baseline parameters are updated to the policy parameters.

The time complexity of the GNDRL algorithm can be expressed as $$O(n_{epoch} \cdot n_{batch} \cdot n_{\theta} \cdot C_g)$$, where $$n_{epoch}$$ is training epochs, $$n_{batch}$$ is batch iterations, $$n_{\theta}$$ is the number of parameters, and $$C_g$$ is the per-parameter gradient computation cost.

Experimental Verification and Analysis

We implemented GNDRL in Python 3.9 using PyTorch 1.11.1. Experiments were conducted on a system with an Intel Core i7-14700KF and an NVIDIA GeForce RTX 4070 Ti Super.

Experimental Setup

Data Generation: We created instances of three scales: D20 (20 tasks, 2 chargers), D50 (50 tasks, 10 chargers), and D100 (100 tasks, 20 chargers). Task and depot nodes are uniformly sampled in $$[0,1]^2$$. Charger nodes are placed on a fixed grid {0, 0.25, 0.5, 0.75, 1}$$^2$$ to simulate infrastructure. The maximum flight range $$L_{max}$$ is set to 3, and unit distance energy consumption $$r=1$$.

Hyperparameters: The encoder has 3 layers, 8 attention heads, and embedding dimension 128. Models are trained for 100 epochs with a batch size of 256, using the Adam optimizer with an initial learning rate of $$10^{-4}$$ decaying by 0.995 per epoch.

Baselines: We compare against classical metaheuristics: Variable Neighborhood Search (VNS), Simulated Annealing (SA), Genetic Algorithm (GA), Particle Swarm Optimization (PSO). We also compare against two DRL-based methods: a standard Attention Model (AM) and a Heterogeneous Attention DRL model (HADRL). For fairness, all DRL models share similar architecture scales.

Results and Analysis

1. Training Convergence: Figure 1 shows the training curves (objective value vs. steps). GNDRL converges stably across all scales. While HADRL performs well on D20, GNDRL shows superior generalization on larger D50 and D100 instances, achieving final values near 11.25 and 15.5 respectively, outperforming HADRL and AM by 8.9% to 17.6%.

2. Ablation Study: We ablated the two key components: Gated Fusion Attention (GFA) and Gated Connections (GC). Results in Figure 2 show that both components contribute to performance, especially on larger instances. The full GNDRL model consistently achieves the lowest path length.

3. Performance Comparison: Table 2 summarizes the results on D20, D50, D100 instances, showing the average tour length (objective) and computation time. GNDRL achieves the best solution quality across all scales.

Table 2: Algorithm Performance Comparison on Standard Instances
Method	D20 (Obj/Time)	D50 (Obj/Time)	D100 (Obj/Time)
VNS	8.07 / 0.16s	20.08 / 1.16s	41.52 / 4.99s
SA	12.15 / 0.01s	32.14 / 0.01s	61.88 / 0.04s
GA	7.55 / 8.29s	20.60 / 56.91s	39.36 / 254s
PSO	8.30 / 0.99s	22.57 / 1.20s	46.31 / 4.67s
AM	7.45 / 0.13s	9.16 / 0.16s	12.70 / 0.30s
HADRL	6.74 / 0.15s	8.52 / 0.16s	12.66 / 0.27s
GNDRL	6.34 / 0.15s	8.49 / 0.17s	11.20 / 0.27s

GNDRL finds paths approximately 10% shorter than the best DRL baselines. While metaheuristics like GA can find good solutions, their computation time grows prohibitively (254s for D100). All DRL methods, including GNDRL, solve instances in under 0.3 seconds by leveraging pre-trained models, showcasing superior computational efficiency.

4. Visualization: Figure 3 visualizes planned paths for D100. GNDRL produces coherent, non-crossing routes where unmanned drones naturally weave task visits with efficient charging stops. In contrast, paths from AM and HADRL show more crossing and detours, especially in large-scale scenarios.

Generalization and Robustness Tests

We conducted extensive tests to evaluate GNDRL’s generalization capability.

1. Scalability to Larger Instances: We tested the model trained on D100 on much larger instances D150 (150 tasks, 30 chargers) and D200 (200 tasks, 35 chargers). Results in Table 3 confirm that GNDRL maintains its advantage, producing high-quality solutions in under 0.5 seconds, while traditional methods struggle with time or quality.

Table 3: Performance on Large-Scale Instances (D150, D200)
Method	D150 Obj/Time	D200 Obj/Time
GA	63.86 / 606s	82.13 / 1043s
AM	18.89 / 0.38s	22.75 / 0.49s
HADRL	19.10 / 0.41s	18.60 / 0.49s
GNDRL	15.28 / 0.39s	18.40 / 0.49s

2. Sensitivity to Energy and Charger Density: We tested the model under tighter energy constraints (90% of original battery capacity) and reduced charger density. Table 4 shows that with 90% battery, GNDRL’s performance degrades only slightly (from 6.34 to 6.55 on D20). When the number of chargers is reduced (e.g., from 10 to 6 in a D50-area), the planned paths remain efficient, demonstrating robust energy management.

Table 4: Robustness Under Energy Constraints
Scenario	Objective Value	Time
D20 (100% Battery)	6.34	0.15s
D20 (90% Battery)	6.55	0.21s

3. Generalization to Different Data Distributions: We tested the model (trained on uniform data) on instances generated from Gaussian and Rayleigh distributions with varying densities. Tables 5 and 6 show that GNDRL consistently outperforms AM and HADRL across these heterogeneous distributions, proving its strong generalization ability. On non-uniform Rayleigh distributions, GNDRL’s path length is up to 12% shorter than the best baseline.

Table 5: Performance under Gaussian Distribution
Method	G50 (std=0.3)	G100 (std=0.6)
AM	9.86	16.40
HADRL	8.57	14.58
GNDRL	8.39	14.01

Table 6: Performance under Rayleigh Distribution
Method	R50 (scale=0.25)	R100 (scale=0.4)
AM	24.40	12.96
HADRL	7.93	11.63
GNDRL	5.85	11.50

Conclusion

In this work, we have addressed the multi-unmanned drone cooperative path optimization problem for logistics delivery with charging stations. We proposed GNDRL, a novel deep reinforcement learning method featuring a gated aggregation network. The state enhancement module utilizes gated attention mechanisms to dynamically fuse heterogeneous node information, while the decision generation module effectively incorporates real-time energy states. This integrated approach allows for the joint optimization of task sequencing and charging scheduling within a single end-to-end learning framework.

Extensive experiments demonstrate that GNDRL outperforms both classical metaheuristics and state-of-the-art DRL baselines in terms of solution quality (achieving up to 12% shorter paths on non-uniform distributions) while maintaining competitive sub-second computation times. The method exhibits strong generalization to larger scales, tighter energy constraints, and different spatial distributions of tasks, proving its robustness and practical potential.

For future work, we plan to extend the framework to more dynamic scenarios, such as handling temporary charging station failures or incorporating online re-planning mechanisms. Integrating more realistic energy models that account for payload-dependent consumption and adding communication constraints between unmanned drones are also important directions to enhance the applicability of our approach to real-world urban air mobility operations.