An Integrated Deep Reinforcement Learning and Test-Time Augmentation Approach for the Traveling Salesman Problem with Unmanned Drone

The rapid emergence of the low-altitude economy has fundamentally transformed logistics and service delivery paradigms. Central to this transformation is the deployment of unmanned drone systems, which offer unparalleled advantages in speed, flexibility, and accessibility, particularly for last-mile delivery. However, an unmanned drone’s operational efficacy is often constrained by limited battery life and payload capacity. A synergistic solution involves pairing an unmanned drone with a truck, creating a coordinated delivery system where the truck acts as a mobile depot, carrying both parcels and the unmanned drone to launch points near customer clusters. This collaboration, formally modeled as the Traveling Salesman Problem with Drone (TSPD), aims to minimize the total completion time for serving all customers by optimally sequencing the routes of both vehicles. The TSPD is a combinatorial optimization problem known for its significant complexity, large decision space, and intricate synchronization constraints between the truck and the unmanned drone. While traditional exact and heuristic algorithms have been applied, they often struggle with scalability and real-time adaptability in dynamic, large-scale scenarios. Deep Reinforcement Learning (DRL) has recently emerged as a powerful alternative, demonstrating strong potential for learning effective policies directly from data. This study proposes a novel hybrid DRL framework that integrates a multi-head attention encoder with an LSTM decoder for policy learning and incorporates a Test-Time Augmentation (TTA) strategy during inference to enhance solution quality and robustness.

The TSPD involves a single truck and a single unmanned drone operating from a common depot. The objective is to serve a set of customer nodes exactly once, with either the truck or the unmanned drone, and minimize the total time for both vehicles to return to the depot. We model the operational area as a complete graph $G = (V, A)$, where $V = \{1\} \cup \{2, 3, \ldots, N_c\}$ is the set of nodes (node 1 is the depot). Each node $i$ has coordinates $X_i = (x_i, y_i) \in \mathbb{R}^2$. The travel time for the truck and unmanned drone between nodes $i$ and $j$ is proportional to the Euclidean distance $d_{ij} = \lVert X_i - X_j \rVert_2$, with speeds $S_{tr}=1$ and $S_{dr} \ge 1$ (typically $S_{dr}=2$). The unmanned drone can be launched from and recovered at the depot or any node already visited by the truck. Each launch of the unmanned drone serves exactly one customer before it must return to the truck. The truck must wait at the recovery node for the unmanned drone. We ignore service, loading/unloading, and battery swap times, focusing on the routing and synchronization challenge.
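To make the setup concrete, the following minimal sketch generates a random instance and the two travel-time matrices. It assumes coordinates drawn uniformly from the unit square (a common convention in this literature, assumed here) and the default speeds above; the function names are illustrative, not part of a published implementation.

```python
import numpy as np

def make_instance(n_customers, seed=0):
    """Random TSPD instance; row 0 is the depot (node 1 in the text)."""
    rng = np.random.default_rng(seed)
    return rng.random((n_customers + 1, 2))  # coordinates in the unit square

def travel_times(coords, s_tr=1.0, s_dr=2.0):
    """Pairwise travel times d_ij / S_v for the truck and the drone."""
    diff = coords[:, None, :] - coords[None, :, :]
    d = np.linalg.norm(diff, axis=-1)        # d_ij = ||X_i - X_j||_2
    return d / s_tr, d / s_dr
```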

To solve this via DRL, we formulate the TSPD as a Markov Decision Process (MDP). The state $s_t$ at step $t$ is defined by the set of visited nodes $V_t$, the current destination nodes for the truck and unmanned drone ($m_t^{tr}, m_t^{dr}$), and the remaining travel times to those destinations ($r_t^{tr}, r_t^{dr}$). The action $a_t = (a_t^{tr}, a_t^{dr})$ specifies the next destination for each vehicle. If a vehicle is en route, its action is constrained to be its current destination. The time progression $O_t$ for a step is the minimum time until any vehicle reaches its destination or a new action is initiated:
$$ O_t = \min(r_t^{tr}, r_t^{dr}, \phi_{m_t^{tr},a_t^{tr}}^{tr}, \phi_{m_t^{dr},a_t^{dr}}^{dr}) $$
where $\phi_{i,j}^{v} = d_{ij} / S_v$ for $v \in \{tr, dr\}$. The total mission time is the sum of step durations, $O = \sum_{t=0}^{T} O_t$; the agent receives $-O$ as the episode reward, so maximizing reward is equivalent to minimizing completion time.
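As a sanity check on the formulation, the step-duration computation can be sketched as below; argument names mirror the notation above, and the surrounding episode loop (state updates, masking) is assumed.

```python
def step_duration(r_tr, r_dr, phi_tr, phi_dr):
    """O_t = min(r_t^tr, r_t^dr, phi^tr, phi^dr): the earliest event at which
    a vehicle reaches its destination and a new decision becomes available."""
    return min(r_tr, r_dr, phi_tr, phi_dr)

# Over an episode, the total mission time is O = sum_t O_t, and -O is the reward.
```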

Our solution is a policy network $\pi_\theta$ with an encoder-decoder architecture trained using the Advantage Actor-Critic (A2C) algorithm. The encoder transforms node coordinates into high-dimensional embeddings. It consists of $L$ identical layers, each containing a multi-head attention (MHA) mechanism followed by a feed-forward network (FFN), with residual connections and layer normalization. For a node $i$ with initial embedding $\mathbf{h}_i^{(0)}$, the layer $l$ computes:
$$ \mathbf{h}_i^{(l)'} = \text{MHA}\left(\text{LN}(\mathbf{h}_i^{(l-1)}),\ \{\mathbf{h}_1^{(l-1)}, \ldots, \mathbf{h}_N^{(l-1)}\}\right) $$
$$ \mathbf{h}_i^{(l)} = \text{FFN}\left(\text{LN}(\mathbf{h}_i^{(l)'} + \mathbf{h}_i^{(l-1)})\right) + \mathbf{h}_i^{(l)'} + \mathbf{h}_i^{(l-1)} $$
The MHA allows the model to capture diverse spatial dependencies between nodes, which is crucial for understanding the relative positioning of customers for the truck and the unmanned drone.
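A minimal PyTorch sketch of one such encoder layer is given below. It follows the pre-norm residual structure of the two equations above; the hyperparameters (embedding size 128, 8 heads, feed-forward width 512) are illustrative assumptions rather than values reported here.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, embed_dim=128, num_heads=8, ffn_dim=512):
        super().__init__()
        self.ln1 = nn.LayerNorm(embed_dim)
        self.mha = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(embed_dim)
        self.ffn = nn.Sequential(
            nn.Linear(embed_dim, ffn_dim), nn.ReLU(), nn.Linear(ffn_dim, embed_dim)
        )

    def forward(self, h):                     # h: (batch, N, embed_dim)
        z = self.ln1(h)
        attn_out, _ = self.mha(z, z, z)       # h^(l)' = MHA(LN(h^(l-1)), ...)
        res = attn_out + h                    # h^(l)' + h^(l-1)
        return self.ffn(self.ln2(res)) + res  # FFN(LN(...)) plus residual
```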

The decoder is responsible for sequentially generating the route. It uses an LSTM to maintain a hidden state that encodes the history of decisions and the current state of the system (e.g., locations of vehicles, visited nodes). At decoding step $t$, the LSTM takes the embedding of the last chosen node and its previous hidden state to output a context vector. This context is combined with the graph embedding (mean of all node embeddings) and a mask representing feasible actions to compute a probability distribution over the next node to be served by either vehicle. The use of LSTM is intentional; its gated memory mechanism is adept at modeling the strong temporal dependencies and state evolution inherent in constructing a cooperative route for the truck and unmanned drone, more so than a Transformer decoder in this sequential decision-making context.
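The sketch below illustrates one plausible decoding step consistent with this description: an LSTM cell updates the history-aware hidden state, the context is fused with the mean graph embedding, and infeasible actions are masked out before the softmax. All module and tensor names are hypothetical.

```python
import torch
import torch.nn as nn

class DecoderStep(nn.Module):
    def __init__(self, embed_dim=128):
        super().__init__()
        self.lstm = nn.LSTMCell(embed_dim, embed_dim)
        self.context_proj = nn.Linear(2 * embed_dim, embed_dim)

    def forward(self, last_node_emb, hc, node_embs, feasible_mask):
        # last_node_emb: (B, D); node_embs: (B, N, D); feasible_mask: (B, N) bool
        h, c = self.lstm(last_node_emb, hc)        # encode decision history
        graph_emb = node_embs.mean(dim=1)          # mean of all node embeddings
        query = self.context_proj(torch.cat([h, graph_emb], dim=-1))
        logits = torch.einsum('bd,bnd->bn', query, node_embs)  # compatibility
        logits = logits.masked_fill(~feasible_mask, float('-inf'))
        probs = torch.softmax(logits, dim=-1)      # distribution over next node
        return probs, (h, c)
```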

The policy $\pi_\theta(a_t | s_t)$ defined by this network is optimized using the A2C framework. The objective is to maximize the expected cumulative reward (minimize expected time). The gradient is estimated as:
$$ \nabla_\theta J(\theta) \approx \mathbb{E}_{\pi_\theta} \left[ \sum_t \nabla_\theta \log \pi_\theta(a_t | s_t) A(s_t, a_t) \right] $$
where $A(s_t, a_t)$ is the advantage function, calculated as the difference between the return (cumulative reward) and a state-value baseline $V_\psi(s_t)$ estimated by a critic network. This advantage formulation reduces variance and stabilizes training. The critic network shares the encoder and has a separate value head.
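A compact sketch of the resulting training loss, assuming the per-episode log-probabilities, returns (negated mission times), and critic values have already been collected as tensors; the weighting coefficient is an illustrative assumption.

```python
import torch
import torch.nn.functional as F

def a2c_loss(log_probs, returns, values, value_coef=0.5):
    # log_probs, returns, values: (batch,) tensors
    advantage = returns - values.detach()          # A(s,a) = return - baseline
    policy_loss = -(log_probs * advantage).mean()  # policy gradient with baseline
    value_loss = F.mse_loss(values, returns)       # regress critic toward returns
    return policy_loss + value_coef * value_loss
```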

A key innovation of our work is the application of Test-Time Augmentation (TTA) during inference. Unlike traditional methods that perform a single forward pass, TTA generates multiple predictions for a single problem instance by applying invariant transformations to the input coordinates. For the TSPD with Euclidean distances, transformations such as rotations (90°, 180°, 270°) and reflections (e.g., about the line $x = 1/2$, mapping $x \mapsto 1 - x$) preserve all inter-node distances. Therefore, the transformed problem is mathematically identical to the original. However, the neural network’s policy, which is an approximation, may yield different solution trajectories for different geometric representations of the same graph. By generating a set of candidate solutions $\{\tau_1, \tau_2, \ldots, \tau_K\}$ via these transformations and selecting the one with the minimum total time $O^* = \min(O(\tau_1), \ldots, O(\tau_K))$, we effectively perform a robust ensemble over the policy’s output space. This simple, zero-retraining technique significantly boosts performance and reliability.
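The inference loop can be sketched as follows, assuming unit-square coordinates and a `policy` callable that returns a tour and its total time for a given coordinate array (both placeholders for illustration). The eight dihedral symmetries of the square cover the rotations and reflection mentioned above.

```python
import numpy as np

def rotate90(coords):
    # 90-degree rotation about the square's center: (x, y) -> (y, 1 - x)
    return np.stack([coords[:, 1], 1.0 - coords[:, 0]], axis=1)

def augmentations(coords):
    """Yield the 8 distance-preserving dihedral transforms of the unit square."""
    c = coords
    for _ in range(4):                                    # 0, 90, 180, 270 degrees
        yield c
        yield np.stack([1.0 - c[:, 0], c[:, 1]], axis=1)  # reflection x -> 1 - x
        c = rotate90(c)

def tta_solve(policy, coords):
    # Run the policy on every transformed instance and keep the best solution.
    best_tour, best_time = None, float('inf')
    for aug in augmentations(coords):
        tour, total_time = policy(aug)   # one greedy rollout per variant
        if total_time < best_time:
            best_tour, best_time = tour, total_time
    return best_tour, best_time
```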

We conduct extensive experiments on benchmark instances of size $n = 20, 50,$ and $100$ customers. The models are trained on randomly generated instances and evaluated on a fixed test set. We compare our method against several baselines: a Hybrid Genetic Algorithm (GA), the exact Branch-and-Bound method (DPS) with a node limit, the heuristic TSP-ep-all, and several DRL approaches (NCO, HM, Zeng, Xiao). The primary metrics are the average total time (Obj) and the computation time (Time).

Table 1: Comparative Results on TSPD Instances (Obj = average total time; Gap is measured relative to Ours; Time = per-instance computation time)

Model         TSPD-20 Obj / Gap / Time    TSPD-50 Obj / Gap / Time     TSPD-100 Obj / Gap / Time
GA            295.49 / 3.48% / 0.5s       411.47 / 0.71% / 3s          568.13 / 0.64% / 14s
Zeng (DRL)    285.80 / 0.09% / 0.1s       407.04 / -0.37% / 0.3s       564.36 / -0.02% / 0.5s
DPS/10        292.23 / 2.29% / 0.1s       420.51 / 2.93% / 0.2s        570.74 / 1.10% / 0.3s
Xiao (DRL)    286.88 / 0.47% / 0.1s       410.21 / 0.41% / 0.2s        567.82 / 0.58% / 0.5s
HM (DRL)      286.16 / 0.22% / 0.1s       411.50 / 0.72% / 0.3s        566.51 / 0.35% / 0.5s
Ours (TTA)    285.54 / 0.00% / 0.1s       408.54 / 0.00% / 0.3s        564.52 / 0.00% / 0.5s

The results demonstrate the effectiveness of our approach. Our model consistently achieves highly competitive solution quality, outperforming the GA and node-limited DPS baselines and most competing DRL methods across all scales. While the TSP-ep-all heuristic can find slightly better solutions given substantially more computation time (hours for n=100), our DRL+TTA method produces near-optimal solutions in a fraction of a second, highlighting its computational efficiency and suitability for real-time or large-scale applications. The learned policy thus exploits the unmanned drone’s flexibility efficiently.

To validate the contribution of TTA, we perform an ablation study comparing the policy with and without TTA across different unmanned drone-to-truck speed ratios ($S_{dr}/S_{tr}$). The results confirm that TTA consistently improves solution quality regardless of the speed ratio or problem size.

Table 2: Ablation Study on TTA Effectiveness (Objective Value; speed ratio = $S_{dr}/S_{tr}$)

Speed Ratio   TSPD-20 No TTA / With TTA   TSPD-50 No TTA / With TTA    TSPD-100 No TTA / With TTA
1.5           319.20 / 318.60             465.23 / 464.98              635.86 / 633.21
2.0           286.16 / 285.54             411.50 / 408.54              566.51 / 564.52
2.5           278.45 / 277.88             397.44 / 394.21              541.56 / 538.51
3.0           277.15 / 276.96             389.55 / 386.18              530.30 / 527.53

We also ablate core architectural choices. Replacing the multi-head attention encoder with a single-head version leads to a noticeable performance drop (Gap +1.21% on n=20), underscoring the importance of capturing diverse spatial relationships for the unmanned drone and truck routing. Substituting the LSTM decoder with a Transformer decoder also results in slightly worse performance (Gap +0.15%), validating our design choice for sequential decision-making. Furthermore, training with standard Actor-Critic (without advantage) shows slower and less stable convergence compared to our A2C-based training, as illustrated in the learning curves.

In conclusion, this paper presents a novel and effective framework for solving the Traveling Salesman Problem with an unmanned drone. By integrating a hybrid attention-LSTM policy network with a Test-Time Augmentation strategy, we achieve an excellent balance between solution quality and computational speed. The method outperforms several traditional and contemporary DRL algorithms, demonstrating its potential for real-world logistics planning where efficient coordination between a truck and an unmanned drone is paramount. Future work will focus on incorporating more realistic constraints, such as unmanned drone battery limits and dynamic customer demands, and extending the framework to systems with multiple unmanned drones.
