RESEARCH ARTICLE   Open Access    

A comparison of state-of-the-art reinforcement learning algorithms applied to the traveling salesman problem

Schröder K, Kastius A, Schlosser R. 2026. The Knowledge Engineering Review 41: e001 doi: 10.48130/ker-0026-0001

Abstract: Combinatorial optimization problems are highly relevant for real-world applications. For complex problems, the use of exact solution techniques is limited to small problem sizes, and, hence, effective heuristic approaches are needed. Furthermore, most approaches require that for different input data, solutions have to be computed individually for each problem instance. Recent developments in Reinforcement Learning (RL) provide promising alternatives, as they allow for heuristic out-of-the-box solutions for arbitrary input data after being trained. Transformer-based RL approaches even have the capability to generalize with regard to the problem size and allow for the provision of quick solutions for problems that are larger than they have been trained on. However, despite their potential, the number of different RL algorithms is large, and their performance for combinatorial optimization problems is unclear. To resolve this issue, the performance of different state-of-the-art RL algorithms is compared when applied to the classical Traveling Salesman Problem (TSP) and the Orienteering Problem (OP). Some RL algorithms are found to achieve promising results with: (i) near-optimal performance compared to optimal solutions for single tractable problem instances; while (ii) providing the capability to generalize regarding both the input data (continuous coordinates) and the problem size (number of nodes).
References
  • [1] Applegate D, Bixby RE, Chvátal V, Cook WJ. 2019. Concorde TSP Solver. www.math.uwaterloo.ca/tsp/concorde.html (Accessed 16 May 2024)
    [2] Gurobi Optimization, LLC. 2022. Gurobi Optimizer Reference Manual. www.gurobi.com
    [3] Flood MM. 1956. The traveling-salesman problem. Operations Research 4(1):61−75 doi: 10.1287/opre.4.1.61

    [4] Rosenkrantz DJ, Stearns RE, Lewis PM. 1977. An analysis of several heuristics for the traveling salesman problem. SIAM Journal on Computing 6(3):563−581 doi: 10.1137/0206041

    [5] Mazyavkina N, Sviridov S, Ivanov S, Burnaev E. 2021. Reinforcement learning for combinatorial optimization: a survey. Computers & Operations Research 134:105400 doi: 10.1016/j.cor.2021.105400

    [6] Chen D, Imdahl C, Lai D, Van Woensel T. 2025. The Dynamic Traveling Salesman Problem with Time-Dependent and Stochastic travel times: a deep reinforcement learning approach. Transportation Research Part C: Emerging Technologies 172:105022 doi: 10.1016/j.trc.2025.105022

    [7] Lähdeaho O, Hilmola OP. 2024. An exploration of quantitative models and algorithms for vehicle routing optimization and traveling salesman problems. Supply Chain Analytics 5:100056 doi: 10.1016/j.sca.2023.100056

    [8] Li J, Ma Y, Gao R, Cao Z, Lim A, et al. 2022. Deep reinforcement learning for solving the heterogeneous capacitated vehicle routing problem. IEEE Transactions on Cybernetics 52(12):13572−13585 doi: 10.1109/TCYB.2021.3111082

    [9] Zhang R, Prokhorchuk A, Dauwels J. 2020. Deep reinforcement learning for traveling salesman problem with time windows and rejections. 2020 International Joint Conference on Neural Networks (IJCNN). July 19−24, 2020. Glasgow, United Kingdom. USA: IEEE. pp. 1−8 doi: 10.1109/ijcnn48605.2020.9207026
    [10] Zhang R, Zhang C, Cao Z, Song W, Tan PS, et al. 2023. Learning to solve multiple-TSP with time window and rejections via deep reinforcement learning. IEEE Transactions on Intelligent Transportation Systems 24(1):1325−1336 doi: 10.1109/TITS.2022.3207011

    [11] Golden BL, Levy L, Vohra R. 1987. The orienteering problem. Naval Research Logistics 34(3):307−318 doi: 10.1002/1520-6750(198706)34:3<307::aid-nav3220340302>3.0.co;2-d

    [12] Tsiligirides T. 1984. Heuristic methods applied to orienteering. Journal of the Operational Research Society 35(9):797−809 doi: 10.1057/jors.1984.162

    [13] Kobeaga G, Merino M, Lozano JA. 2020. A revisited branch-and-cut algorithm for large-scale orienteering problems. arXiv 2011.02743 doi: 10.48550/arXiv.2011.02743

    [14] Kobeaga G, Merino M, Lozano JA. 2018. An efficient evolutionary algorithm for the orienteering problem. Computers & Operations Research 90:42−59 doi: 10.1016/j.cor.2017.09.003

    [15] Bellman R. 1957. A Markovian decision process. Indiana University Mathematics Journal 6(4):679−684 doi: 10.1512/iumj.1957.6.56038

    [16] Karp RM. 1977. Probabilistic analysis of partitioning algorithms for the traveling-salesman problem in the plane. Mathematics of Operations Research 2(3):209−224 doi: 10.1287/moor.2.3.209

    [17] Traub V, Vygen J. 2024. Approximation Algorithms for Traveling Salesman Problems. Cambridge, UK: Cambridge University Press. doi: 10.1017/9781009445436
    [18] Strutz T. 2021. Travelling santa problem: optimization of a million-households tour within one hour. Frontiers in Robotics and AI 8:652417 doi: 10.3389/frobt.2021.652417

    [19] Valenzuela CL, Jones AJ. 1993. Evolutionary divide and conquer (I): a novel genetic approach to the TSP. Evolutionary Computation 1(4):313−333 doi: 10.1162/evco.1993.1.4.313

    [20] Liao E, Liu C. 2018. A hierarchical algorithm based on density peaks clustering and ant colony optimization for traveling salesman problem. IEEE Access 6:38921−38933 doi: 10.1109/ACCESS.2018.2853129

    [21] Mariescu-Istodor R, Fränti P. 2021. Solving the large-scale TSP problem in 1 h: santa Claus challenge 2020. Frontiers in Robotics and AI 8:689908 doi: 10.3389/frobt.2021.689908

    [22] Alanzi E, El Bachir Menai M. 2025. Solving the traveling salesman problem with machine learning: a review of recent advances and challenges. Artificial Intelligence Review 58(9):267 doi: 10.1007/s10462-025-11267-x

    [23] Bengio Y, Lodi A, Prouvost A. 2021. Machine learning for combinatorial optimization: a methodological tour d'horizon. European Journal of Operational Research 290(2):405−421 doi: 10.1016/j.ejor.2020.07.063

    [24] Deudon M, Cournut P, Lacoste A, Adulyasak Y, Rousseau LM. 2018. Learning heuristics for the TSP by policy gradient. In Integration of Constraint Programming, Artificial Intelligence, and Operations Research, ed. van Hoeve WJ. Cham: Springer. pp. 170−181 doi: 10.1007/978-3-319-93031-2_12
    [25] Vinyals O, Fortunato M, Jaitly N. 2015. Pointer Networks. Advances in Neural Information Processing Systems 28 (NIPS 2015). pp. 1−9 https://proceedings.neurips.cc/paper_files/paper/2015/hash/29921001f2f04bd3baee84a12e98098f-Abstract.html
    [26] Bello I, Pham H, Le QV, Norouzi M, Bengio S. 2016. Neural combinatorial optimization with reinforcement learning. arXiv 1611.09940 doi: 10.48550/arXiv.1611.09940

    [27] Kool W, van Hoof H, Welling M. 2018. Attention, learn to solve routing problems! arXiv 1803.08475 doi: 10.48550/arXiv.1803.08475

    [28] Wang J, Xiao C, Wang S, Ruan Y. 2023. Reinforcement learning for the traveling salesman problem: Performance comparison of three algorithms. The Journal of Engineering 2023(9):e12303 doi: 10.1049/tje2.12303

    [29] Bresson X, Laurent T. 2021. The transformer network for the traveling salesman problem. arXiv 2103.03012 doi: 10.48550/arXiv.2103.03012

    [30] Dai H, Khalil EB, Zhang Y, Dilkina B, Song L. 2017. Learning combinatorial optimization algorithms over graphs. arXiv 1704.01665 doi: 10.48550/arXiv.1704.01665

    [31] Joshi CK, Laurent T, Bresson X. 2019. An efficient graph convolutional network technique for the travelling salesman problem. arXiv 1906.01227 doi: 10.48550/arXiv.1906.01227

    [32] BinJubier MB, Ismail MA, Tusher EH, Aljanabi M. 2024. A GPU accelerated parallel genetic algorithm for the traveling salesman problem. Journal of Soft Computing and Data Mining 5(2):137−150 doi: 10.30880/jscdm.2024.05.02.010

    [33] Ruan Y, Cai W, Wang J. 2024. Combining reinforcement learning algorithm and genetic algorithm to solve the traveling salesman problem. The Journal of Engineering 2024(6):e12393 doi: 10.1049/tje2.12393

    [34] Watkins CJCH, Dayan P. 1992. Q-learning. Machine Learning 8(3):279−292 doi: 10.1007/BF00992698

    [35] Mnih V, Kavukcuoglu K, Silver D, Rusu AA, Veness J, et al. 2015. Human-level control through deep reinforcement learning. Nature 518:529−533 doi: 10.1038/nature14236

    [36] Mnih V, Kavukcuoglu K, Silver D, Graves A, Antonoglou I, et al. 2013. Playing atari with deep reinforcement learning. arXiv 1312.5602 doi: 10.48550/arXiv.1312.5602

    [37] Van Hasselt H, Guez A, Silver D. 2016. Deep reinforcement learning with double Q-learning. Proceedings of the AAAI Conference on Artificial Intelligence 30(1):2094−2100 doi: 10.1609/aaai.v30i1.10295

    [38] Hasselt H. 2010. Double Q-learning. Advances in Neural Information Processing Systems 23 (NIPS 2010). pp. 1−9 https://proceedings.neurips.cc/paper/2010/hash/091d584fced301b442654dd8c23b3fc9-Abstract.html
    [39] Schulman J, Moritz P, Levine S, Jordan M, Abbeel P. 2015. High-dimensional continuous control using generalized advantage estimation. arXiv 1506.02438 doi: 10.48550/arXiv.1506.02438

    [40] Sutton RS, McAllester D, Singh S, Mansour Y. 1999. Policy gradient methods for reinforcement learning with function approximation. Proceedings of the 13th International Conference on Neural Information Processing Systems, 29 November 1999, Denver, CO. ACM. pp. 1057−1063 https://proceedings.neurips.cc/paper_files/paper/1999/hash/464d828b85b0bed98e80ade0a5c43b0f-Abstract.html
    [41] Weng L. 2018. Exploration strategies in deep reinforcement learning. https://lilianweng.github.io/posts/2020-06-07-exploration-drl/ (Accessed 16 May 2024)
    [42] Achiam J. 2018. Spinning up in deep reinforcement learning. https://spinningup.openai.com/en/latest/algorithms/sac.html (Accessed 16 May 2024)
    [43] Williams RJ. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8(3):229−256 doi: 10.1007/BF00992696

    [44] Mnih V, Badia AP, Mirza M, Graves A, Lillicrap T, et al. 2016. Asynchronous methods for deep reinforcement learning. International Conference on Machine Learning, 20–22 June 2016, New York, USA. vol. 48. PMLR. pp. 1928–1937 https://proceedings.mlr.press/v48/mniha16.html
    [45] Kirkpatrick J, Pascanu R, Rabinowitz N, Veness J, Desjardins G, et al. 2017. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences of the United States of America 114(13):3521−3526 doi: 10.1073/pnas.1611835114

    [46] Schulman J, Wolski F, Dhariwal P, Radford A, Klimov O. 2017. Proximal policy optimization algorithms. arXiv 1707.06347 doi: 10.48550/arXiv.1707.06347

    [47] Haarnoja T, Zhou A, Abbeel P, Levine S. 2018. Soft Actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. Proceedings of the 35th International Conference on Machine Learning, 10–15 July 2018, Stockholmsmässan, Stockholm Sweden. vol. 80. PMLR. pp. 1861–1870 https://proceedings.mlr.press/v80/haarnoja18b
    [48] Haarnoja T, Zhou A, Hartikainen K, Tucker G, Ha S, et al. 2018. Soft actor-critic algorithms and applications. arXiv 1812.05905 doi: 10.48550/arXiv.1812.05905

    [49] Duan J, Wang W, Xiao L, Gao J, Li SE, et al. 2025. Distributional soft actor-critic with three refinements. IEEE Transactions on Pattern Analysis and Machine Intelligence 47(5):3935−3946 doi: 10.1109/TPAMI.2025.3537087

    [50] Bahdanau D, Cho K, Bengio Y. 2014. Neural machine translation by jointly learning to align and translate. arXiv 1409.0473 doi: 10.48550/arXiv.1409.0473

    [51] Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, et al. 2017. Attention is all you need. In Advances in Neural Information Processing Systems. pp. 5998–6008 https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html
    [52] Hochreiter S, Schmidhuber J. 1997. Long short-term memory. Neural Computation 9(8):1735−1780 doi: 10.1162/neco.1997.9.8.1735

    [53] Alammar J. 2018. The illustrated transformer. https://jalammar.github.io/illustrated-transformer (Accessed 16 May 2024)
    [54] Nazari M, Oroojlooy A, Snyder LV, Takáč M. 2018. Reinforcement learning for solving the vehicle routing problem. arXiv 1802.04240 doi: 10.48550/arXiv.1802.04240

    [55] Weng J, Chen H, Yan D, You K, Duburcq A, et al. 2021. Tianshou: a highly modularized deep reinforcement learning library. arXiv 2107.14171 doi: 10.48550/arXiv.2107.14171

    [56] Pinto L, Davidson J, Sukthankar R, Gupta A. 2017. Robust adversarial reinforcement learning. Proceedings of the 34th International Conference on Machine Learning. August 6−11, 2017, Sydney, NSW, Australia. vol. 70. PMLR. pp. 2817−2826 https://proceedings.mlr.press/v70/pinto17a.html
    [57] Liessner R, Schmitt J, Dietermann A, Bäker B. 2019. Hyperparameter optimization for deep reinforcement learning in vehicle energy management. Proceedings of the 11th International Conference on Agents and Artificial Intelligence. February 19−21, 2019. Prague, Czech Republic. Portugal: SciTePress. pp. 134−144 doi: 10.5220/0007364701340144



    • This paper compares the performance of a variety of RL algorithms on the Traveling Salesman Problem (TSP) and the Orienteering Problem (OP), and examines their generalization capabilities with regard to problem instances of untrained sizes. In the following, the two core concepts, combinatorial optimization problems and RL solution algorithms with the capability to generalize, are introduced and motivated. The contributions of this paper, as well as its structure, are summarized in the section "Contributions".

    • Combinatorial optimization problems are mathematical optimization tasks that involve selecting a subset from a finite set of discrete elements such that a specific metric is maximized or minimized under given constraints. A famous example is the knapsack problem: the goal is to select a subset of items, each with a size and a value, such that a knapsack of limited capacity is filled with maximal total value. Combinatorial optimization problems like this have significant implications for real-life use cases. Solutions to the knapsack problem, for example, can be used to improve transportation by maximizing the value of deliveries with limited capacity, to improve inventory management by deciding which items to keep in a warehouse of fixed size, or to determine which data streams to prefer in limited-bandwidth applications. This paper focuses on two combinatorial optimization problems, the Traveling Salesman Problem (TSP) and the Orienteering Problem (OP), which are introduced in detail in subsequent sections.
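
      To make the knapsack example concrete, the following minimal dynamic-programming sketch computes the maximal achievable value for integer item sizes (an illustrative sketch only, not part of the paper's experiments; all names are chosen here for exposition):

      # Illustrative 0/1 knapsack solver via dynamic programming (hypothetical helper,
      # not taken from the paper). Sizes and capacity are assumed to be integers.
      def knapsack(values, sizes, capacity):
          best = [0] * (capacity + 1)  # best[c] = maximal value achievable with capacity c
          for value, size in zip(values, sizes):
              # iterate capacities downwards so that each item is used at most once
              for c in range(capacity, size - 1, -1):
                  best[c] = max(best[c], best[c - size] + value)
          return best[capacity]

      # Example: three items with values 6, 10, 12 and sizes 1, 2, 3; capacity 5
      print(knapsack([6, 10, 12], [1, 2, 3], 5))  # -> 22 (select the last two items)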

    • The TSP is traditionally a graph combinatorial optimization problem, where a fully-connected graph with edge weights and n nodes is considered. The objective of the TSP is to find a minimum Hamiltonian cycle, i.e., a closed loop in the graph that visits each node exactly once and returns to the initial node with the smallest possible cumulative edge weight. A subset of all TSP instances can also be represented as metric problems, where node coordinates are given instead of a graph, and all edge weights are inferred from the metric distances between the coordinates. Popular optimal solvers for the TSP are Concorde (see Applegate et al.[1]) and the general-purpose solver Gurobi[2]. The TSP is NP-hard[3], which means optimal solvers may have exponential time complexity and, in general, do not scale well with the problem size.

      For approaching larger TSP instances, several heuristics have been developed[4], but they may only achieve satisfactory results on specific distributions of problem instances, cf.[5]. In the real world, the TSP has applications in delivery planning, for example, where better solutions can lead to significant benefits in time efficiency and environmental sustainability. Examples of practical TSP variants are given in [6−10]. A simple construction heuristic is sketched below.
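
      As an illustration, the nearest-neighbor construction analyzed by Rosenkrantz et al.[4] can be sketched for metric instances as follows (a minimal sketch assuming 2D node coordinates; not the implementation used in the experiments):

      import math

      def nearest_neighbor_tour(coords):
          # Greedy construction: always move to the closest unvisited node.
          n = len(coords)
          unvisited = set(range(1, n))
          tour = [0]
          while unvisited:
              last = coords[tour[-1]]
              nxt = min(unvisited, key=lambda i: math.dist(last, coords[i]))
              tour.append(nxt)
              unvisited.remove(nxt)
          return tour

      def tour_length(coords, tour):
          # Cumulative Euclidean length of the closed loop (returns to the start node)
          return sum(math.dist(coords[tour[i]], coords[tour[(i + 1) % len(tour)]])
                     for i in range(len(tour)))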

    • The second combinatorial optimization problem covered by this article is the OP[11]. Like the TSP, the OP is originally a graph combinatorial optimization problem based on fully-connected graphs with edge weights. In addition, OP instances incorporate a depot node, a range value, and node values that can be considered prizes. The objective in the OP is to find a tour that maximizes the cumulative prizes of the visited nodes while starting and ending at the depot node, with cumulative edge weights smaller than or equal to the given range value. As for the TSP, metric problem definitions exist for the OP. The OP is also referred to as the Generalized Traveling Salesman Problem[12], and is proven to be NP-hard as it contains the TSP as a particular case[11]. The Gurobi optimization suite and the revisited branch-and-cut (RB&C) solver, cf.[13], are examples of algorithms that can solve OP instances optimally. The Compass algorithm, cf.[14], is an example of a heuristic approximator for solutions to the OP. As the name implies, the OP can be used in the real world to optimize orienteering in an unknown environment. Exploration like this could be carried out by humans within a limited time, representing the range value in this case. It could also be performed by a drone, for example, with a range limited by the battery size.

      Calculating exact solutions to combinatorial optimization problems is often challenging because many of these problems are NP-hard, i.e., algorithms that solve them in polynomial time are not known. Consequently, all known optimal solvers have super-polynomial (typically exponential) worst-case time complexity, making them less applicable to large problem instances. Because of this, sizeable combinatorial optimization problem instances are usually approached with complex, expert-crafted heuristics and approximations. In recent years, RL has gained popularity and promises to become a powerful alternative or extension to expert-crafted solutions, especially due to the possibility of applying trained agents to untrained problem sizes.
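
      For reference, the OP objective can be stated compactly. With notation chosen here for illustration (binary indicators $ y_i $ marking visited nodes, prizes $ p_i $, edge weights $ d_{ij} $, and range value $ T_{max} $), a tour $ \tau $ starting and ending at the depot is sought such that:

      $ \max \sum_{i} p_i\, y_i \quad \text{subject to} \quad \sum_{(i,j) \in \tau} d_{ij} \le T_{max} $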

    • RL is a field of machine learning concerned with training a machine learning model (cf. agent) to take actions in a Markov decision process (cf. environment)[15] based on observations of the environment's state to maximize the expected, cumulative discounted rewards. The idea of RL has been around since the 1950s, but received a surge in popularity in 2013 when researchers at DeepMind demonstrated an algorithm that could learn to play Atari games from scratch. This algorithm was called Deep Q-Learning (DQN) and will also be covered in this paper. Since then, various RL algorithms and extensions have been developed for all kinds of applications.

      In this paper, RL algorithms are applied and compared on the combinatorial optimization problems TSP and OP. In the future, breakthroughs in this direction could drastically reduce carbon emissions by optimizing transportation chains, for example. As an alternative to RL, supervised learning requires expensive optimal solutions for learning to approximate combinatorial optimization problems and would effectively learn to replicate the optimal solver used to generate the training data. In addition, there might be multiple optimal solutions to a single problem instance, which the model would only become aware of if it were trained on many of them. RL is a promising approach as it only requires a suitable reward function and some training data. Over the training process, the model incrementally builds better solutions by itself, and has the capability to provide out-of-the-box solutions for unseen problem instances. Downsides of RL are the opaque selection of algorithms and their extensions, and the more complicated tuning compared to supervised learning. On the other hand, the sequential solution generation of RL algorithms entails unique generalization capabilities with respect to problem sizes.

    • Most RL research presents only a single RL approach without detailed information on why that particular algorithm was chosen. Additionally, even though the generalization capabilities of RL for combinatorial optimization problems have been mentioned in other sources, the topic has not been covered in detail. The main goal of this paper is to compare multiple RL algorithms on the TSP and OP, and to examine their generalization capabilities while giving possible reasons for the observable differences.

      The main contributions of this paper are the following.

      (1) The performance of state-of-the-art RL algorithms is studied and compared when applied to the TSP and the OP, focusing on algorithms with the capability to generalize, i.e., that are able to provide fast out-of-the-box solutions for unseen problem instances after pretraining.

      (2) The hyperparameters of selected RL algorithms are optimized, and their performance on the TSP and OP is compared with existing RL combinatorial approximators as well as with optimal solver-based solutions (for single tractable problem sizes).

      (3) Further, the quality of their generalization capabilities regarding the problem size, i.e., when applied to larger problem sizes than seen in training, is analyzed.

    • As optimal solver-based approaches for the TSP do not scale well for larger problem instances, heuristic solutions are required. In this context, e.g., divide-and-conquer methods have been used to solve the TSP in a decoupled fashion[16,17]. Along these lines, more recent approaches cluster nodes within a close region using, for example, grid-based methods[18,19], density clustering[20], or k-means clustering[21].

      Another stream of research uses further ML techniques to approach the TSP[5,22,23]. Various papers propose heuristics for the TSP using, for example, specific model architectures[24,25], learning paradigms[26−28], graph networks[29−31], and genetic algorithms[32,33].

      Most of the works mentioned above are limited to small-scale TSP with, at most, 100 nodes. Although the generalization capabilities of some models make them applicable to instances of larger scale, their performance drops significantly with an increasing number of nodes.

      While many different approaches exist to (approximately) solve small or extremely large TSP problems for a fixed instance, algorithms that generalize and, after pretraining, allow different problem instances to be solved heuristically have been studied much less.

      The paper closest to the present work is the study by Kool et al.[27]. The authors use a vanilla REINFORCE RL algorithm with their greedy rollout baseline. They show that pretrained models can generalize and heuristically solve TSP problem instances with altered input. The studied problem sizes range from 20 to 100 nodes. The quality of their solutions has been compared against previous work by Bello et al.[26] and Dai et al.[30], as summarized in Table 1.

      Table 1.  TSP performance comparison adapted from Kool et al.[27], displaying optimality gaps (Gap), average solution lengths (Len), and runtimes (Time) of different approximate heuristic approaches against optimal LP solutions using Gurobi (for n = 20, 50, and 100 nodes).

      Approach               |  n = 20              |  n = 50              |  n = 100
                             |  Gap    Len    Time  |  Gap    Len    Time  |  Gap    Len    Time
      Gurobi (LP, optimal)   |  0.00%  3.84   7 s   |  0.00%  5.70   2 m   |  0.00%  7.76   17 m
      Bello et al. (greedy)  |  1.30%  3.89   –     |  4.39%  5.95   –     |  6.96%  8.30   –
      Dai et al.             |  1.30%  3.89   –     |  5.09%  5.99   –     |  7.09%  8.31   –
      Kool et al. (greedy)   |  0.52%  3.86   0 s   |  1.75%  5.80   2 s   |  4.51%  8.11   6 s
      Random walk            |  273%   10.48  –     |  456%   25.99  –     |  671%   52.06  –

      While Kool et al.[27] experiment with different baselines and apply their solutions to a few additional combinatorial optimization problems, the article lacks a comparison to other RL approaches. The capability to generalize to larger problems than seen in training is mentioned in Supplementary File 1, but not studied in detail.

      RL algorithms and suitable ML architectures are promising tools for combinatorial optimization problems. However, in this context, the relative performance of different RL algorithms is unclear. The goal of the present study is to close that gap. In particular, the case in which a generalizing model is used to address problems larger than those it has been trained on is less well researched.

    • This section introduces some of the most popular RL algorithms, which will be considered in the present experiments. In Section "Evaluation and performance comparison", these RL algorithms will be tuned, evaluated, and compared on the combinatorial optimization problems TSP and OP.

    • Q-Learning, cf.[34], is used for decision processes with potentially infinite horizons and finite action and state spaces, but imperfect information regarding state transition probabilities and reward functions. The algorithm learns the expected future rewards of all state-action pairs (Q-values) to compensate for the imperfect information. This way, a policy can be defined by taking the actions with the largest Q-values. The decision process is explored, similarly to approximate dynamic programming, by taking actions based on the current iteration of the policy with $ \epsilon $-greedy decisions. The Q-values are often initialized randomly and updated gradually with each step using the learning rate α $ \in $ (0, 1), as the missing transition probabilities prevent the explicit calculation of expected values; the standard update rule is shown below.
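
      For a transition from state $ s_t $ via action $ a_t $ to state $ s_{t+1} $ with reward $ r_{t+1} $ and discount factor $ \gamma $, the classical Q-Learning update[34] reads:

      $ Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right] $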

      Q-Learning can only be applied to single instances of the TSP or OP because every instance has a finite horizon, finite action and state spaces, and all state transition probabilities and corresponding rewards are known. However, the state spaces of these instances have factorial complexity, and the algorithms above might fail due to memory limitations, especially for larger problem sizes.

      In addition, Q-Learning cannot produce a general policy that can be applied to any TSP or OP instance of a specific size. The general case of, e.g., 20 nodes has an uncountable state set, as the edge weights or node coordinates are continuous. Deep Q-Learning, cf. Deep Q-Networks (DQN)[35,36], solves both issues by replacing the Q-value table with a regression algorithm, typically a neural network. Several extensions and improvements to the Deep Q-Learning approach have been developed. Double Deep Q-Networks (DDQN)[37], derived from Double Q-Learning, cf.[38], introduce a second neural network and update each network with a bootstrapping value from the other. This reduces overestimation bias, which occurs when overly high initialized Q-values are explored and used for bootstrapping; the biased estimates are then backpropagated, and without DDQN it can take a long time until the network unlearns this bias. The corresponding target calculation is shown below.
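
      In the DDQN update[37], the online network with parameters $ \theta $ selects the next action, while the second (target) network with parameters $ \theta' $ evaluates it, yielding the regression target:

      $ y_t = r_{t+1} + \gamma\, Q_{\theta'}\left(s_{t+1}, \mathrm{arg\,max}_{a'}\, Q_{\theta}(s_{t+1}, a')\right) $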

    • All of the following algorithms include some form of policy optimization. Instead of only learning the expected future return of states and state-action pairs, and implicitly learning a policy, policy optimization methods directly learn a distribution over all actions, explicitly defining a policy that tries to maximize the expected future return.

      Policy Gradient (PG)[39−41] is a family of policy optimization algorithms that reinforce the probabilities of taking actions based on some performance metric. Which kind of metric is used has significant implications for the bias-variance trade-off of the gradient calculation. One of the simplest PG algorithms, referred to as vanilla PG, cf.[42], increases the probabilities of actions proportionally to the total (discounted) reward of their respective trajectories, which results in a high variance of the gradient approximations for two reasons: actions are reinforced based on rewards of preceding actions, and, depending on the reward function, probabilities are potentially only ever increased and never explicitly decreased.

      REINFORCE[43] is a popular PG algorithm that improves on the vanilla variant by reinforcing actions based on their realized (Monte Carlo) reward-to-go. More advanced PG algorithms subtract a state-dependent baseline value from the reward-to-go, isolating the advantage score of the actions and reducing the gradient approximation variance by removing the expected return of a state from the reinforcing term.
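
      With reward-to-go $ G_t $ and an optional state-dependent baseline $ b(s_t) $, this family of gradient estimates can be summarized in the standard formulation, cf.[40,43]:

      $ \nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta} \left[ \sum\nolimits_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, (G_t - b(s_t)) \right] $

      Setting $ b(s_t) = 0 $ recovers REINFORCE, while $ b(s_t) \approx V(s_t) $ yields the advantage-based variants discussed next.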

    • Actor-critic algorithms are policy optimization methods like the PG variants. In addition, they include some form of value-learning, as described in the Deep Q-Learning section, for estimating state-values and/or state-action-values. In the case of the Advantage Actor-Critic (A2C) algorithm[44], at least one value learner, called a critic, is implemented to calculate advantage values for each action.

      A popular extension of A2C is A3C, Asynchronous Advantage Actor-Critic, cf.[44]. In this variant, multiple actors and critics are trained in parallel, and actors are synced independently and periodically with global parameters.

    • Catastrophic forgetting, in the form of high training instability, is a common problem of RL algorithms, cf.[45]. Especially policies learned by policy optimization methods can change drastically between single updates, causing insufficient exploration and taking many iterations to recover. Proximal Policy Optimization (PPO), cf.[46], is another actor-critic variant designed to reduce this issue by limiting the changes to the policy in-between updates. This is achieved by either the PPO-Penalty or the PPO-Clip method. PPO-Penalty includes a KL divergence term between the old and the new policy in the objective function and adjusts the penalty coefficient throughout training. PPO-Clip is the method used in this article. It clips the probability ratio between the new and the old policy in the loss function, removing the incentive for the new policy to move far away from the previous, fixed policy.
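
      Concretely, with probability ratio $ r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{old}}(a_t \mid s_t) $, advantage estimate $ \hat{A}_t $, and clipping parameter $ \epsilon $, the PPO-Clip objective[46] is:

      $ L^{CLIP}(\theta) = \mathbb{E}_t \left[ \min\left( r_t(\theta)\, \hat{A}_t,\; \mathrm{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\, \hat{A}_t \right) \right] $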

    • The main feature of the Soft Actor-Critic (SAC) algorithm, cf.[47], is entropy regularization. The entropy of a discrete random variable quantifies the amount of 'surprise' in the variable's possible outcomes; it is largest when all possible outcomes have the same probability. SAC aims to learn a policy that maximizes the expected future return while acting as randomly as possible, i.e., keeping the probability distributions over the actions as flat as possible and, therefore, maintaining high entropy. To do this, an entropy term is included in the Q-learning objective, which gives it the name soft Q-learning. Haarnoja et al.[48] also describe the possibility of learning the entropy temperature parameter α. The idea behind entropy temperature learning is to learn a value that steers the expected entropy over all decisions of the policy towards a desired target. This way, the policy is allowed to have low entropy in some states where specific actions are clearly advantageous. Based on SAC, further variants exist, e.g., distributional SAC[49].
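
      The entropy-regularized objective maximized by SAC[47] augments the expected return with the policy entropy $ \mathcal{H} $, weighted by the temperature parameter α:

      $ J(\pi) = \sum\nolimits_t \mathbb{E}_{(s_t, a_t) \sim \rho_\pi} \left[ r(s_t, a_t) + \alpha\, \mathcal{H}(\pi(\cdot \mid s_t)) \right] $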

    • The approximation of solutions to combinatorial optimization problems benefits from neural network architectures with unique properties like input permutation invariance, and the ability to process instances in graph representation or of varying problem sizes. This section introduces the most important concepts and architectures for approximating combinatorial optimization solutions.

    • Attention in machine learning[50] is a mechanism that enables a neural network to learn how to combine multiple data points in a meaningful way. Attention is widely used to establish global awareness for variable-length sequences or order-invariant inputs. Each attention calculation involves a query vector and n−1 key and value vectors. The compatibility of the query vector with each of the key vectors can be determined using dot product calculations. The softmax operator is applied over all n−1 (optionally scaled and masked) compatibility scores to produce attention weights between 0 and 1. The output vector of this single attention run is a sum of the value vectors, weighted by the corresponding attention weights. Usually, n attention runs are executed with each input once as a query to produce n output vectors, one for each input. The neural network can learn how much each input should attend to each other input by learning the query, key, and value embeddings using preceding transformations like MLPs. In multi-head attention (MHA) layers[51], the n attention runs, including preceding transformations, are executed m times, resulting in m different embeddings for each input, which are then aggregated. This allows the network to learn and combine multiple attention patterns.
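
      Stacking the queries, keys, and values into matrices Q, K, and V with key dimension $ d_k $, this computation corresponds to the scaled dot-product attention of Vaswani et al.[51]:

      $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left( QK^{\top} / \sqrt{d_k} \right) V $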

      Similar to recurrent neural networks (RNNs), transformers[51] consist of an encoder and a decoder, and can produce variable-sized outputs to variable-sized inputs. Instead of implementing traditional RNN architectures, like LSTMs[52], transformers are originally based on attention mechanisms. A possible architecture of a transformer network is depicted in Fig. 1a.

      Figure 1. 

      Transformer architectures. Figure style adapted from Alammar[53]. (a) Structure of a transformer proposed by Vaswani et al.[51] with two representative input vectors on both sides. In practice, any number of inputs can be passed to the encoders and decoders. (b) Structure of Kool, Van Hoof, and Welling's transformer for approximating solutions to the Traveling Salesman Problem.

      The encoder side consists of a small preprocessing unit, followed by a sequence of N architecturally identical encoder layers. The preprocessing unit embeds each input element using a shared, learnable transformation; this could be a multi-layer perceptron (MLP), for example. For sequence tasks, each of the embeddings is extended with a positional encoding before being passed to the sequence of encoder layers. Each layer consists of a multi-head attention mechanism, followed by another shared, learnable transformation. Residual (skip) connections, combined with normalization, prevent loss of per-node information and numerical overflows. The final outputs of the last encoder layer, which are embeddings of the input vectors, are passed to each of the decoders.

      The architecture on the decoder's side is similar to the encoder. In each decoding step, the previous outputs are embedded by a preprocessing unit and optionally receive a positional encoding before being passed into a sequence of architecturally identical decoder layers. In each layer, a multi-head self-attention mechanism is applied to all the output embeddings first, followed by a joint multi-head attention operation of all the input and previous output embeddings. Queries for this 'Encoder-Decoder-Attention' block are only generated from the embeddings of the outputs. If a decoder receives n input embeddings and m output embeddings, its output will be of size m. During the training phase, the whole desired output sequence can be fed into the decoder, which will predict the succeeding item for each position. The 'Encoder-Decoder-Attention' block contains masking operations to prevent the decoder from simply learning to copy the following items from the input by attending to future elements of the sequence. Each decoder layer also includes a shared, learnable transformation. After all layers have finished their calculations, the final embeddings are passed through a linear transformation, and output probabilities can be generated using the softmax function.

    • With the development of Pointer Networks in 2015, Vinyals et al.[25] initiate the research towards metric combinatorial approximators. Bello et al.[26] advance the work of Vinyals et al.[25] by using Pointer Networks with an RL approach to approximate metric TSP solutions. Their RNN encoder-decoder architecture consists of Long Short-Term Memory (LSTM)[52] cells. The encoder sequentially processes the input elements and generates an encoding $ enc_i $ of each input. Similarly, the decoder processes the hidden state of the previous execution as well as the last action to produce an output vector $ dec_j $ in each decoding step. An attention mechanism determines the following action using all $ enc_i $ and the last $ dec_j $. Bello et al.[26] train this Pointer Network using an RL setup related to A3C.

      Nazari et al.[54] point out limitations of the work by Bello et al.[26]: an LSTM encoder, as implemented by Bello et al.[26], produces different encodings depending on the permutation of the input elements. Permutation-invariant processing of the metric problem instances is desirable, as the optimal solutions do not depend on the input permutation, and training on all possible permutations is computationally infeasible.

      As a solution, Nazari et al.[54] propose dropping the encoder and using a special RNN decoder. At each decoding step t, this unit receives the embedding of the previous action $ \bar{s}^a_{t-1} $ and all of the current nodes' features, composed of a fixed component $ s^i $ and a dynamic component $ d^i_t $, as inputs. At each step, $ \bar{s}^a_{t-1} $ is fed into a small LSTM cell that produces a 'memory state' $ h_t $. Additionally, all of the node features are embedded to $ \bar{s}^i $ and $ \bar{d}^i_t $ using a shared, learnable transformation. A series of attention operations produces a probability map for the next actions: the hidden state is used as a query in a single attention operation with all other embeddings as keys, producing attention weights $ a_t $ and an aggregated context consisting of $ s_t^c $ and $ d_t^c $. The attention weights between the context and all input embeddings are calculated, producing the output probability map $ p_t $. Nazari et al.[54] train this RNN decoder using the A3C RL algorithm.

      Kool et al.[27] take advantage of the development of the transformer architecture[51] by modifying it for combinatorial optimization. The encoder unit of their transformer receives all of the node features as input. In contrast to the original architecture, the node embeddings are not extended by a positional encoding, as node permutation-invariance is desired. Instead, they are passed directly into N encoding layers, consisting of MHA operations and MLPs, as described by Vaswani et al.[51]. The encoder outputs n node embeddings $ h_i $ and a graph embedding $ h_{(g)} $, which is computed as the mean of the node embeddings. The decoder includes more adjustments, as it needs to output input-sized instead of fixed-sized probability vectors. Additionally, the decoder must consider problem-specific features for each combinatorial optimization problem.

      For the TSP, the decoder input consists of all the encoder node embeddings $ h_i $, the encoder graph embedding $ h_{(g)} $, and the embeddings of the first and last node ($ h_f $, $ h_l $) of the partial solution. First, a context node is created by concatenating $ h_{(g)} $, $ h_f $, and $ h_l $. This node contains all the information relevant to the current partial solution. The context node $ h_{(c)} $ and the encoder node embeddings $ h_i $ are passed to a multi-head attention mechanism, but only a single query from the context node is computed per head. The results are aggregated to produce a new context node embedding $ h'_{(c)} $. A final single attention query from $ h'_{(c)} $ to all the encoder node embeddings $ h_i $ is executed, resulting in a probability map p for the next action. Nodes that have already been visited are masked in these attention mechanisms.

      The complete architecture is depicted in Fig. 1b. Kool et al.[27] use this transformer to learn a stochastic policy optimized with the vanilla (full-trajectory) Policy Gradient algorithm and several baseline options. Using this setup, Kool et al.[27] approximate metric versions of the Traveling Salesman Problem (TSP), the Orienteering Problem (OP), the Vehicle Routing Problem (VRP), and the Prize-Collecting TSP (PCTSP).

      Finally, Table 1 compares optimality gaps, average solution lengths, and runtimes, of related approximators[26,27,30] on TSP instances with n = 20, 50, and 100 nodes.

    • In this section, the present research setup and the optimization of the RL algorithms to be compared are described. Then, the solution quality of the RL algorithms on the TSP and the OP is compared.

    • First, how the TSP and OP are modelled as MDPs to be solved via RL algorithms is described. Then, the RL algorithms to be compared are selected, and their hyperparameter optimization for fair comparisons is discussed. The implementations and RL frameworks used in the present study are described in the section "Setup and implementation".

    • This section describes how the selected RL algorithms are applied to the TSP and OP. The concepts and computations behind the TSP and OP are fully known, which means that complete models of the problems are available and there are no non-observable parameters. Embedded as a Markov Decision Process (MDP), the state transitions are deterministic given the selected actions, and reward functions are defined by edge weights, or can be calculated as distances between the nodes.

      The state set (cf. state space) of a single TSP or OP instance (i.e., with a fixed number of nodes) is countable, but the general problem of solving any instance of n nodes has an uncountable state set (continuous potential coordinates for all nodes). To create efficient and general approximators for the TSP and OP, table-based learning approaches are not considered. All algorithms taken into consideration use neural networks for regression.

      Further, the action sets for combinatorial optimization problems are all countable. This poses some restrictions on possible RL algorithms. Algorithmic adjustments are also necessary for methods like SAC, for example, as it is originally not designed to work with discrete, countable actions.

      Moreover, the TSP and the OP have finite horizons, as there is a limited number of nodes per problem instance. Infinite horizon solutions can still be applied. Additionally, combinatorial optimization problems are naturally step-based and operate in discrete time; a minimal environment sketch illustrating this MDP view follows below.
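
      The following is a minimal sketch of such an episodic TSP environment (illustrative only; class and method names are chosen here and do not reproduce the refactored implementation described later):

      import numpy as np

      class TSPEnv:
          # Minimal illustrative MDP for the metric TSP. State: node coordinates,
          # visited mask, and current node; action: index of the next node to visit.
          def __init__(self, n_nodes=20, seed=None):
              self.n = n_nodes
              self.rng = np.random.default_rng(seed)

          def reset(self):
              self.coords = self.rng.random((self.n, 2))  # nodes in the unit square
              self.visited = np.zeros(self.n, dtype=bool)
              self.visited[0] = True
              self.current = 0
              return self._obs()

          def _obs(self):
              return {"coords": self.coords, "visited": self.visited.copy(),
                      "current": self.current}

          def step(self, action):
              assert not self.visited[action], "each node may be visited only once"
              # deterministic transition; reward is the negative traveled distance
              reward = -np.linalg.norm(self.coords[self.current] - self.coords[action])
              self.visited[action] = True
              self.current = action
              done = bool(self.visited.all())
              if done:  # close the loop by returning to the start node
                  reward -= np.linalg.norm(self.coords[action] - self.coords[0])
              return self._obs(), reward, done, {}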

    • For the comparisons in this paper, variants of the algorithms introduced previously are used. DQN is chosen to represent the value-learning algorithms. Additionally, Q-values learned with a discount factor of 1 accurately reflect the optimal solution length to-go, i.e., the reward function is well-defined. The DQN agent uses epsilon-greedy exploration. The algorithm's deterministic nature can benefit optimality, as stochastic policies do not have benefits for TSP and OP besides exploration. The DQN agent implemented is a Double DQN variant, which reduces the overestimation bias of the target values for faster learning.

      The PG algorithm by Tianshou[55] is a Monte Carlo reward-to-go PG implementation. It is included in the analysis for comparison with the vanilla PG variant with baselines of Kool et al.[27]. In theory, it has less variance in the gradient approximations than the vanilla version without baselines. The algorithm includes exploration through its stochastic policy.

      The SAC algorithm is selected as a representative of entropy-regularized RL algorithms. It learns off-policy and can have advantages in execution time, as fewer samples need to be generated. Analyzing this is not a focus of this paper, and, as previously mentioned, custom implementations should be used instead of customizable frameworks for more detailed comparisons. The benefit of entropy regularization for combinatorial optimization problems lies solely in the exploration during training; after training, surprise and unpredictability in actions are less desirable than optimal performance.

      A2C and PPO are included in the experiments as successors and alternatives to the classic PG variants. These algorithms use Generalized Advantage Estimation (GAE, Schulman et al.[39]) in the Tianshou implementation and should have much less variance in the gradient approximations than Monte Carlo PG methods. Still, there might be some bias introduced by the advantage approximations. PPO reduces incentives for drastic policy changes by objective function clipping. Therefore, it might be more stable and less hyperparameter-sensitive than the other algorithms.
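
      For reference, GAE[39] forms the advantage estimate as an exponentially weighted sum of temporal-difference errors $ \delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t) $, where the parameter $ \lambda $ trades off bias against variance:

      $ \hat{A}^{GAE(\gamma,\lambda)}_t = \sum\nolimits_{l=0}^{\infty} (\gamma \lambda)^l\, \delta_{t+l} $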

    • For a fair comparison, the five selected RL algorithms have to be optimized for the TSP regarding their individual hyperparameters. Note that the corresponding details are described and analyzed in the respective parts of the Supplementary File 1.

      All algorithms are optimized on the 20-node TSP first and evaluated on TSP and OP in later sections of this paper. The mean length of optimal TSP solutions of 20-node metric instances (determined via solver, see additional reference values in Table 1) placed in a unit square is 3.84, and serves as a lower bound regarding the achievable mean length. The mean length of random TSP solutions of 20-node instances is 10.43, which serves as a second performance baseline.

      Unless otherwise specified, every data point in the performance graphs of this paper is a mean solution over 1,280 test problem instances, achieved through deterministic evaluations of the policies.

    • The implementation of the present experiments is based on the methods of Kool et al.[27] (GitHub repository: https://github.com/wouterkool/attention-learn-to-route).

      Kool et al.[27] achieve state-of-the-art results for RL combinatorial optimization using a Transformer architecture and a full-trajectory PG approach. Their article lacks information on why they selected PG as their RL algorithm. Recent advancements in RL claim improvements compared to PG methods in many aspects such as exploration, optimality, stability, and convergence rate. Kool et al.[27] also provide evaluations on multiple combinatorial optimization problems and some results on generalization to larger problem instances in the Supplementary File 1, which yields many factors for comparison.

      The solution of Kool et al.[27] does not use any RL framework and includes custom calculations for the PG loss. Their model is designed to receive a batch of problem instances, which are solved completely over multiple steps within the model. After a batch of problem instances is solved, the model returns the sum of realized rewards and the sum of log probabilities for each episode, which are used to calculate the loss. By this original design, all of the partial solutions in a batch of instances have the same progress and all start entirely unsolved. Additionally, the encoder only needs to be executed once per problem instance instead of once per transition.

      With the transition to a modular setup with Tianshou, the model is adjusted not to solve full episodes of problem instances, but to predict the subsequent best actions to a batch of partially-solved instances of varying progress. These changes require the execution of the encoder for every transition during training. In addition to the changes to the RL policy model, the code for the problem environments is refactored to comply with the OpenAI Gym API supported by Tianshou. All occurrences of environment observations are adjusted to handle the changed data formats correctly.

      For actor-critic algorithms, the baselines implemented by Kool et al.[27] cannot be used, as they produce reference values for whole trajectories. For the transition-based setup with Tianshou, two new critics are developed, which generate value estimations for partially solved problem instances. The v3 estimator uses an architecture almost identical to the refactored policy model, in theory making it powerful and promising precise value estimations. The v1 critic mainly consists of a preprocessing unit, an encoder, and a small MLP, making it a more lightweight option compared to the v3 estimator. In the preprocessing unit, node features are extended by information that the policy model (and the v3 encoder) adds in the decoder part.

      Besides its 2D coordinates, after preprocessing, each node of the v1 critic also contains 0/1-encoded dimensions indicating whether it has already been visited and whether it is the first or last node of the partial solution. These node extensions are problem-specific, and detailed investigations of critic architectures could be the subject of future work. All of the implementation adjustments enable high modularization for detailed comparisons of different RL algorithms.

    • This section compares the best configurations of each RL algorithm. Each agent is trained over 300 episodes on the 20-node TSP, and exponential learning rate decay is used for all algorithms except SAC. The decay factor is set to 0.9999 for the Tianshou agents and 0.99 for the algorithm of Kool et al.[27], which results in a very similar learning rate decay when adjusted for the update frequencies that differ by a factor of 100 (0.9999^100 ≈ 0.99). With these decay factors, the respective learning rates decrease by a factor of 20 throughout the training (0.99^300 ≈ 0.05).

      The training graphs of all algorithms are visualized in Fig. 2. These optimized algorithms are characterized by fast convergence and high stability, especially in the second half of training.

      Figure 2. 

      Training graphs of the optimized agents for the TSP. (a) Performance throughout the training of the seven optimized agents on TSP. (b) Magnified performance throughout the training of the seven optimized agents on TSP.

      In these experiments, the PG algorithm without a baseline of Kool et al.[27] performs worst and achieves a minimum average solution length of approximately 4.20. The main reason for this is likely the high variance in the gradient approximations, as Kool et al.[27] use the vanilla PG calculation with full-trajectory realized rewards for reinforcing actions. This is the highest-variance variant because no baselines are used, and actions are reinforced based on previous steps in the trajectory. One way to reduce this variance in the gradient estimations is to implement the reward-to-go PG variant, as is done in Tianshou. In the experiments, this reward-to-go PG implementation of Tianshou outperforms the vanilla PG variant of Kool et al.[27]. It reaches average solution lengths of approximately 4.08, while showing much more stable performance developments. Still, this PG variant is outperformed by algorithms like A2C and PPO.

      The optimized A2C and PPO agents achieve peak performance values of approximately 3.94. As a reminder, A2C and PPO can also be considered PG algorithms with an advanced gradient formula. Instead of using Monte Carlo estimates of the expected value of a state-action pair, A2C and PPO implement Generalized Advantage Estimation, which can reduce the variance of the PG approximations. In these experiments, this variance reduction seems to occur, as A2C and PPO achieve significantly better results than the Monte Carlo PG variants without baselines. The bias introduced by the advantage estimates does not appear to be pivotal. In this final comparison, both A2C and PPO are characterized by high stability throughout the training. The additional objective function clipping of PPO has no significant effect in these final graphs, but reduced hyperparameter sensitivity, as outlined in Supplementary File 1 (Part 2.5).

      The DQN implementation achieves a similar peak performance average of approximately 3.93. The fast convergence of this algorithm is likely possible through the short trajectory lengths and the addition of the second neural network for Double Deep Q-Learning, which reduces the overestimation bias at the start of training. At the 12 million sample mark, the agent suffers a momentary, drastic performance loss, which it recovers from in approximately one million additional training samples. Performance drops like this can be caused by the deterministic policies implied by the DQN algorithm, as tiny changes in the action values can cause drastically different decision trajectories by the agent.

      The similar performances of DQN, A2C, and PPO demonstrate that PG methods are not always superior to value-learning. PG algorithms cannot show their advantages in learning continuous actions when approximating solutions to the TSP, as all actions are discrete. Furthermore, there is no necessity for stochastic policies when approaching the TSP because there are no ambiguous observations of the environment.

      The best training results of these experiments are achieved by SAC and the PG implementation by Kool et al.[27] with their rollout baseline. These algorithms reach an average solution length of approximately 3.87 in this short training time. The SAC agent uses entropy temperature learning, which is why it improves slightly slower than the other algorithms at the beginning of training.

      SAC learns Q-values of the current policy in an off-policy way. This variant of Q-value learning likely suffers less from overestimation bias than regular DQN, as no maximization operation is used. Any remaining overestimation bias is reduced in SAC by using two Q-value estimators, similar to Double DQN. Both of these details are likely the basis for the algorithm's fast learning and high performance. Additionally, exploration by entropy regularization might be another foundation of the algorithm's success, as in the TSP the actual values of some actions can be very close. With other exploration strategies, too much attention can be put on the maximum Q-value. Entropy regularization prevents the algorithm from committing too early to a promising strategy. Figure 3 displays some exemplary solutions of this best-performing (SAC) agent on the TSP.

      Figure 3. Exemplary solutions of the best-performing agent (SAC) on the 20-node TSP. Optimal solutions obtained with Gurobi are overlaid in red.
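      Returning to the mechanics described above, the soft target used by a discrete-action SAC variant can be sketched as follows; the twin target critics and the temperature $\alpha$ correspond to the details discussed, while the function and network names are ours.

```python
import torch

def discrete_sac_target(reward, next_obs, done, policy, q1_target, q2_target,
                        alpha: float, gamma: float = 0.99):
    """Discrete SAC: no max operation is needed, since the next-state value
    is an expectation under the current policy; the minimum over two target
    critics curbs residual overestimation bias (similar to Double DQN), and
    the -alpha * log pi term rewards keeping close-valued actions in play
    instead of committing early to one of them."""
    with torch.no_grad():
        probs = torch.softmax(policy(next_obs), dim=1)   # pi(a | s')
        log_probs = torch.log(probs + 1e-8)
        min_q = torch.min(q1_target(next_obs), q2_target(next_obs))
        soft_value = (probs * (min_q - alpha * log_probs)).sum(dim=1)
        return reward + gamma * (1.0 - done) * soft_value
```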

      The second very successful algorithm in these experiments is the PG implementation by Kool et al.[27] with the rollout baseline. The foundation of this method's success must be the rollout baseline, as this is the only difference between this algorithm and the baseline-free PG variant that performs worst.

      The vanilla PG estimation using full-trajectory Monte Carlo returns is characterized by high variance. Subtracting the rollout baseline from the Monte Carlo returns appears to reduce this variance significantly when training on the TSP. One reason is that the reinforcing term, i.e., the difference between the Monte Carlo return and the rollout baseline, has an expected value close to zero and a low variance. This is the case because the trajectories taken by the trained policy and the rollout policy are very similar: the state-values cancel out in expectation, and what remains are the full-trajectory advantage-values. Each action is reinforced by these full-trajectory advantage-values, which introduce some variance into the gradient estimations. This variance depends on the length of the trajectories, and those are relatively short for the TSP with a problem size of 20. This factor, together with the other benefits presented, likely makes the vanilla PG variant with the rollout baseline so successful on the TSP. The relatively high instability in the training graphs is likely caused by the relatively high learning rates and by the divergence of the trained policy and the rollout policy between rollout updates.
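      Written out (with notation slightly adapted), the gradient estimator with the rollout baseline reads:

$$ \nabla_\theta J(\theta) \;\approx\; \big(L(\pi \mid s) - b(s)\big)\, \nabla_\theta \log p_\theta(\pi \mid s), $$

      where $L(\pi \mid s)$ is the length of the sampled tour $\pi$ for instance $s$, and $b(s)$ is the length of the tour produced by a greedy rollout of the baseline policy. When the two policies are similar, the reinforcing term $L(\pi \mid s) - b(s)$ is close to zero in expectation, which is exactly the variance-reduction effect described above.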

      Remark 1: Within the PG family, the algorithms perform according to their gradient estimation variance: the vanilla PG variant has the highest variance and the worst performance, followed by the reward-to-go PG implementation and the GAE-based PG methods (A2C, PPO). The vanilla PG algorithm by Kool et al.[27] with their rollout baseline achieves the best results among the PG methods, likely because of the short trajectories in the TSP and OP and the fact that the rollout baseline explains a large part of the variance of the realized trajectory returns. The value-learners DQN and SAC demonstrate some of the best results among all presented algorithms; they likely benefit from fast value propagation due to the short trajectories in the TSP, and from the bias reduction of their double-critic architectures.

    • To further compare the performance of the algorithms, all of the agents are also trained on the OP. The results of these experiments are shown in Fig. 4.

      Figure 4. Training graphs of the optimized agents for the OP. (a) Performance throughout the training of the seven optimized agents on the OP. (b) Magnified performance throughout the training of the seven optimized agents on the OP.

      As opposed to the TSP, where the goal is to minimize the length of a Hamiltonian cycle, the objective in the OP is to maximize the prizes collected from different nodes within a limited tour length. All of the algorithms show continuous performance increments throughout the training and achieve significantly better results than a random policy (for the 20-node OP, the mean random performance is 1.58 and the mean optimal performance is 5.39[27]). For reference, the performance reported by Kool et al.[27] for their PG with rollout baseline on the 20-node OP is 5.19; this reference value was achieved after significantly longer training than covered by the presented training graphs, and by selectively choosing the model of the best-performing training epoch. Similar to the results on the TSP, the vanilla PG implementation without a baseline by Kool et al.[27] performs worst in this setting as well, followed by the reward-to-go PG algorithm with a peak average score of ≈4.55. The A2C and PPO algorithms have similar training graphs and provide approximately the third- and fourth-best results, with a peak average score of ≈4.85.
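      For clarity, the OP objective contrasted with the TSP above can be stated formally (notation ours: $p_i$ is the prize of node $i$, $y_i \in \{0, 1\}$ indicates whether node $i$ is visited, and $T_{\max}$ is the tour-length budget):

$$ \max_{y} \; \sum_i p_i\, y_i \quad \text{s.t.} \quad \text{length of the tour over } \{\, i : y_i = 1 \,\} \;\le\; T_{\max}. $$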

      The main differences from the training results on the TSP arise for the remaining three algorithms: DQN, SAC, and PG with the rollout baseline. In these experiments on the OP, the PG algorithm with rollout baseline by Kool et al.[27] only performs about as well as A2C and PPO, and it is clearly outperformed by the two value-learners, DQN and SAC. Notably, although it is one of the most fundamental algorithms in this comparison, DQN achieves the best results in this test with a peak average performance of ≈5.10, followed by SAC with a corresponding value of approximately 5.02. One possible explanation for the lower relative performance of the PG method with rollout baseline is that in the OP, the learned policy and the rollout policy can diverge faster, i.e., within fewer updates: not only the order of the visited nodes can differ, but also the number and set of nodes visited. These faster-diverging models could lead to periodically larger variance in the advantage term and, therefore, periodically larger variance in the gradient estimations.

      Remark 2: Most of the explanations presented for the TSP experiments transfer to the OP training results. For example, the two PG algorithms without baselines also perform worst in this setting because of their high gradient approximation variance. The two value-learning methods probably benefit from the even shorter trajectories of the OP, and the disadvantages of $ \epsilon $-greedy exploration are probably also less pronounced due to the smaller state space compared to the TSP. The PG variant with rollout baseline is not quite as successful on the OP as on the TSP relative to the other algorithms; one reason could be the faster-diverging rollout and training policies, leading to periodically larger variance in the gradient estimations, depending on the rollout update frequency.

    • The choice of a transformer architecture for the models presented in this paper allows the algorithms to be trained on different problem sizes without adjusting the neural networks, as they can handle variable input and output sizes. Because of this, the models trained and evaluated on the 20-node TSP and 20-node OP can be used to approximate solutions to instances of arbitrary size, as the sketch below illustrates. This section compares the generalization ability of the trained algorithms with respect to the problem size. For this comparison, TSP instances of 5, 10, 20, 30, 40, 50, and 100 nodes, and OP instances of 20, 30, 40, 50, and 100 nodes are examined. The corresponding plots are shown in Figs 5 and 6, including random policies for comparison.
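      The size-agnostic property follows from the attention mechanism itself, which operates on sets of node embeddings rather than fixed-length vectors. The following minimal sketch illustrates this with a single attention layer; the layer dimensions are illustrative and not those of the trained models.

```python
import torch
import torch.nn as nn

class NodeEncoder(nn.Module):
    """Minimal attention encoder: the same weights accept coordinate
    tensors of shape (batch, n, 2) for any number of nodes n."""
    def __init__(self, dim: int = 128, heads: int = 8):
        super().__init__()
        self.embed = nn.Linear(2, dim)  # (x, y) coordinates -> embedding
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, coords: torch.Tensor) -> torch.Tensor:
        h = self.embed(coords)
        out, _ = self.attn(h, h, h)     # n-by-n self-attention
        return out

encoder = NodeEncoder()
print(encoder(torch.rand(1, 20, 2)).shape)   # trained size: (1, 20, 128)
print(encoder(torch.rand(1, 100, 2)).shape)  # unseen size:  (1, 100, 128)
```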

      Figure 5. Performance on untrained problem sizes of the seven optimized agents on the TSP: generalization results for the TSP (lower is better). Each data point represents the average solution length over 1,000 evaluation problem instances. Lines between data points do not represent actual performance but are included for easier identification of trends. Refer to Table 1 for optimal and random solution lengths; the corresponding lines are not included in the plot in order to show the agents' performances on a narrower axis scale.

      Figure 6. Performance on untrained problem sizes of the seven optimized agents on the OP: generalization results for the OP (higher is better). Each data point represents the average collected prizes over 1,000 evaluation problem instances. Lines between data points do not represent actual performance but are included for easier identification of trends.

      One factor that can affect the generalization to unseen problem sizes is the type of policy learned, i.e., whether the policy is directly inferred from a value-estimator or defined by a policy network. Q-values, for example, depend strongly on the problem size of the TSP and the OP: with more nodes, the remaining tour lengths, and hence the value estimates, change in scale. It is therefore unlikely that a value-estimator learns a function that remains valid across many problem sizes, especially if it is trained on instances of only one specific size. Policy networks, on the other hand, are not required to learn exact state-action values and can instead learn a general strategy or 'intuition' for which node to visit next.

      The described disadvantage of value-based solutions for problem-size generalization can be observed in Figs 5 and 6. While the DQN agent can compete with algorithms like A2C and PPO after training and evaluation on the 20-node TSP, it becomes the worst-performing algorithm besides the baseline-free PG of Kool et al.[27] when evaluated on the 100-node TSP (Fig. 5). This effect is even more pronounced for the OP in Fig. 6, where DQN suffers significant performance losses relative to the other algorithms. Notably, SAC also uses Q-value estimators during training, but they are only used to train a policy network based on the distributions implied by the Q-values. No Q-values are calculated at evaluation time, and the policy network alone defines the policy. For this reason, SAC does not suffer as much as DQN when evaluated on untrained problem sizes.

      The ability of algorithms to generalize to unseen problem instances is closely related to their robustness, i.e., the ability of an agent to handle changes in the environment between training and evaluation. One method that can improve the robustness of an algorithm is maximum entropy RL, which lets the agent learn to handle disturbances during training. SAC is such a maximum entropy algorithm, and in the experiments of this paper, it shows the best generalization capabilities among the tested methods: in both Figs 5 and 6, the SAC agent has the best average performance on instances of size 100. Exemplary solutions of this SAC agent on the 50-node TSP are visualized in Fig. 7. The ranking of the other algorithms remains largely unchanged when tested on unseen problem sizes.

      Figure 7. Exemplary generalization results of the best-performing agent (SAC), trained on the 20-node TSP and evaluated on 50-node TSP instances. Optimal solutions obtained with Gurobi are overlaid in red.

      Another strategy that can improve the robustness of an RL algorithm is to train it on a set of training environments that generate problem instances of different sizes. To test this hypothesis, five of the agents from Fig. 5 were trained once again. This time, instead of seeing only 20-node problems during training, they were faced with 15-node problems in one half of the instances and 25-node problems in the other half, so that all compared agents saw problems of the same average size (i.e., 20); a minimal version of this sampling scheme is sketched below. Finally, the respective trained agents were evaluated by their average performance on TSP problems of different sizes, i.e., n = 20, 50, and 100.
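      The sketch assumes, as is common for these benchmarks, node coordinates drawn uniformly from the unit square; the function name is ours.

```python
import random

def sample_training_instance(rng: random.Random):
    """Mixed-size training: each instance has 15 or 25 nodes with equal
    probability, so the expected problem size still equals the 20 nodes
    seen by the baseline agents."""
    n = rng.choice([15, 25])
    return [(rng.random(), rng.random()) for _ in range(n)]  # 2D coordinates

rng = random.Random(0)
print(len(sample_training_instance(rng)))  # 15 or 25
```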

      Figure 8 and Table 2 show the results of these experiments, which indicate that most agents trained on multiple problem sizes generalize better. Interestingly, some of the multi-size-trained agents even outperform their counterparts on problems of size 20, although these differences are mostly negligible. Towards larger untrained problem sizes, the generalization advantage of the multi-size-trained agents increases.

      Figure 8. Performance comparison of five of the optimized agents from Fig. 5 (trained on 20-node problems) against the exact same agents trained on a mix of 15-node and 25-node problems: generalization results for the TSP (lower is better). Agents trained on 20-node problems are drawn in lighter colors, while the agents trained on multiple problem sizes are drawn in darker versions of the same colors. Each data point represents the average solution length over 1,000 evaluation problem instances. Lines between data points do not represent actual performance but are included for easier identification of trends.

      Table 2. Detailed scores and ranking of the agents (Alg) from Fig. 8 for problem sizes n = 20, 50, and 100; average solution lengths (Len) are based on the evaluation of 1,000 problem instances. The suffix after each algorithm name indicates the problem size(s) used during training (20 vs 15,25).

      | Rank | Len (n = 20) | Alg (n = 20) | Len (n = 50) | Alg (n = 50) | Len (n = 100) | Alg (n = 100) |
      |------|--------------|--------------|--------------|--------------|---------------|----------------|
      | 1    | 3.898        | SAC 20       | 6.080        | SAC 15,25    | 9.018         | SAC 15,25      |
      | 2    | 3.905        | SAC 15,25    | 6.149        | SAC 20       | 9.282         | SAC 20         |
      | 3    | 3.919        | DQN 15,25    | 6.290        | A2C 15,25    | 9.543         | A2C 15,25      |
      | 4    | 3.923        | A2C 15,25    | 6.293        | PPO 15,25    | 9.551         | PPO 15,25      |
      | 5    | 3.928        | DQN 20       | 6.353        | DQN 15,25    | 9.759         | PPO 20         |
      | 6    | 3.928        | PPO 20       | 6.373        | PPO 20       | 9.832         | DQN 15,25      |
      | 7    | 3.945        | A2C 20       | 6.449        | A2C 20       | 9.972         | A2C 20         |
      | 8    | 3.953        | PPO 15,25    | 6.557        | DQN 20       | 10.393        | PG 20          |
      | 9    | 4.066        | PG 20        | 6.712        | PG 20        | 10.408        | PG 15,25       |
      | 10   | 4.107        | PG 15,25     | 6.767        | PG 15,25     | 10.514        | DQN 20         |

      The main evaluation results are summarized in the following two remarks.

      Remark 3: In the present experiments, drastic relative performance losses of the DQN agent were observed because its policy is defined by its value estimations, and the state-action values in larger problem instances differ significantly from those in smaller ones. The ability of algorithms to generalize to larger problem instances is related to their robustness; SAC likely owes its superior generalization capabilities to the robustness induced by maximum entropy exploration. The other algorithms, which include neither value-learning nor additional robustness features, show a mostly unchanged relative performance under generalization.

      Remark 4: The trained agent requires ≈22 ms per instance of size 50 and ≈45 ms per instance of size 100. Notably, the best-performing (SAC) agent on the TSP outperforms the Gurobi[2] optimal solver in execution time by a factor of 10 for problems of size 50 and by a factor of 20 for problems of size 100 (after the additional cost in training time is overcome), despite not being optimized for execution time, as discussed. At the same time, the average optimality gap is only ≈9% for 50-node instances and ≈19% for 100-node instances. Longer training of the agent, additional measures to improve robustness, and execution-time optimization could amplify the time benefits and narrow the optimality gap.

      Moreover, pretrained RL solutions such as the SAC agent remain applicable for large problem instances where solver-based or exact solution approaches are infeasible: the trained SAC agent requires ≈2 s per instance of size 1,000 and ≈25 s per instance of size 10,000.

    • The main results can be summarized as follows, cf. Remarks 1−4:

      ● The generalizing RL solutions of the vanilla PG by Kool et al.[27] with rollout baseline, as well as DQN and SAC, perform well on the TSP and the OP for smaller problems (e.g., with 20 nodes).

      ● In the present experiments, SAC shows the best generalization capabilities for larger problems while being trained on smaller problem instances.

      ● Compared to computing optimal solutions, a trained SAC agent computes solutions for TSP instances of size 100 about 20× faster, while providing a decent optimality gap of below 20%.

      ● In contrast to exact solution approaches, a trained SAC agent remains directly applicable to larger problems; for instances with, e.g., 1,000 nodes, the runtime is about 2 s.

      ● Further, it was found that agents trained on varied problem sizes tend to generalize significantly better.

    • This paper compares and evaluates the performance of selected RL algorithms for approximating solutions to combinatorial optimization problems. This work does not mark the end of research in this direction, however: future work could expand the range of algorithms selected in this paper, implement more exhaustive hyperparameter optimization, and conduct detailed property-specific analyses. By comparing a wider range of algorithms, the bias-variance tradeoff could be analyzed in detail across all intermediate variants of the PG family, or across DQN variants with further extensions such as Dueling DQN.

      An additional option to improve the robustness of each algorithm would be an adversarial setup as proposed by Pinto et al.[56], where an adversary manipulates the problem sizes towards values on which the agent performs worst. Further, the sensitivity of individual hyperparameters could be quantified more precisely by conducting more extensive experiments, as in the study by Liessner et al.[57], for example. Furthermore, algorithms that generalize well to larger problem instances are desirable for RL on the TSP and OP, because training directly on large problem instances is much more time-consuming than training on small problems and applying the model to larger instances. Therefore, future work could conduct additional experiments on increasing the robustness of RL algorithms, e.g., by training on multiple problem sizes or by developing an adversarial setup.

    • In this work, we selected, optimized, compared, and reasoned about state-of-the-art RL algorithms on the combinatorial optimization problems of the Traveling Salesman Problem and the Orienteering Problem. Out of the suitable options, DQN, PG, SAC, A2C, and PPO were selected to be optimized and compared against the variants by Kool et al.[27]. Finally, the performance of the optimized RL agents on the TSP and OP was compared, including experiments studying the algorithms' capability to provide out-of-the-box solutions for generalized problem instances. Overall, the present results verify the high potential of RL algorithms applied to combinatorial optimization problems. Further, across all experiments, the discrete version of SAC could be identified as the most effective RL algorithm for the TSP and OP. In future work, the results obtained could be extended to make the selection, optimization, comparison, and reasoning on RL techniques applicable to a broader range of combinatorial optimization problems.

      • The authors confirm contribution to the paper as follows: study conception and design: Schröder K, Schlosser R; data collection: Schröder K; analysis and interpretation of results, draft manuscript preparation: Schröder K, Kastius A, Schlosser R. All authors reviewed the results and approved the final version of the manuscript.

      • The authors declare that they have no conflict of interest.

      • Copyright: © 2026 by the author(s). Published by Maximum Academic Press, Fayetteville, GA. This article is an open access article distributed under Creative Commons Attribution License (CC BY 4.0), visit https://creativecommons.org/licenses/by/4.0/.