Northeastern University Khoury College of Computer Sciences, Boston, MA, USA. E-mails: ruiyang@ccs.neu.edu, lieber@ccs.neu.edu
2020 Volume 35
RESEARCH ARTICLE   Open Access    

Learning self-play agents for combinatorial optimization problems

  • Abstract: Recent progress in reinforcement learning (RL) using self-play has shown remarkable performance with several board games (e.g., Chess and Go) and video games (e.g., Atari games and Dota2). It is plausible to hypothesize that RL, starting from zero knowledge, might be able to gradually approach a winning strategy after a certain amount of training. In this paper, we explore neural Monte Carlo Tree Search (neural MCTS), an RL algorithm that has been applied successfully by DeepMind to play Go and Chess at a superhuman level. We try to leverage the computational power of neural MCTS to solve a class of combinatorial optimization problems. Following the idea of Hintikka’s Game-Theoretical Semantics, we propose the Zermelo Gamification to transform specific combinatorial optimization problems into Zermelo games whose winning strategies correspond to the solutions of the original optimization problems. A specially designed neural MCTS algorithm is then introduced to train Zermelo game agents. We use a prototype problem for which the ground-truth policy is efficiently computable to demonstrate that neural MCTS is promising.
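The Zermelo Gamification rests on Zermelo's theorem: in a finite, two-player, perfect-information game without draws, one of the players has a winning strategy, computable in principle by backward induction over the game tree. As a minimal illustrative sketch of that determinacy argument (a toy subtraction game of our own choosing, not the paper's prototype problem):

```python
from functools import lru_cache

# Toy Zermelo game: players alternately remove 1 or 2 tokens from a
# pile of n; whoever takes the last token wins. Backward induction
# decides the winner from every position.
@lru_cache(maxsize=None)
def current_player_wins(n):
    # The player to move wins iff some legal move leaves the opponent
    # in a losing position; with no tokens left, the mover has lost.
    return any(not current_player_wins(n - k) for k in (1, 2) if k <= n)
```

The winning strategy itself falls out by recording which move realizes the `any`. Neural MCTS plays the role of this exhaustive recursion for games whose trees are far too large to enumerate.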
    Anthony, T., Tian, Z. & Barber, D. 2017. Thinking fast and slow with deep learning and tree search. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS '17, 5366–5376.

    Auer, P., Cesa-Bianchi, N. & Fischer, P. 2002. Finite-time analysis of the multiarmed bandit problem. Machine Learning 47(2), 235–256.

    Auger, D., Couetoux, A. & Teytaud, O. 2013. Continuous upper confidence trees with polynomial exploration - consistency. In ECML/PKDD (1), Lecture Notes in Computer Science 8188, 194–209. Springer.

    Battaglia, P. W., Hamrick, J. B., Bapst, V., Sanchez-Gonzalez, A., Zambaldi, V., Malinowski, M., Tacchetti, A., Raposo, D., Santoro, A., Faulkner, R., Gulcehre, C., Song, F., Ballard, A., Gilmer, J., Dahl, G., Vaswani, A., Allen, K., Nash, C., Langston, V., Dyer, C., Heess, N., Wierstra, D., Kohli, P., Botvinick, M., Vinyals, O., Li, Y. & Pascanu, R. 2018. Relational inductive biases, deep learning, and graph networks. arXiv preprint.

    Bello, I., Pham, H., Le, Q. V., Norouzi, M. & Bengio, S. 2016. Neural combinatorial optimization with reinforcement learning. arXiv preprint.

    Bjornsson, Y. & Finnsson, H. 2009. CadiaPlayer: a simulation-based general game player. IEEE Transactions on Computational Intelligence and AI in Games 1(1), 4–15.

    Browne, C., Powley, E. J., Whitehouse, D., Lucas, S. M., Cowling, P. I., Rohlfshagen, P., Tavener, S., Liebana, D. P., Samothrakis, S. & Colton, S. 2012. A survey of Monte Carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in Games 4(1), 1–43.

    Fujikawa, Y. & Min, M. 2013. A new environment for algorithm research using gamification. In IEEE International Conference on Electro-Information Technology, EIT 2013, Rapid City, SD, 1–6.

    Genesereth, M., Love, N. & Pell, B. 2005. General game playing: overview of the AAAI competition. AI Magazine 26(2), 62.

    Gilmer, J., Schoenholz, S. S., Riley, P. F., Vinyals, O. & Dahl, G. E. 2017. Neural message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine Learning 70, 1263–1272. JMLR.org.

    Hintikka, J. 1982. Game-theoretical semantics: insights and prospects. Notre Dame Journal of Formal Logic 23(2), 219–241.

    Khalil, E., Dai, H., Zhang, Y., Dilkina, B. & Song, L. 2017. Learning combinatorial optimization algorithms over graphs. In Advances in Neural Information Processing Systems, 6348–6358.

    Kocsis, L. & Szepesvári, C. 2006. Bandit based Monte-Carlo planning. In Proceedings of the 17th European Conference on Machine Learning, ECML '06, 282–293. Springer-Verlag.

    Laterre, A., Fu, Y., Jabri, M. K., Cohen, A.-S., Kas, D., Hajjar, K., Dahl, T. S., Kerkeni, A. & Beguir, K. 2018. Ranked reward: enabling self-play reinforcement learning for combinatorial optimization. arXiv preprint.

    Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S. & Hassabis, D. 2015. Human-level control through deep reinforcement learning. Nature 518(7540), 529–533.

    Novikov, F. & Katsman, V. 2018. Gamification of problem solving process based on logical rules. In Informatics in Schools. Fundamentals of Computer Science and Software Engineering, Pozdniakov, S. N. & Dagienė, V. (eds). Springer International Publishing, 369–380.

    Racanière, S., Weber, T., Reichert, D. P., Buesing, L., Guez, A., Rezende, D., Badia, A. P., Vinyals, O., Heess, N., Li, Y., Pascanu, R., Battaglia, P., Hassabis, D., Silver, D. & Wierstra, D. 2017. Imagination-augmented agents for deep reinforcement learning. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS '17, 5694–5705. Curran Associates Inc.

    Rezende, M. & Chaimowicz, L. 2017. A methodology for creating generic game playing agents for board games. In 2017 16th Brazilian Symposium on Computer Games and Digital Entertainment (SBGames), 19–28. IEEE.

    Selsam, D., Lamm, M., Bünz, B., Liang, P., de Moura, L. & Dill, D. L. 2018. Learning a SAT solver from single-bit supervision. arXiv preprint.

    Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Graepel, T., Lillicrap, T. P., Simonyan, K. & Hassabis, D. 2017a. Mastering Chess and Shogi by self-play with a general reinforcement learning algorithm. CoRR.

    Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre, L., Kumaran, D., Graepel, T., Lillicrap, T., Simonyan, K. & Hassabis, D. 2018. A general reinforcement learning algorithm that masters Chess, Shogi, and Go through self-play. Science 362(6419), 1140–1144.

    Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., Chen, Y., Lillicrap, T., Hui, F., Sifre, L., van den Driessche, G., Graepel, T. & Hassabis, D. 2017b. Mastering the game of Go without human knowledge. Nature 550, 354.

    Sniedovich, M. 2003. OR/MS games: 4. The joy of egg-dropping in Braunschweig and Hong Kong. INFORMS Transactions on Education 4(1), 48–64.

    Vinyals, O., Fortunato, M. & Jaitly, N. 2015. Pointer networks. In Advances in Neural Information Processing Systems 28, Cortes, C., Lawrence, N. D., Lee, D. D., Sugiyama, M. & Garnett, R. (eds). Curran Associates, Inc., 2692–2700.

    Williams, R. J. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8, 229–256.

  • Cite this article

    Ruiyang Xu, Karl Lieberherr. 2020. Learning self-play agents for combinatorial optimization problems. The Knowledge Engineering Review 35(1), doi: 10.1017/S026988892000020X



    • Our implementation is based on an open-source, lightweight framework, AlphaZero General: https://github.com/suragnair/alpha-zero-general.
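For readers who want to plug a new Zermelo game into that framework, the main integration point is a subclass of its `Game` base class. The method names below reflect the repository's interface as we understand it at the time of writing and may change; the one-pile game is a toy stand-in of our own, not the paper's prototype problem:

```python
# Sketch of the Game interface that alpha-zero-general expects a new
# problem to implement (method names per the repo's Game base class;
# treat them as assumptions, not a stable API).
class OnePileGame:
    """Toy stand-in: a pile of n tokens, remove 1 or 2, last take wins."""

    def __init__(self, n=10):
        self.n = n

    def getInitBoard(self):
        return self.n                      # state: tokens remaining

    def getActionSize(self):
        return 2                           # action 0: take 1, action 1: take 2

    def getNextState(self, board, player, action):
        return board - (action + 1), -player

    def getValidMoves(self, board, player):
        return [1 if k + 1 <= board else 0 for k in range(2)]

    def getGameEnded(self, board, player):
        # The previous player took the last token, so `player` has lost.
        return -1 if board == 0 else 0

    def getCanonicalForm(self, board, player):
        return board                       # state is already player-agnostic

    def stringRepresentation(self, board):
        return str(board)                  # hashable key for the MCTS tables
```

The framework's `Coach` then drives self-play, training, and arena evaluation against such a class; consult the repository's README for the exact contract before relying on these signatures.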

    • Theoretically, the exploratory term should be $\sqrt{\frac{\sum_{a^{\prime}}N(s_{i-1},a^{\prime})}{N(s_{i-1},a)+1}}$; however, AlphaZero used the variant $\frac{\sqrt{\sum_{a^{\prime}} N(s_{i-1},a^{\prime})}}{N(s_{i-1},a)+1}$ without any explanation. We tried both in our implementation, and it turns out that only the AlphaZero one works, while the other one quickly converges to an incorrect strategy.
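The difference between the two terms can be made concrete (the function name and the stripped-down, prior-free form are illustrative; only the two formulas come from the footnote above):

```python
import math

def exploration_terms(counts, a):
    """Return (theoretical, alphazero) exploratory terms for action a.

    counts: visit counts N(s, a') for every action a' at the parent.
    theoretical: sqrt( sum(N) / (N[a] + 1) )  -- the UCT-style form
    alphazero:   sqrt( sum(N) ) / (N[a] + 1)  -- the variant AlphaZero uses
    """
    total = sum(counts)
    theoretical = math.sqrt(total / (counts[a] + 1))
    alphazero = math.sqrt(total) / (counts[a] + 1)
    return theoretical, alphazero
```

For an unvisited action (N(s, a) = 0) the two coincide; as N(s, a) grows, the AlphaZero term decays like 1/N rather than 1/sqrt(N), so heavily visited actions lose their exploration bonus much faster.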

    • In our paper, we call the Verifier the Proponent (P) and the Falsifier the Opponent (OP).

    • The experiments reported in this paper use TensorFlow 1.15 with Graphics Processing Unit (GPU) support for training the neural networks. To test reproducibility, we also ran the experiments on TensorFlow 2.0, which did not yet have GPU support in our setup. To our surprise, our algorithms stopped converging to the winning strategy on TensorFlow 2.0. We note this reproducibility problem for researchers who plan to build on our work. We are still trying to find its root cause, which is most likely an implementation bug in TensorFlow 2.0.

    • © Cambridge University Press, 2020