School of Computer Science, Mila - McGill University, Montreal, Quebec
E-mail: arushi.jain@mail.mcgill.ca, khimya.khetarpal@mail.mcgill.ca, dprecup@cs.mcgill.ca
RESEARCH ARTICLE   Open Access    

Safe option-critic: learning safety in the option-critic architecture

  • Abstract: Designing hierarchical reinforcement learning algorithms that exhibit safe behaviour is not only vital for practical applications but also facilitates a better understanding of an agent's decisions. We tackle this problem in the options framework (Sutton, Precup & Singh, 1999), a particular way to specify temporally abstract actions that allows an agent to use sub-policies with start and end conditions. We consider a behaviour safe if it avoids regions of the state space with high uncertainty in the outcomes of actions. We propose an optimization objective that learns safe options by encouraging the agent to visit states with higher behavioural consistency. The proposed objective results in a trade-off between maximizing the standard expected return and minimizing the effect of model uncertainty on the return. We propose a policy gradient algorithm to optimize the constrained objective function. We examine the quantitative and qualitative behaviours of the proposed approach in a tabular grid world, continuous-state puddle world, and three games from the Arcade Learning Environment: Ms. Pacman, Amidar, and Q*Bert. Our approach achieves a reduction in the variance of return, boosts performance in environments with intrinsic variability in the reward structure, and compares favourably both with primitive actions and with risk-neutral options.
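As a rough illustration of the trade-off described in the abstract, the sketch below runs a REINFORCE-style update on a generic variance-penalized objective, J(theta) = E[G] - psi * Var[G], for a two-armed bandit. This is not the paper's safe option-critic algorithm; the penalty weight psi, the bandit, and all constants are assumptions for illustration, in the spirit of the variance-penalized methods cited in the references (Tamar, Di Castro & Mannor, 2012; Jain et al., 2021).

    import numpy as np

    rng = np.random.default_rng(0)
    theta = np.zeros(2)  # preference (logit) for each arm
    psi = 0.5            # variance-penalty weight (assumed for illustration)
    alpha = 0.05         # step size
    J = 0.0              # running estimate of the mean return E[G]

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    for step in range(5000):
        pi = softmax(theta)
        a = rng.choice(2, p=pi)
        # Arm 0 pays more on average but is noisy; arm 1 pays slightly less, reliably.
        g = rng.normal(1.0, 2.0) if a == 0 else rng.normal(0.8, 0.1)
        J += 0.01 * (g - J)
        # REINFORCE-style gradient of E[G] - psi * Var[G], using
        # grad Var[G] = grad E[G^2] - 2 E[G] grad E[G] (Tamar et al., 2012).
        grad_log = -pi
        grad_log[a] += 1.0
        theta += alpha * (g - psi * (g * g - 2.0 * J * g)) * grad_log

    print(softmax(theta))  # probability mass shifts toward the low-variance arm

With psi = 0 the agent prefers the noisy high-mean arm; with the penalty it trades a little expected return for much lower variance, which is the kind of behaviour the paper's objective encourages at the option level.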
References (51)

    Amodei, D., Olah, C., Steinhardt, J., Christiano, P. F., Schulman, J. & Mané, D. 2016. Concrete problems in AI safety. CoRR.
    Bacon, P.-L., Harb, J. & Precup, D. 2017. The option-critic architecture. In AAAI, 1726–1734.
    Barreto, A., Borsa, D., Hou, S., Comanici, G., Aygün, E., Hamel, P., Toyama, D., Mourad, S., Silver, D., Precup, D., et al. 2019. The option keyboard: combining skills in reinforcement learning. In Advances in Neural Information Processing Systems, 13052–13062.
    Barto, A. G. & Mahadevan, S. 2003. Recent advances in hierarchical reinforcement learning. Discrete Event Dynamic Systems 13(4), 341–379.
    Bellemare, M. G., Naddaf, Y., Veness, J. & Bowling, M. 2013. The arcade learning environment: an evaluation platform for general agents. Journal of Artificial Intelligence Research 47, 253–279.
    Borkar, V. S. & Meyn, S. P. 2002. Risk-sensitive optimal control for Markov decision processes with monotone cost. Mathematics of Operations Research 27(1), 192–209.
    Daniel, C., Van Hoof, H., Peters, J. & Neumann, G. 2016. Probabilistic inference for determining options in reinforcement learning. Machine Learning 104(2–3), 337–357.
    Dietterich, T. G. 2000. Hierarchical reinforcement learning with the MAXQ value function decomposition. Journal of Artificial Intelligence Research 13, 227–303.
    Fikes, R. E., Hart, P. E. & Nilsson, N. J. 1972. Learning and executing generalized robot plans. Artificial Intelligence 3, 251–288.
    Fikes, R. E., Hart, P. E. & Nilsson, N. J. 1981. Learning and executing generalized robot plans. In Readings in Artificial Intelligence. Elsevier, 231–249.
    Future of Life Institute 2017. Asilomar AI principles.
    García, J. & Fernández, F. 2015. A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research 16(1), 1437–1480.
    Gehring, C. & Precup, D. 2013. Smart exploration in reinforcement learning using absolute temporal difference errors. In Proceedings of the 2013 International Conference on Autonomous Agents and Multi-agent Systems, AAMAS 2013, 1037–1044.
    Geibel, P. & Wysotzki, F. 2005. Risk-sensitive reinforcement learning applied to control under constraints. Journal of Artificial Intelligence Research 24, 81–108.
    Harb, J., Bacon, P.-L., Klissarov, M. & Precup, D. 2018. When waiting is not an option: learning options with a deliberation cost. In AAAI.
    Heger, M. 1994. Consideration of risk in reinforcement learning. In Machine Learning Proceedings 1994. Elsevier, 105–111.
    Howard, R. A. & Matheson, J. E. 1972. Risk-sensitive Markov decision processes. Management Science 18(7), 356–369.
    Iba, G. A. 1989. A heuristic approach to the discovery of macro-operators. Machine Learning 3(4), 285–317.
    Iyengar, G. N. 2005. Robust dynamic programming. Mathematics of Operations Research 30(2), 257–280.
    Jain, A., Patil, G., Jain, A., Khetarpal, K. & Precup, D. 2021. Variance penalized on-policy and off-policy actor-critic. arXiv preprint arXiv:2102.01985.
    Jain, A. & Precup, D. 2018. Eligibility traces for options. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, 1008–1016.
    Khetarpal, K., Klissarov, M., Chevalier-Boisvert, M., Bacon, P.-L. & Precup, D. 2020. Options of interest: temporal abstraction with interest functions. In Proceedings of the AAAI Conference on Artificial Intelligence, 34, 4444–4451.
    Konidaris, G. & Barto, A. G. 2007. Building portable options: skill transfer in reinforcement learning. In IJCAI, 7, 895–900.
    Konidaris, G., Kuindersma, S., Grupen, R. A. & Barto, A. G. 2011. Autonomous skill acquisition on a mobile manipulator. In AAAI.
    Korf, R. E. 1983. Learning to Solve Problems by Searching for Macro-operators. PhD thesis, Pittsburgh, PA, USA. AAI8425820.
    Kulkarni, T. D., Narasimhan, K., Saeedi, A. & Tenenbaum, J. 2016. Hierarchical deep reinforcement learning: integrating temporal abstraction and intrinsic motivation. In Advances in Neural Information Processing Systems, 3675–3683.
    Law, E. L., Coggan, M., Precup, D. & Ratitch, B. 2005. Risk-directed exploration in reinforcement learning. In Planning and Learning in A Priori Unknown or Dynamic Domains, 97.
    Lim, S. H., Xu, H. & Mannor, S. 2013. Reinforcement learning in robust Markov decision processes. In Advances in Neural Information Processing Systems 26, 701–709.
    Machado, M. C., Bellemare, M. G., Talvitie, E., Veness, J., Hausknecht, M. & Bowling, M. 2017. Revisiting the arcade learning environment: evaluation protocols and open problems for general agents. ArXiv e-prints.
    Mankowitz, D. J., Mann, T. A. & Mannor, S. 2016. Adaptive skills adaptive partitions (ASAP). In Advances in Neural Information Processing Systems, 1588–1596.
    McGovern, A. & Barto, A. G. 2001. Automatic discovery of subgoals in reinforcement learning using diverse density. In ICML, 1, 361–368.
    Menache, I., Mannor, S. & Shimkin, N. 2002. Q-cut - dynamic discovery of sub-goals in reinforcement learning. In European Conference on Machine Learning. Springer, 295–306.
    Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D. & Kavukcuoglu, K. 2016. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, 1928–1937.
    Nair, A., Srinivasan, P., Blackwell, S., Alcicek, C., Fearon, R., Maria, A. D., Panneershelvam, V., Suleyman, M., Beattie, C., Petersen, S., Legg, S., Mnih, V., Kavukcuoglu, K. & Silver, D. 2015. Massively parallel methods for deep reinforcement learning. CoRR.
    Nilim, A. & El Ghaoui, L. 2005. Robust control of Markov decision processes with uncertain transition matrices. Operations Research 53(5), 780–798.
    Parr, R. & Russell, S. J. 1998. Reinforcement learning with hierarchies of machines. In Advances in Neural Information Processing Systems, 1043–1049.
    Precup, D. 2000. Temporal Abstraction in Reinforcement Learning. PhD thesis, University of Massachusetts Amherst.
    Riemer, M., Liu, M. & Tesauro, G. 2018. Learning abstract options. In Advances in Neural Information Processing Systems, 10424–10434.
    Sherstan, C., Ashley, D. R., Bennett, B., Young, K., White, A., White, M. & Sutton, R. S. 2018. Comparing direct and indirect temporal-difference methods for estimating the variance of the return. In Proceedings of Uncertainty in Artificial Intelligence, 63–72.
    Stolle, M. & Precup, D. 2002. Learning options in reinforcement learning. In International Symposium on Abstraction, Reformulation & Approximation. Springer, 212–223.
    Sutton, R. S. 1988. Learning to predict by the methods of temporal differences. Machine Learning 3(1), 9–44.
    Sutton, R. S. & Barto, A. G. 1998. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, USA, 1st edition.
    Sutton, R. S., McAllester, D. A., Singh, S. P. & Mansour, Y. 2000. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, 1057–1063.
    Sutton, R. S., Precup, D. & Singh, S. 1999. Between MDPs and semi-MDPs: a framework for temporal abstraction in reinforcement learning. Artificial Intelligence 112(1–2), 181–211.
    Tamar, A., Di Castro, D. & Mannor, S. 2012. Policy gradients with variance related risk criteria. In Proceedings of the Twenty-Ninth International Conference on Machine Learning, 387–396.
    Tamar, A., Di Castro, D. & Mannor, S. 2016. Learning the variance of the reward-to-go. Journal of Machine Learning Research 17(13), 1–36.
    Tamar, A., Xu, H. & Mannor, S. 2013. Scaling up robust MDPs by reinforcement learning. arXiv preprint arXiv:1306.6189.
    Van Hasselt, H., Guez, A. & Silver, D. 2016. Deep reinforcement learning with double Q-learning. In AAAI, 16, 2094–2100.
    Vezhnevets, A., Mnih, V., Osindero, S., Graves, A., Vinyals, O., Agapiou, J., et al. 2016. Strategic attentive writer for learning macro-actions. In Advances in Neural Information Processing Systems, 3486–3494.
    Wang, Z., de Freitas, N. & Lanctot, M. 2015. Dueling network architectures for deep reinforcement learning. CoRR.
    White, D. 1994. A mathematical programming approach to a problem in variance penalised Markov decision processes. Operations-Research-Spektrum 15(4), 225–230.

  • Cite this article

    Arushi Jain, Khimya Khetarpal, Doina Precup. 2021. Safe option-critic: learning safety in the option-critic architecture. The Knowledge Engineering Review 36(1), doi: 10.1017/S0269888921000035



    • The authors would like to thank Open Philanthropy for funding this work, Compute Canada for the computing resources, Herke van Hoof, Ayush Jain, Pierre-Luc Bacon, Gandharv Patil, Jean Harb, Martin Klissarov, and Kushal Arora for constructive discussions throughout this work, and the anonymous reviewers for feedback on earlier drafts of this manuscript.

    • The authors declare none.

    • The source code is available at https://github.com/arushi12130/SafeOptionCritic and https://github.com/kkhetarpal/safe_a2oc_delib.

    • Videos of trained agents in Atari games are available at https://sites.google.com/view/safe-option-critic.

    • These authors contributed equally to this work.

    • © The Author(s), 2021. Published by Cambridge University Press.