School of Computer Science, Mila - McGill University, Montreal, Quebec
E-mail: arushi.jain@mail.mcgill.ca, khimya.khetarpal@mail.mcgill.ca, dprecup@cs.mcgill.ca
RESEARCH ARTICLE   Open Access    

Safe option-critic: learning safety in the option-critic architecture

  • Abstract: Designing hierarchical reinforcement learning algorithms that exhibit safe behaviour is not only vital for practical applications but also facilitates a better understanding of an agent's decisions. We tackle this problem in the options framework (Sutton, Precup & Singh, 1999), a particular way to specify temporally abstract actions that allows an agent to use sub-policies with start and end conditions. We consider a behaviour safe if it avoids regions of the state space with high uncertainty in the outcomes of actions. We propose an optimization objective that learns safe options by encouraging the agent to visit states with higher behavioural consistency. The proposed objective results in a trade-off between maximizing the standard expected return and minimizing the effect of model uncertainty on the return. We propose a policy gradient algorithm to optimize the constrained objective function. We examine the quantitative and qualitative behaviours of the proposed approach in a tabular grid world, continuous-state puddle world, and three games from the Arcade Learning Environment: Ms. Pacman, Amidar, and Q*Bert. Our approach achieves a reduction in the variance of return, boosts performance in environments with intrinsic variability in the reward structure, and compares favourably both with primitive actions and with risk-neutral options.
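As a rough illustration of the trade-off described in the abstract, the sketch below runs a REINFORCE-style update on a generic variance-penalized objective, J(theta) = E[G] - psi * Var[G], for a two-armed bandit. This is not the paper's safe option-critic algorithm; the penalty weight psi, the bandit, and all constants are assumptions for illustration, in the spirit of the variance-penalized methods cited in the references (Tamar, Di Castro & Mannor, 2012; Jain et al., 2021).

    import numpy as np

    rng = np.random.default_rng(0)
    theta = np.zeros(2)  # preference (logit) for each arm
    psi = 0.5            # variance-penalty weight (assumed for illustration)
    alpha = 0.05         # step size
    J = 0.0              # running estimate of the mean return E[G]

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    for step in range(5000):
        pi = softmax(theta)
        a = rng.choice(2, p=pi)
        # Arm 0 pays more on average but is noisy; arm 1 pays slightly less, reliably.
        g = rng.normal(1.0, 2.0) if a == 0 else rng.normal(0.8, 0.1)
        J += 0.01 * (g - J)
        # REINFORCE-style gradient of E[G] - psi * Var[G], using
        # grad Var[G] = grad E[G^2] - 2 E[G] grad E[G] (Tamar et al., 2012).
        grad_log = -pi
        grad_log[a] += 1.0
        theta += alpha * (g - psi * (g * g - 2.0 * J * g)) * grad_log

    print(softmax(theta))  # probability mass shifts toward the low-variance arm

With psi = 0 the agent prefers the noisy high-mean arm; with the penalty it trades a little expected return for much lower variance, which is the kind of behaviour the paper's objective encourages at the option level.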
References (51)

    Amodei, D., Olah, C., Steinhardt, J., Christiano, P. F., Schulman, J. & Mané, D. 2016. Concrete problems in AI safety. CoRR.
    Bacon, P.-L., Harb, J. & Precup, D. 2017. The option-critic architecture. In AAAI, 1726–1734.
    Barreto, A., Borsa, D., Hou, S., Comanici, G., Aygün, E., Hamel, P., Toyama, D., Mourad, S., Silver, D., Precup, D., et al. 2019. The option keyboard: combining skills in reinforcement learning. In Advances in Neural Information Processing Systems, 13052–13062.
    Barto, A. G. & Mahadevan, S. 2003. Recent advances in hierarchical reinforcement learning. Discrete Event Dynamic Systems 13(4), 341–379.
    Bellemare, M. G., Naddaf, Y., Veness, J. & Bowling, M. 2013. The arcade learning environment: an evaluation platform for general agents. Journal of Artificial Intelligence Research 47, 253–279.
    Borkar, V. S. & Meyn, S. P. 2002. Risk-sensitive optimal control for Markov decision processes with monotone cost. Mathematics of Operations Research 27(1), 192–209.
    Daniel, C., Van Hoof, H., Peters, J. & Neumann, G. 2016. Probabilistic inference for determining options in reinforcement learning. Machine Learning 104(2–3), 337–357.
    Dietterich, T. G. 2000. Hierarchical reinforcement learning with the MAXQ value function decomposition. Journal of Artificial Intelligence Research 13, 227–303.
    Fikes, R. E., Hart, P. E. & Nilsson, N. J. 1972. Learning and executing generalized robot plans. Artificial Intelligence 3, 251–288.
    Fikes, R. E., Hart, P. E. & Nilsson, N. J. 1981. Learning and executing generalized robot plans. In Readings in Artificial Intelligence. Elsevier, 231–249.
    Future of Life Institute 2017. Asilomar AI principles.
    García, J. & Fernández, F. 2015. A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research 16(1), 1437–1480.
    Gehring, C. & Precup, D. 2013. Smart exploration in reinforcement learning using absolute temporal difference errors. In Proceedings of the 2013 International Conference on Autonomous Agents and Multi-agent Systems, AAMAS 2013, 1037–1044.
    Geibel, P. & Wysotzki, F. 2005. Risk-sensitive reinforcement learning applied to control under constraints. Journal of Artificial Intelligence Research 24, 81–108.
    Harb, J., Bacon, P.-L., Klissarov, M. & Precup, D. 2018. When waiting is not an option: learning options with a deliberation cost. In AAAI.
    Heger, M. 1994. Consideration of risk in reinforcement learning. In Machine Learning Proceedings 1994. Elsevier, 105–111.
    Howard, R. A. & Matheson, J. E. 1972. Risk-sensitive Markov decision processes. Management Science 18(7), 356–369.
    Iba, G. A. 1989. A heuristic approach to the discovery of macro-operators. Machine Learning 3(4), 285–317.
    Iyengar, G. N. 2005. Robust dynamic programming. Mathematics of Operations Research 30(2), 257–280.
    Jain, A., Patil, G., Jain, A., Khetarpal, K. & Precup, D. 2021. Variance penalized on-policy and off-policy actor-critic. arXiv preprint arXiv:2102.01985.
    Jain, A. & Precup, D. 2018. Eligibility traces for options. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, 1008–1016.
    Khetarpal, K., Klissarov, M., Chevalier-Boisvert, M., Bacon, P.-L. & Precup, D. 2020. Options of interest: temporal abstraction with interest functions. In Proceedings of the AAAI Conference on Artificial Intelligence, 34, 4444–4451.
    Konidaris, G. & Barto, A. G. 2007. Building portable options: skill transfer in reinforcement learning. In IJCAI, 7, 895–900.
    Konidaris, G., Kuindersma, S., Grupen, R. A. & Barto, A. G. 2011. Autonomous skill acquisition on a mobile manipulator. In AAAI.
    Korf, R. E. 1983. Learning to Solve Problems by Searching for Macro-operators. PhD thesis, Pittsburgh, PA, USA. AAI8425820.
    Kulkarni, T. D., Narasimhan, K., Saeedi, A. & Tenenbaum, J. 2016. Hierarchical deep reinforcement learning: integrating temporal abstraction and intrinsic motivation. In Advances in Neural Information Processing Systems, 3675–3683.
    Law, E. L., Coggan, M., Precup, D. & Ratitch, B. 2005. Risk-directed exploration in reinforcement learning. In Planning and Learning in A Priori Unknown or Dynamic Domains, 97.
    Lim, S. H., Xu, H. & Mannor, S. 2013. Reinforcement learning in robust Markov decision processes. In Advances in Neural Information Processing Systems 26, 701–709.
    Machado, M. C., Bellemare, M. G., Talvitie, E., Veness, J., Hausknecht, M. & Bowling, M. 2017. Revisiting the arcade learning environment: evaluation protocols and open problems for general agents. ArXiv e-prints.
    Mankowitz, D. J., Mann, T. A. & Mannor, S. 2016. Adaptive skills adaptive partitions (ASAP). In Advances in Neural Information Processing Systems, 1588–1596.
    McGovern, A. & Barto, A. G. 2001. Automatic discovery of subgoals in reinforcement learning using diverse density. In ICML, 1, 361–368.
    Menache, I., Mannor, S. & Shimkin, N. 2002. Q-cut - dynamic discovery of sub-goals in reinforcement learning. In European Conference on Machine Learning. Springer, 295–306.
    Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D. & Kavukcuoglu, K. 2016. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, 1928–1937.
    Nair, A., Srinivasan, P., Blackwell, S., Alcicek, C., Fearon, R., Maria, A. D., Panneershelvam, V., Suleyman, M., Beattie, C., Petersen, S., Legg, S., Mnih, V., Kavukcuoglu, K. & Silver, D. 2015. Massively parallel methods for deep reinforcement learning. CoRR.
    Nilim, A. & El Ghaoui, L. 2005. Robust control of Markov decision processes with uncertain transition matrices. Operations Research 53(5), 780–798.
    Parr, R. & Russell, S. J. 1998. Reinforcement learning with hierarchies of machines. In Advances in Neural Information Processing Systems, 1043–1049.
    Precup, D. 2000. Temporal Abstraction in Reinforcement Learning. PhD thesis, University of Massachusetts Amherst.
    Riemer, M., Liu, M. & Tesauro, G. 2018. Learning abstract options. In Advances in Neural Information Processing Systems, 10424–10434.
    Sherstan, C., Ashley, D. R., Bennett, B., Young, K., White, A., White, M. & Sutton, R. S. 2018. Comparing direct and indirect temporal-difference methods for estimating the variance of the return. In Proceedings of Uncertainty in Artificial Intelligence, 63–72.
    Stolle, M. & Precup, D. 2002. Learning options in reinforcement learning. In International Symposium on Abstraction, Reformulation & Approximation. Springer, 212–223.
    Sutton, R. S. 1988. Learning to predict by the methods of temporal differences. Machine Learning 3(1), 9–44.
    Sutton, R. S. & Barto, A. G. 1998. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, USA, 1st edition.
    Sutton, R. S., McAllester, D. A., Singh, S. P. & Mansour, Y. 2000. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, 1057–1063.
    Sutton, R. S., Precup, D. & Singh, S. 1999. Between MDPs and semi-MDPs: a framework for temporal abstraction in reinforcement learning. Artificial Intelligence 112(1–2), 181–211.
    Tamar, A., Di Castro, D. & Mannor, S. 2012. Policy gradients with variance related risk criteria. In Proceedings of the Twenty-Ninth International Conference on Machine Learning, 387–396.
    Tamar, A., Di Castro, D. & Mannor, S. 2016. Learning the variance of the reward-to-go. Journal of Machine Learning Research 17(13), 1–36.
    Tamar, A., Xu, H. & Mannor, S. 2013. Scaling up robust MDPs by reinforcement learning. arXiv preprint arXiv:1306.6189.
    Van Hasselt, H., Guez, A. & Silver, D. 2016. Deep reinforcement learning with double Q-learning. In AAAI, 16, 2094–2100.
    Vezhnevets, A., Mnih, V., Osindero, S., Graves, A., Vinyals, O., Agapiou, J., et al. 2016. Strategic attentive writer for learning macro-actions. In Advances in Neural Information Processing Systems, 3486–3494.
    Wang, Z., de Freitas, N. & Lanctot, M. 2015. Dueling network architectures for deep reinforcement learning. CoRR.
    White, D. 1994. A mathematical programming approach to a problem in variance penalised Markov decision processes. Operations-Research-Spektrum 15(4), 225–230.

  • Cite this article

    Arushi Jain, Khimya Khetarpal, Doina Precup. 2021. Safe option-critic: learning safety in the option-critic architecture. The Knowledge Engineering Review 36(1), doi: 10.1017/S0269888921000035



    • The authors would like to thank Open Philanthropy for funding this work, Compute Canada for the computing resources, Herke van Hoof, Ayush Jain, Pierre-Luc Bacon, Gandharv Patil, Jean Harb, Martin Klissarov, and Kushal Arora for constructive discussions throughout this work, and the anonymous reviewers for feedback on earlier drafts of this manuscript.

    • The authors declare none.

    • The source code is available at https://github.com/arushi12130/SafeOptionCritic and https://github.com/kkhetarpal/safe_a2oc_delib.

    • Videos of trained agents in Atari games are available at https://sites.google.com/view/safe-option-critic.

    • These authors contributed equally to this work.

    • © The Author(s), 2021. Published by Cambridge University Press.