Amodei, D., Olah, C., Steinhardt, J., Christiano, P. F., Schulman, J. & Mané, D. 2016. Concrete problems in AI safety. CoRR.

Bacon, P.-L., Harb, J. & Precup, D. 2017. The option-critic architecture. In AAAI, 1726–1734.

Barreto, A., Borsa, D., Hou, S., Comanici, G., Aygün, E., Hamel, P., Toyama, D., Mourad, S., Silver, D., Precup, D., et al. 2019. The option keyboard: combining skills in reinforcement learning. In Advances in Neural Information Processing Systems, 13052–13062.

Barto, A. G. & Mahadevan, S. 2003. Recent advances in hierarchical reinforcement learning. Discrete Event Dynamic Systems 13(4), 341–379.

Bellemare, M. G., Naddaf, Y., Veness, J. & Bowling, M. 2013. The arcade learning environment: an evaluation platform for general agents. Journal of Artificial Intelligence Research 47, 253–279.

Borkar, V. S. & Meyn, S. P. 2002. Risk-sensitive optimal control for Markov decision processes with monotone cost. Mathematics of Operations Research 27(1), 192–209.

Daniel, C., Van Hoof, H., Peters, J. & Neumann, G. 2016. Probabilistic inference for determining options in reinforcement learning. Machine Learning 104(2–3), 337–357.

Dietterich, T. G. 2000. Hierarchical reinforcement learning with the MAXQ value function decomposition. Journal of Artificial Intelligence Research 13, 227–303.

Fikes, R. E., Hart, P. E. & Nilsson, N. J. 1972. Learning and executing generalized robot plans. Artificial Intelligence 3, 251–288.

Fikes, R. E., Hart, P. E. & Nilsson, N. J. 1981. Learning and executing generalized robot plans. In Readings in Artificial Intelligence. Elsevier, 231–249.

Future of Life Institute 2017. Asilomar AI principles.

García, J. & Fernández, F. 2015. A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research 16(1), 1437–1480.

Gehring, C. & Precup, D. 2013. Smart exploration in reinforcement learning using absolute temporal difference errors. In Proceedings of the 2013 International Conference on Autonomous Agents and Multi-agent Systems, AAMAS 2013, 1037–1044.

Geibel, P. & Wysotzki, F. 2005. Risk-sensitive reinforcement learning applied to control under constraints. Journal of Artificial Intelligence Research 24, 81–108.

Harb, J., Bacon, P.-L., Klissarov, M. & Precup, D. 2018. When waiting is not an option: learning options with a deliberation cost. In AAAI.

Heger, M. 1994. Consideration of risk in reinforcement learning. In Machine Learning Proceedings 1994. Elsevier, 105–111.

Howard, R. A. & Matheson, J. E. 1972. Risk-sensitive Markov decision processes. Management Science 18(7), 356–369.

Iba, G. A. 1989. A heuristic approach to the discovery of macro-operators. Machine Learning 3(4), 285–317.

Iyengar, G. N. 2005. Robust dynamic programming. Mathematics of Operations Research 30(2), 257–280.

Jain, A., Patil, G., Jain, A., Khetarpal, K. & Precup, D. 2021. Variance penalized on-policy and off-policy actor-critic. arXiv preprint arXiv:2102.01985.

Jain, A. & Precup, D. 2018. Eligibility traces for options. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, 1008–1016.

Khetarpal, K., Klissarov, M., Chevalier-Boisvert, M., Bacon, P.-L. & Precup, D. 2020. Options of interest: temporal abstraction with interest functions. In Proceedings of the AAAI Conference on Artificial Intelligence, 34, 4444–4451.

Konidaris, G. & Barto, A. G. 2007. Building portable options: skill transfer in reinforcement learning. In IJCAI, 7, 895–900.

Konidaris, G., Kuindersma, S., Grupen, R. A. & Barto, A. G. 2011. Autonomous skill acquisition on a mobile manipulator. In AAAI.

Korf, R. E. 1983. Learning to Solve Problems by Searching for Macro-operators. PhD thesis, Carnegie Mellon University, Pittsburgh, PA, USA. AAI8425820.

Kulkarni, T. D., Narasimhan, K., Saeedi, A. & Tenenbaum, J. 2016. Hierarchical deep reinforcement learning: integrating temporal abstraction and intrinsic motivation. In Advances in Neural Information Processing Systems, 3675–3683.

Law, E. L., Coggan, M., Precup, D. & Ratitch, B. 2005. Risk-directed exploration in reinforcement learning. In IJCAI Workshop on Planning and Learning in A Priori Unknown or Dynamic Domains, 97.

Lim, S. H., Xu, H. & Mannor, S. 2013. Reinforcement learning in robust Markov decision processes. Advances in Neural Information Processing Systems 26, 701–709.

Machado, M. C., Bellemare, M. G., Talvitie, E., Veness, J., Hausknecht, M. & Bowling, M. 2017. Revisiting the arcade learning environment: evaluation protocols and open problems for general agents. arXiv preprint arXiv:1709.06009.

Mankowitz, D. J., Mann, T. A. & Mannor, S. 2016. Adaptive skills adaptive partitions (ASAP). In Advances in Neural Information Processing Systems, 1588–1596.

McGovern, A. & Barto, A. G. 2001. Automatic discovery of subgoals in reinforcement learning using diverse density. In ICML, 1, 361–368.

Menache, I., Mannor, S. & Shimkin, N. 2002. Q-cut: dynamic discovery of sub-goals in reinforcement learning. In European Conference on Machine Learning. Springer, 295–306.

Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D. & Kavukcuoglu, K. 2016. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, 1928–1937.

Nair, A., Srinivasan, P., Blackwell, S., Alcicek, C., Fearon, R., Maria, A. D., Panneershelvam, V., Suleyman, M., Beattie, C., Petersen, S., Legg, S., Mnih, V., Kavukcuoglu, K. & Silver, D. 2015. Massively parallel methods for deep reinforcement learning. CoRR.

Nilim, A. & El Ghaoui, L. 2005. Robust control of Markov decision processes with uncertain transition matrices. Operations Research 53(5), 780–798.

Parr, R. & Russell, S. J. 1998. Reinforcement learning with hierarchies of machines. In Advances in Neural Information Processing Systems, 1043–1049.

Precup, D. 2000. Temporal Abstraction in Reinforcement Learning. PhD thesis, University of Massachusetts Amherst.

Riemer, M., Liu, M. & Tesauro, G. 2018. Learning abstract options. In Advances in Neural Information Processing Systems, 10424–10434.

Sherstan, C., Ashley, D. R., Bennett, B., Young, K., White, A., White, M. & Sutton, R. S. 2018. Comparing direct and indirect temporal-difference methods for estimating the variance of the return. In Proceedings of Uncertainty in Artificial Intelligence, 63–72.

Stolle, M. & Precup, D. 2002. Learning options in reinforcement learning. In International Symposium on Abstraction, Reformulation & Approximation. Springer, 212–223.

Sutton, R. S. 1988. Learning to predict by the methods of temporal differences. Machine Learning 3(1), 9–44.

Sutton, R. S. & Barto, A. G. 1998. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, USA, 1st edition.

Sutton, R. S., McAllester, D. A., Singh, S. P. & Mansour, Y. 2000. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, 1057–1063.

Sutton, R. S., Precup, D. & Singh, S. 1999. Between MDPs and semi-MDPs: a framework for temporal abstraction in reinforcement learning. Artificial Intelligence 112(1–2), 181–211.

Tamar, A., Di Castro, D. & Mannor, S. 2012. Policy gradients with variance related risk criteria. In Proceedings of the Twenty-Ninth International Conference on Machine Learning, 387–396.

Tamar, A., Di Castro, D. & Mannor, S. 2016. Learning the variance of the reward-to-go. Journal of Machine Learning Research 17(13), 1–36.

Tamar, A., Xu, H. & Mannor, S. 2013. Scaling up robust MDPs by reinforcement learning. arXiv preprint arXiv:1306.6189.

Van Hasselt, H., Guez, A. & Silver, D. 2016. Deep reinforcement learning with double Q-learning. In AAAI, 16, 2094–2100.

Vezhnevets, A., Mnih, V., Osindero, S., Graves, A., Vinyals, O., Agapiou, J., et al. 2016. Strategic attentive writer for learning macro-actions. In Advances in Neural Information Processing Systems, 3486–3494.

Wang, Z., de Freitas, N. & Lanctot, M. 2015. Dueling network architectures for deep reinforcement learning. CoRR.

White, D. 1994. A mathematical programming approach to a problem in variance penalised Markov decision processes. Operations-Research-Spektrum 15(4), 225–230.