Secondmind, Cambridge, CB2 1LA, UK. E-mails: sergiovalmac@gmail.com, davies.ian.r@gmail.com, aleksi.tukiainen@gmail.com, enrique@people-ai.com
RESEARCH ARTICLE   Open Access    

Fully distributed actor-critic architecture for multitask deep reinforcement learning

Abstract: We propose a fully distributed actor-critic architecture, named diffusion-distributed-actor-critic (Diff-DAC), with application to multitask reinforcement learning (MRL). During the learning process, agents communicate their value and policy parameters to their neighbours, diffusing the information across a network of agents with no need for a central station. Each agent can only access data from its local task, but aims to learn a common policy that performs well for the whole set of tasks. The architecture is scalable, since the computational and communication cost per agent depends on the number of neighbours rather than the overall number of agents. We derive Diff-DAC from duality theory and provide novel insights into the actor-critic framework, showing that it is actually an instance of the dual-ascent method. We prove almost sure convergence of Diff-DAC to a common policy under general assumptions that hold even for deep neural network approximations. Under more restrictive assumptions, we also prove that this common policy is a stationary point of an approximation of the original problem. Numerical results on multitask extensions of common continuous control benchmarks demonstrate that Diff-DAC stabilises learning and has a regularising effect that induces higher performance and better generalisation than previous architectures.
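The abstract's claim that actor-critic instantiates dual ascent can be made concrete with the textbook iteration below (cf. Arrow et al. 1958; Boyd & Vandenberghe 2004). This is only orientation for the reader, not the paper's actual derivation; $f$, $g$, $L$ and $\alpha_k$ are generic placeholders rather than the paper's notation.

```latex
% Generic constrained problem:  min_x f(x)  subject to  g(x) = 0,
% with Lagrangian  L(x, v) = f(x) + v^T g(x).
% Dual ascent alternates an inner primal optimisation with a gradient
% ascent step on the multipliers; loosely, the inner optimisation plays
% the role of the critic and the multiplier update that of the actor.
\begin{align}
  x_{k+1} &\in \operatorname*{arg\,min}_{x} \; L(x, v_k), \\
  v_{k+1} &= v_k + \alpha_k \, g(x_{k+1}).
\end{align}
```

The neighbour-only communication pattern behind the scalability claim can likewise be sketched in a few lines. The following is a minimal NumPy illustration of a diffusion ("adapt-then-combine") round, assuming a row-stochastic combination matrix; the names (`diffusion_round`, `theta`, `C`) are hypothetical and not taken from the paper's implementation.

```python
import numpy as np

def diffusion_round(theta, grads, C, step_size=1e-3):
    """One diffusion (adapt-then-combine) round over a network of agents.

    theta: (N, d) array; row i holds agent i's actor/critic parameters.
    grads: (N, d) array; row i is agent i's gradient on its local task.
    C:     (N, N) row-stochastic combination matrix; C[i, j] > 0 only
           if j is a neighbour of i (or j == i).
    """
    # Adapt: each agent takes a gradient step using only local-task data.
    intermediate = theta + step_size * grads
    # Combine: each agent averages the intermediate parameters over its
    # neighbourhood only, so its cost depends on its degree, not on N.
    return C @ intermediate

# Usage on a ring of 4 agents, each communicating with two neighbours.
C = np.array([[0.50, 0.25, 0.00, 0.25],
              [0.25, 0.50, 0.25, 0.00],
              [0.00, 0.25, 0.50, 0.25],
              [0.25, 0.00, 0.25, 0.50]])
theta = np.zeros((4, 3))       # 3 parameters per agent, for illustration
grads = np.random.randn(4, 3)  # stand-ins for local policy/value gradients
theta = diffusion_round(theta, grads, C)
```

Because row i of C is zero outside agent i's neighbourhood, each agent's computation and communication cost grows with its number of neighbours rather than with the total number of agents, which is the abstract's scalability argument.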
References (62)

Andreas, J., Klein, D. & Levine, S. 2017. Modular multitask reinforcement learning with policy sketches. In Proceedings of the International Conference on Machine Learning (ICML), 166–175.
Arrow, K. J., Hurwicz, L. & Uzawa, H. 1958. Studies in Linear and Non-linear Programming. Stanford University Press.
Assran, M., Romoff, J., Ballas, N., Pineau, J. & Rabbat, M. 2019. Gossip-based actor-learner architectures for deep reinforcement learning. In Advances in Neural Information Processing Systems (NIPS), 13320–13330.
Baird III, L. C. 1993. Advantage Updating. Technical report, Wright Lab, Wright-Patterson AFB, OH.
Bertsekas, D. P. 2009. Convex Optimization Theory. Athena Scientific.
Bertsekas, D. P. 2012. Dynamic Programming and Optimal Control, 4th edition, volume 2. Athena Scientific.
Bhatnagar, S., Sutton, R. S., Ghavamzadeh, M. & Lee, M. 2009. Natural actor-critic algorithms. Automatica 45(11), 2471–2482.
Bianchi, P. & Jakubowicz, J. 2013. Convergence of a multi-agent projected stochastic gradient algorithm for non-convex optimization. IEEE Transactions on Automatic Control 58(2), 391–405.
Borkar, V. S. 2008. Stochastic Approximation: A Dynamical Systems Viewpoint. Cambridge University Press.
Borkar, V. S. 1997. Stochastic approximation with two time scales. Systems and Control Letters 29(5), 291–294.
Borkar, V. S. & Meyn, S. 1999. The O.D.E. method for convergence of stochastic approximation and reinforcement learning. SIAM Journal on Control and Optimization 38, 447–469.
Bou-Ammar, H., Eaton, E., Ruvolo, P. & Taylor, M. 2014. Online multi-task learning for policy gradient methods. In Proceedings of the International Conference on Machine Learning (ICML), 1206–1214.
Boyd, S. & Vandenberghe, L. 2004. Convex Optimization. Cambridge University Press.
Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J. & Zaremba, W. 2016. OpenAI Gym. arXiv preprint.
Chen, J. & Sayed, A. H. 2013. Distributed Pareto optimization via diffusion strategies. IEEE Journal of Selected Topics in Signal Processing 7(2), 205–220.
El Bsat, S., Bou-Ammar, H. & Taylor, M. E. 2017. Scalable multitask policy gradient reinforcement learning. In AAAI Conference on Artificial Intelligence (AAAI), 1847–1853.
Espeholt, L., Soyer, H., Munos, R., Simonyan, K., Mnih, V., Ward, T., Doron, Y., Firoiu, V., Harley, T., Dunning, I. et al. 2018. IMPALA: scalable distributed deep-RL with importance weighted actor-learner architectures. In Proceedings of the International Conference on Machine Learning (ICML).
Fu, J., Levine, S. & Abbeel, P. 2016. One-shot learning of manipulation skills with online dynamics adaptation and neural network priors. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 4019–4026.
Golub, G. & Van Loan, C. 1996. Matrix Computations. Johns Hopkins University Press.
Grondman, I., Busoniu, L., Lopes, G. A. D. & Babuska, R. 2012. A survey of actor-critic reinforcement learning: standard and natural policy gradients. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 42(6), 1291–1307.
Heess, N., Wayne, G., Silver, D., Lillicrap, T., Erez, T. & Tassa, Y. 2015. Learning continuous control policies by stochastic value gradients. In Advances in Neural Information Processing Systems (NIPS), 2926–2934.
Horn, R. & Johnson, C. 1990. Matrix Analysis. Cambridge University Press.
Kar, S., Moura, J. M. F. & Poor, H. V. 2013. QD-learning: a collaborative distributed strategy for multi-agent reinforcement learning through consensus + innovations. IEEE Transactions on Signal Processing 61(7), 1848–1862.
Karp, R. M. 1972. Reducibility among combinatorial problems. In Complexity of Computer Computations. Springer, 85–103.
Kingma, D. & Ba, J. L. 2015. Adam: a method for stochastic optimization. In Proceedings of the International Conference on Learning Representations (ICLR).
Kober, J. & Peters, J. R. 2009. Policy search for motor primitives in robotics. In Advances in Neural Information Processing Systems (NIPS), 849–856.
Konda, V. R. & Tsitsiklis, J. N. 2003. On actor-critic algorithms. SIAM Journal on Control and Optimization 42(4), 1143–1166.
Lakshminarayanan, C. & Bhatnagar, S. 2017. A stability criterion for two timescale stochastic approximation schemes. Automatica 79, 108–114.
Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D. & Wierstra, D. 2015. Continuous control with deep reinforcement learning. arXiv preprint.
Melo, F. S. & Lopes, M. 2008. Fitted natural actor-critic: a new algorithm for continuous state-action MDPs. In Machine Learning and Knowledge Discovery in Databases, 5212. Springer, 66–81.
Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D. & Kavukcuoglu, K. 2016. Asynchronous methods for deep reinforcement learning. In Proceedings of the International Conference on Machine Learning (ICML), 1928–1937.
Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D. & Riedmiller, M. 2013. Playing Atari with deep reinforcement learning. arXiv preprint.
Ng, A. Y., Parr, R. & Koller, D. 1999. Policy search via density estimation. In Advances in Neural Information Processing Systems (NIPS), Denver, CO, USA, 1022–1028.
Packer, C., Gao, K., Kos, J., Krähenbühl, P., Koltun, V. & Song, D. 2018. Assessing generalization in deep reinforcement learning. arXiv preprint.
Parisotto, E., Ba, J. L. & Salakhutdinov, R. 2016. Actor-mimic: deep multitask and transfer reinforcement learning. In Proceedings of the International Conference on Learning Representations (ICLR).
Powell, W. B. & Ma, J. 2011. A review of stochastic algorithms with continuous value function approximation and some new approximate policy iteration algorithms for multidimensional continuous applications. Journal of Control Theory and Applications 9(3), 336–352.
Puterman, M. L. 2005. Markov Decision Processes: Discrete Stochastic Dynamic Programming, 2nd edition. John Wiley & Sons.
Ramaswamy, A. & Bhatnagar, S. 2017. A generalization of the Borkar-Meyn theorem for stochastic recursive inclusions. Mathematics of Operations Research 42(3), 648–661.
Sayed, A. H. 2014. Adaptation, learning, and optimization over networks. Foundations and Trends in Machine Learning 7(4–5), 311–801.
Scherrer, B. 2010. Should one compute the temporal difference fix point or minimize the Bellman residual? The unified oblique projection view. In Proceedings of the International Conference on Machine Learning (ICML), 959–966.
Schulman, J., Moritz, P., Levine, S., Jordan, M. & Abbeel, P. 2015. High-dimensional continuous control using generalized advantage estimation. arXiv preprint.
Sutton, R. S., Maei, H. R., Precup, D., Bhatnagar, S., Silver, D., Szepesvari, C. & Wiewiora, E. 2009. Fast gradient-descent methods for temporal-difference learning with linear function approximation. In Proceedings of the International Conference on Machine Learning (ICML), 993–1000.
Sutton, R. S., McAllester, D., Singh, S. & Mansour, Y. 1999. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems (NIPS), 1057–1063.
Tadic, V. B. 2004. Almost sure convergence of two time-scale stochastic approximation algorithms. In IEEE American Control Conference, 4, 3802–3807.
Taylor, M. E. & Stone, P. 2009. Transfer learning for reinforcement learning domains: a survey. Journal of Machine Learning Research 10, 1633–1685.
Teh, Y. W., Bapst, V., Czarnecki, W. M., Quan, J., Kirkpatrick, J., Hadsell, R., Heess, N. & Pascanu, R. 2017. Distral: robust multitask reinforcement learning. arXiv preprint.
Tieleman, T. & Hinton, G. 2012. Lecture 6.5 - RMSProp: divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning.
Tomczak, M. B., Valcarcel Macua, S., de Cote, E. M. & Vrancx, P. 2019. Compatible features for monotonic policy improvement. arXiv preprint.
Tutunov, R., Bou-Ammar, H. & Jadbabaie, A. 2016. An exact distributed Newton method for reinforcement learning. In IEEE Conference on Decision and Control (CDC), 1003–1008.
Valcarcel Macua, S. 2017. Distributed Optimization, Control and Learning in Multiagent Networks. PhD thesis, Universidad Politécnica de Madrid.
Valcarcel Macua, S., Chen, J., Zazo, S. & Sayed, A. H. 2015. Distributed policy evaluation under multiple behavior strategies. IEEE Transactions on Automatic Control 60(5), 1260–1274.
Valcarcel Macua, S., Tukiainen, A., Hernández, D. G.-O., Baldazo, D., de Cote, E. M. & Zazo, S. 2017. Diff-DAC: distributed actor-critic for average multitask deep reinforcement learning. arXiv preprint.
van der Meulen, R. 2015. Gartner says 6.4 billion connected ‘things’ will be in use in 2016, up 30 percent from 2015. http://www.gartner.com/newsroom/id/3165317.
Van Hasselt, H. 2012. Reinforcement learning in continuous state and action spaces. In Reinforcement Learning. Springer, 207–251.
Wei, E. & Ozdaglar, A. 2012. Distributed alternating direction method of multipliers. In IEEE Annual Conference on Decision and Control (CDC), 5445–5450.
Weinstein, A. & Littman, M. L. 2012. Bandit-based planning and learning in continuous-action Markov decision processes. In International Conference on Automated Planning and Scheduling (ICAPS).
Wierstra, D., Schaul, T., Glasmachers, T., Sun, Y., Peters, J. & Schmidhuber, J. 2014. Natural evolution strategies. Journal of Machine Learning Research 15(1), 949–980.
Williams, R. J. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8(3–4), 229–256.
Yaji, V. G. & Bhatnagar, S. 2020. Stochastic recursive inclusions in two timescales with nonadditive iterate-dependent Markov noise. Mathematics of Operations Research.
Zhang, K., Yang, Z., Liu, H., Zhang, T. & Basar, T. 2018. Fully decentralized multi-agent reinforcement learning with networked agents. In Proceedings of the International Conference on Machine Learning (ICML), 5872–5881.
Zhao, X. & Sayed, A. H. 2012. Performance limits for distributed estimation over LMS adaptive networks. IEEE Transactions on Signal Processing 60(10), 5107–5124.
Zhao, X. & Sayed, A. H. 2015. Asynchronous adaptation and learning over networks—Part I: modeling and stability analysis. IEEE Transactions on Signal Processing 63(4), 811–826.

  • Cite this article: Sergio Valcarcel Macua, Ian Davies, Aleksi Tukiainen, Enrique Munoz de Cote. 2021. Fully distributed actor-critic architecture for multitask deep reinforcement learning. The Knowledge Engineering Review 36(1), doi: 10.1017/S0269888921000023


    • We thank David Baldazo, Daniel García-Ocaña Hernández, and Santiago Zazo for insightful preliminary discussions; Felix Leibfried for his support with the experiments; Peter Vrancx, Haitham Bou-Ammar, and Rasul Tutunov for helpful comments; and the anonymous reviewers and the Deputy Editor Patrick Mannion for their comments and suggestions that have helped to improve the presentation of the paper.

    • Conflicts of interest: the authors declare none.

    • This work was done while the authors were affiliated with Secondmind.

    • A preliminary version of this work appeared earlier as Valcarcel Macua et al. (2017).

    • We use boldface font to denote random variables and regular font to denote instances or deterministic variables.

    • See (19) below.

    • For ease of exposition, we assume that each agent is allocated one task, similar to El Bsat et al. (2017). The extension to multiple tasks per agent is trivial.

    • Note $v^{\star}$ is unique, while there might be multiple optimal dual variables.

    • See, for example, Ng et al. (1999), Konda and Tsitsiklis (2003), Melo and Lopes (2008), Bhatnagar et al. (2009), Powell and Ma (2011), Van Hasselt (2012), Weinstein and Littman (2012), Wierstra et al. (2014), Lillicrap et al. (2015), Heess et al. (2015), Schulman et al. (2015).

    • During experimentation we observed similar results across runs with the RMSProp and Adam optimisers.

    • a.s. stands for almost surely.

    • Table A.1 in Appendix C.4 presents the full results in terms of performance relative to specialised agents. Relative performance values are calculated from Table 2.

    • © The Author(s), 2021. Published by Cambridge University Press.