Secondmind, Cambridge, CB2 1LA, UK. E-mails: sergiovalmac@gmail.com, davies.ian.r@gmail.com, aleksi.tukiainen@gmail.com, enrique@people-ai.com
RESEARCH ARTICLE   Open Access    

Fully distributed actor-critic architecture for multitask deep reinforcement learning

Abstract: We propose a fully distributed actor-critic architecture, named diffusion-distributed-actor-critic (Diff-DAC), with application to multitask reinforcement learning (MRL). During the learning process, agents communicate their value and policy parameters to their neighbours, diffusing the information across a network of agents with no need for a central station. Each agent can only access data from its local task, but aims to learn a common policy that performs well for the whole set of tasks. The architecture is scalable, since the computational and communication cost per agent depends on the number of neighbours rather than the overall number of agents. We derive Diff-DAC from duality theory and provide novel insights into the actor-critic framework, showing that it is actually an instance of the dual-ascent method. We prove almost sure convergence of Diff-DAC to a common policy under general assumptions that hold even for deep neural network approximations. Under more restrictive assumptions, we also prove that this common policy is a stationary point of an approximation of the original problem. Numerical results on multitask extensions of common continuous control benchmarks demonstrate that Diff-DAC stabilises learning and has a regularising effect that induces higher performance and better generalisation than previous architectures.
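The abstract's claim that actor-critic instantiates dual ascent can be made concrete with the textbook iteration below (cf. Arrow et al. 1958; Boyd & Vandenberghe 2004). This is only orientation for the reader, not the paper's actual derivation; $f$, $g$, $L$ and $\alpha_k$ are generic placeholders rather than the paper's notation.

```latex
% Generic constrained problem:  min_x f(x)  subject to  g(x) = 0,
% with Lagrangian  L(x, v) = f(x) + v^T g(x).
% Dual ascent alternates an inner primal optimisation with a gradient
% ascent step on the multipliers; loosely, the inner optimisation plays
% the role of the critic and the multiplier update that of the actor.
\begin{align}
  x_{k+1} &\in \operatorname*{arg\,min}_{x} \; L(x, v_k), \\
  v_{k+1} &= v_k + \alpha_k \, g(x_{k+1}).
\end{align}
```

The neighbour-only communication pattern behind the scalability claim can likewise be sketched in a few lines. The following is a minimal NumPy illustration of a diffusion ("adapt-then-combine") round, assuming a row-stochastic combination matrix; the names (`diffusion_round`, `theta`, `C`) are hypothetical and not taken from the paper's implementation.

```python
import numpy as np

def diffusion_round(theta, grads, C, step_size=1e-3):
    """One diffusion (adapt-then-combine) round over a network of agents.

    theta: (N, d) array; row i holds agent i's actor/critic parameters.
    grads: (N, d) array; row i is agent i's gradient on its local task.
    C:     (N, N) row-stochastic combination matrix; C[i, j] > 0 only
           if j is a neighbour of i (or j == i).
    """
    # Adapt: each agent takes a gradient step using only local-task data.
    intermediate = theta + step_size * grads
    # Combine: each agent averages the intermediate parameters over its
    # neighbourhood only, so its cost depends on its degree, not on N.
    return C @ intermediate

# Usage on a ring of 4 agents, each communicating with two neighbours.
C = np.array([[0.50, 0.25, 0.00, 0.25],
              [0.25, 0.50, 0.25, 0.00],
              [0.00, 0.25, 0.50, 0.25],
              [0.25, 0.00, 0.25, 0.50]])
theta = np.zeros((4, 3))       # 3 parameters per agent, for illustration
grads = np.random.randn(4, 3)  # stand-ins for local policy/value gradients
theta = diffusion_round(theta, grads, C)
```

Because row i of C is zero outside agent i's neighbourhood, each agent's computation and communication cost grows with its number of neighbours rather than with the total number of agents, which is the abstract's scalability argument.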
References (62)

Andreas, J., Klein, D. & Levine, S. 2017. Modular multitask reinforcement learning with policy sketches. In Proceedings of the International Conference on Machine Learning (ICML), 166–175.
Arrow, K. J., Hurwicz, L. & Uzawa, H. 1958. Studies in Linear and Non-linear Programming. Stanford University Press.
Assran, M., Romoff, J., Ballas, N., Pineau, J. & Rabbat, M. 2019. Gossip-based actor-learner architectures for deep reinforcement learning. In Advances in Neural Information Processing Systems (NIPS), 13320–13330.
Baird III, L. C. 1993. Advantage Updating. Technical report, Wright Lab, Wright-Patterson AFB, OH.
Bertsekas, D. P. 2009. Convex Optimization Theory. Athena Scientific.
Bertsekas, D. P. 2012. Dynamic Programming and Optimal Control, 4th edition, volume 2. Athena Scientific.
Bhatnagar, S., Sutton, R. S., Ghavamzadeh, M. & Lee, M. 2009. Natural actor-critic algorithms. Automatica 45(11), 2471–2482.
Bianchi, P. & Jakubowicz, J. 2013. Convergence of a multi-agent projected stochastic gradient algorithm for non-convex optimization. IEEE Transactions on Automatic Control 58(2), 391–405.
Borkar, V. S. 2008. Stochastic Approximation: A Dynamical Systems Viewpoint. Cambridge University Press.
Borkar, V. S. 1997. Stochastic approximation with two time scales. Systems and Control Letters 29(5), 291–294.
Borkar, V. S. & Meyn, S. 1999. The O.D.E. method for convergence of stochastic approximation and reinforcement learning. SIAM Journal on Control and Optimization 38, 447–469.
Bou-Ammar, H., Eaton, E., Ruvolo, P. & Taylor, M. 2014. Online multi-task learning for policy gradient methods. In Proceedings of the International Conference on Machine Learning (ICML), 1206–1214.
Boyd, S. & Vandenberghe, L. 2004. Convex Optimization. Cambridge University Press.
Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J. & Zaremba, W. 2016. OpenAI Gym. arXiv preprint.
Chen, J. & Sayed, A. H. 2013. Distributed Pareto optimization via diffusion strategies. IEEE Journal of Selected Topics in Signal Processing 7(2), 205–220.
El Bsat, S., Bou-Ammar, H. & Taylor, M. E. 2017. Scalable multitask policy gradient reinforcement learning. In AAAI Conference on Artificial Intelligence (AAAI), 1847–1853.
Espeholt, L., Soyer, H., Munos, R., Simonyan, K., Mnih, V., Ward, T., Doron, Y., Firoiu, V., Harley, T., Dunning, I. et al. 2018. IMPALA: scalable distributed deep-RL with importance weighted actor-learner architectures. In Proceedings of the International Conference on Machine Learning (ICML).
Fu, J., Levine, S. & Abbeel, P. 2016. One-shot learning of manipulation skills with online dynamics adaptation and neural network priors. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 4019–4026.
Golub, G. & Van Loan, C. 1996. Matrix Computations. Johns Hopkins University Press.
Grondman, I., Busoniu, L., Lopes, G. A. D. & Babuska, R. 2012. A survey of actor-critic reinforcement learning: standard and natural policy gradients. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 42(6), 1291–1307.
Heess, N., Wayne, G., Silver, D., Lillicrap, T., Erez, T. & Tassa, Y. 2015. Learning continuous control policies by stochastic value gradients. In Advances in Neural Information Processing Systems (NIPS), 2926–2934.
Horn, R. & Johnson, C. 1990. Matrix Analysis. Cambridge University Press.
Kar, S., Moura, J. M. F. & Poor, H. V. 2013. QD-learning: a collaborative distributed strategy for multi-agent reinforcement learning through consensus + innovations. IEEE Transactions on Signal Processing 61(7), 1848–1862.
Karp, R. M. 1972. Reducibility among combinatorial problems. In Complexity of Computer Computations. Springer, 85–103.
Kingma, D. & Ba, J. L. 2015. Adam: a method for stochastic optimization. In Proceedings of the International Conference on Learning Representations (ICLR).
Kober, J. & Peters, J. R. 2009. Policy search for motor primitives in robotics. In Advances in Neural Information Processing Systems (NIPS), 849–856.
Konda, V. R. & Tsitsiklis, J. N. 2003. On actor-critic algorithms. SIAM Journal on Control and Optimization 42(4), 1143–1166.
Lakshminarayanan, C. & Bhatnagar, S. 2017. A stability criterion for two timescale stochastic approximation schemes. Automatica 79, 108–114.
Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D. & Wierstra, D. 2015. Continuous control with deep reinforcement learning. arXiv preprint.
Melo, F. S. & Lopes, M. 2008. Fitted natural actor-critic: a new algorithm for continuous state-action MDPs. In Machine Learning and Knowledge Discovery in Databases, 5212. Springer, 66–81.
Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D. & Kavukcuoglu, K. 2016. Asynchronous methods for deep reinforcement learning. In Proceedings of the International Conference on Machine Learning (ICML), 1928–1937.
Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D. & Riedmiller, M. 2013. Playing Atari with deep reinforcement learning. arXiv preprint.
Ng, A. Y., Parr, R. & Koller, D. 1999. Policy search via density estimation. In Advances in Neural Information Processing Systems (NIPS), Denver, CO, USA, 1022–1028.
Packer, C., Gao, K., Kos, J., Krähenbühl, P., Koltun, V. & Song, D. 2018. Assessing generalization in deep reinforcement learning. arXiv preprint.
Parisotto, E., Ba, J. L. & Salakhutdinov, R. 2016. Actor-mimic: deep multitask and transfer reinforcement learning. In Proceedings of the International Conference on Learning Representations (ICLR).
Powell, W. B. & Ma, J. 2011. A review of stochastic algorithms with continuous value function approximation and some new approximate policy iteration algorithms for multidimensional continuous applications. Journal of Control Theory and Applications 9(3), 336–352.
Puterman, M. L. 2005. Markov Decision Processes: Discrete Stochastic Dynamic Programming, 2nd edition. John Wiley & Sons.
Ramaswamy, A. & Bhatnagar, S. 2017. A generalization of the Borkar-Meyn theorem for stochastic recursive inclusions. Mathematics of Operations Research 42(3), 648–661.
Sayed, A. H. 2014. Adaptation, learning, and optimization over networks. Foundations and Trends in Machine Learning 7(4–5), 311–801.
Scherrer, B. 2010. Should one compute the temporal difference fix point or minimize the Bellman residual? The unified oblique projection view. In Proceedings of the International Conference on Machine Learning (ICML), 959–966.
Schulman, J., Moritz, P., Levine, S., Jordan, M. & Abbeel, P. 2015. High-dimensional continuous control using generalized advantage estimation. arXiv preprint.
Sutton, R. S., Maei, H. R., Precup, D., Bhatnagar, S., Silver, D., Szepesvari, C. & Wiewiora, E. 2009. Fast gradient-descent methods for temporal-difference learning with linear function approximation. In Proceedings of the International Conference on Machine Learning (ICML), 993–1000.
Sutton, R. S., McAllester, D., Singh, S. & Mansour, Y. 1999. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems (NIPS), 1057–1063.
Tadic, V. B. 2004. Almost sure convergence of two time-scale stochastic approximation algorithms. In IEEE American Control Conference, 4, 3802–3807.
Taylor, M. E. & Stone, P. 2009. Transfer learning for reinforcement learning domains: a survey. Journal of Machine Learning Research 10, 1633–1685.
Teh, Y. W., Bapst, V., Czarnecki, W. M., Quan, J., Kirkpatrick, J., Hadsell, R., Heess, N. & Pascanu, R. 2017. Distral: robust multitask reinforcement learning. arXiv preprint.
Tieleman, T. & Hinton, G. 2012. Lecture 6.5 - RMSProp: divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning.
Tomczak, M. B., Valcarcel Macua, S., de Cote, E. M. & Vrancx, P. 2019. Compatible features for monotonic policy improvement. arXiv preprint.
Tutunov, R., Bou-Ammar, H. & Jadbabaie, A. 2016. An exact distributed Newton method for reinforcement learning. In IEEE Conference on Decision and Control (CDC), 1003–1008.
Valcarcel Macua, S. 2017. Distributed Optimization, Control and Learning in Multiagent Networks. PhD thesis, Universidad Politécnica de Madrid.
Valcarcel Macua, S., Chen, J., Zazo, S. & Sayed, A. H. 2015. Distributed policy evaluation under multiple behavior strategies. IEEE Transactions on Automatic Control 60(5), 1260–1274.
Valcarcel Macua, S., Tukiainen, A., Hernández, D. G.-O., Baldazo, D., de Cote, E. M. & Zazo, S. 2017. Diff-DAC: distributed actor-critic for average multitask deep reinforcement learning. arXiv preprint.
van der Meulen, R. 2015. Gartner says 6.4 billion connected ‘things’ will be in use in 2016, up 30 percent from 2015. http://www.gartner.com/newsroom/id/3165317.
Van Hasselt, H. 2012. Reinforcement learning in continuous state and action spaces. In Reinforcement Learning. Springer, 207–251.
Wei, E. & Ozdaglar, A. 2012. Distributed alternating direction method of multipliers. In IEEE Annual Conference on Decision and Control (CDC), 5445–5450.
Weinstein, A. & Littman, M. L. 2012. Bandit-based planning and learning in continuous-action Markov decision processes. In International Conference on Automated Planning and Scheduling (ICAPS).
Wierstra, D., Schaul, T., Glasmachers, T., Sun, Y., Peters, J. & Schmidhuber, J. 2014. Natural evolution strategies. Journal of Machine Learning Research 15(1), 949–980.
Williams, R. J. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8(3–4), 229–256.
Yaji, V. G. & Bhatnagar, S. 2020. Stochastic recursive inclusions in two timescales with nonadditive iterate-dependent Markov noise. Mathematics of Operations Research.
Zhang, K., Yang, Z., Liu, H., Zhang, T. & Basar, T. 2018. Fully decentralized multi-agent reinforcement learning with networked agents. In Proceedings of the International Conference on Machine Learning (ICML), 5872–5881.
Zhao, X. & Sayed, A. H. 2012. Performance limits for distributed estimation over LMS adaptive networks. IEEE Transactions on Signal Processing 60(10), 5107–5124.
Zhao, X. & Sayed, A. H. 2015. Asynchronous adaptation and learning over networks—Part I: modeling and stability analysis. IEEE Transactions on Signal Processing 63(4), 811–826.

  • Cite this article: Sergio Valcarcel Macua, Ian Davies, Aleksi Tukiainen, Enrique Munoz de Cote. 2021. Fully distributed actor-critic architecture for multitask deep reinforcement learning. The Knowledge Engineering Review 36(1), doi: 10.1017/S0269888921000023


    • We thank David Baldazo, Daniel García-Ocaña Hernández, and Santiago Zazo for insightful preliminary discussions; Felix Leibfried for his support with the experiments; Peter Vrancx, Haitham Bou-Ammar, and Rasul Tutunov for helpful comments; and the anonymous reviewers and the Deputy Editor Patrick Mannion for their comments and suggestions that have helped to improve the presentation of the paper.

    • Conflicts of interest: the authors declare none.

    • This work was done while the authors were affiliated with Secondmind.

    • A preliminary version of this work appeared earlier as Valcarcel Macua et al. (2017).

    • We use boldface font to denote random variables and regular font to denote instances or deterministic variables.

    • See (19) below.

    • For ease of exposition, we assume that each agent is allocated one task, similar to El Bsat et al. (2017). The extension to multiple tasks per agent is trivial.

    • Note $v^{\star}$ is unique, while there might be multiple optimal dual variables.

    • See, for example, Ng et al. (1999), Konda and Tsitsiklis (2003), Melo and Lopes (2008), Bhatnagar et al. (2009), Powell and Ma (2011), Van Hasselt (2012), Weinstein and Littman (2012), Wierstra et al. (2014), Lillicrap et al. (2015), Heess et al. (2015), Schulman et al. (2015).

    • During experimentation we observed similar results across runs with the RMSProp and Adam optimisers.

    • a.s. stands for almost surely.

    • Table A.1 in Appendix C.4 presents the full results in terms of performance relative to specialised agents. Relative performance values are calculated from Table 2.

    • © The Author(s), 2021. Published by Cambridge University Press.