Andreas, J., Klein, D. & Levine, S. 2017. Modular multitask reinforcement learning with policy sketches. In Proceedings of the International Conference on Machine Learning (ICML), 166–175.

Arrow, K. J., Hurwicz, L. & Uzawa, H. 1958. Studies in Linear and Non-linear Programming. Stanford University Press.

Assran, M., Romoff, J., Ballas, N., Pineau, J. & Rabbat, M. 2019. Gossip-based actor-learner architectures for deep reinforcement learning. In Advances in Neural Information Processing Systems (NeurIPS), 13320–13330.

Baird III, L. C. 1993. Advantage updating. Technical Report WL-TR-93-1146, Wright Laboratory, Wright-Patterson Air Force Base, OH.

Bertsekas, D. P. 2009. Convex Optimization Theory. Athena Scientific.

Bertsekas, D. P. 2012. Dynamic Programming and Optimal Control, Vol. 2, 4th edition. Athena Scientific.

Bhatnagar, S., Sutton, R. S., Ghavamzadeh, M. & Lee, M. 2009. Natural actor-critic algorithms. Automatica 45(11), 2471–2482.

Bianchi, P. & Jakubowicz, J. 2013. Convergence of a multi-agent projected stochastic gradient algorithm for non-convex optimization. IEEE Transactions on Automatic Control 58(2), 391–405.

Borkar, V. S. 2008. Stochastic Approximation: A Dynamical Systems Viewpoint. Cambridge University Press.

Borkar, V. S. 1997. Stochastic approximation with two time scales. Systems & Control Letters 29(5), 291–294.

Borkar, V. S. & Meyn, S. 1999. The O.D.E. method for convergence of stochastic approximation and reinforcement learning. SIAM Journal on Control and Optimization 38(2), 447–469.

Bou-Ammar, H., Eaton, E., Ruvolo, P. & Taylor, M. 2014. Online multi-task learning for policy gradient methods. In Proceedings of the International Conference on Machine Learning (ICML), 1206–1214.

Boyd, S. & Vandenberghe, L. 2004. Convex Optimization. Cambridge University Press.

Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J. & Zaremba, W. 2016. OpenAI Gym. arXiv preprint arXiv:1606.01540.

Chen, J. & Sayed, A. H. 2013. Distributed Pareto optimization via diffusion strategies. IEEE Journal of Selected Topics in Signal Processing 7(2), 205–220.

El Bsat, S., Bou-Ammar, H. & Taylor, M. E. 2017. Scalable multitask policy gradient reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 1847–1853.

Espeholt, L., Soyer, H., Munos, R., Simonyan, K., Mnih, V., Ward, T., Doron, Y., Firoiu, V., Harley, T., Dunning, I. et al. 2018. IMPALA: scalable distributed deep-RL with importance weighted actor-learner architectures. In Proceedings of the International Conference on Machine Learning (ICML).

Fu, J., Levine, S. & Abbeel, P. 2016. One-shot learning of manipulation skills with online dynamics adaptation and neural network priors. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 4019–4026.

Golub, G. & Van Loan, C. 1996. Matrix Computations, 3rd edition. Johns Hopkins University Press.

Grondman, I., Busoniu, L., Lopes, G. A. D. & Babuska, R. 2012. A survey of actor-critic reinforcement learning: standard and natural policy gradients. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 42(6), 1291–1307.

Heess, N., Wayne, G., Silver, D., Lillicrap, T., Erez, T. & Tassa, Y. 2015. Learning continuous control policies by stochastic value gradients. In Advances in Neural Information Processing Systems (NIPS), 2926–2934.

Horn, R. & Johnson, C. 1990. Matrix Analysis. Cambridge University Press.

Kar, S., Moura, J. M. F. & Poor, H. V. 2013. QD-learning: a collaborative distributed strategy for multi-agent reinforcement learning through consensus + innovations. IEEE Transactions on Signal Processing 61(7), 1848–1862.

Karp, R. M. 1972. Reducibility among combinatorial problems. In Complexity of Computer Computations. Springer, 85–103.

Kingma, D. P. & Ba, J. L. 2015. Adam: a method for stochastic optimization. In Proceedings of the International Conference on Learning Representations (ICLR).

Kober, J. & Peters, J. R. 2009. Policy search for motor primitives in robotics. In Advances in Neural Information Processing Systems (NIPS), 849–856.

Konda, V. R. & Tsitsiklis, J. N. 2003. On actor-critic algorithms. SIAM Journal on Control and Optimization 42(4), 1143–1166.

Lakshminarayanan, C. & Bhatnagar, S. 2017. A stability criterion for two timescale stochastic approximation schemes. Automatica 79, 108–114.

Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D. & Wierstra, D. 2015. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971.

Melo, F. S. & Lopes, M. 2008. Fitted natural actor-critic: a new algorithm for continuous state-action MDPs. In Machine Learning and Knowledge Discovery in Databases, Lecture Notes in Computer Science 5212. Springer, 66–81.

Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D. & Kavukcuoglu, K. 2016. Asynchronous methods for deep reinforcement learning. In Proceedings of the International Conference on Machine Learning (ICML), 1928–1937.

Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D. & Riedmiller, M. 2013. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.

Ng, A. Y., Parr, R. & Koller, D. 1999. Policy search via density estimation. In Advances in Neural Information Processing Systems (NIPS), 1022–1028.

Packer, C., Gao, K., Kos, J., Krähenbühl, P., Koltun, V. & Song, D. 2018. Assessing generalization in deep reinforcement learning. arXiv preprint arXiv:1810.12282.

Parisotto, E., Ba, J. L. & Salakhutdinov, R. 2016. Actor-mimic: deep multitask and transfer reinforcement learning. In Proceedings of the International Conference on Learning Representations (ICLR).

Powell, W. B. & Ma, J. 2011. A review of stochastic algorithms with continuous value function approximation and some new approximate policy iteration algorithms for multidimensional continuous applications. Journal of Control Theory and Applications 9(3), 336–352.

Puterman, M. L. 2005. Markov Decision Processes: Discrete Stochastic Dynamic Programming, 2nd edition. John Wiley & Sons.

Ramaswamy, A. & Bhatnagar, S. 2017. A generalization of the Borkar-Meyn theorem for stochastic recursive inclusions. Mathematics of Operations Research 42(3), 648–661.

Sayed, A. H. 2014. Adaptation, learning, and optimization over networks. Foundations and Trends in Machine Learning 7(4–5), 311–801.

Scherrer, B. 2010. Should one compute the temporal difference fix point or minimize the Bellman residual? The unified oblique projection view. In Proceedings of the International Conference on Machine Learning (ICML), 959–966.

Schulman, J., Moritz, P., Levine, S., Jordan, M. & Abbeel, P. 2015. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438.

Sutton, R. S., Maei, H. R., Precup, D., Bhatnagar, S., Silver, D., Szepesvári, C. & Wiewiora, E. 2009. Fast gradient-descent methods for temporal-difference learning with linear function approximation. In Proceedings of the International Conference on Machine Learning (ICML), 993–1000.

Sutton, R. S., McAllester, D., Singh, S. & Mansour, Y. 1999. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems (NIPS), 1057–1063.

Tadić, V. B. 2004. Almost sure convergence of two time-scale stochastic approximation algorithms. In Proceedings of the American Control Conference (ACC), 4, 3802–3807.

Taylor, M. E. & Stone, P. 2009. Transfer learning for reinforcement learning domains: a survey. Journal of Machine Learning Research 10, 1633–1685.

Teh, Y. W., Bapst, V., Czarnecki, W. M., Quan, J., Kirkpatrick, J., Hadsell, R., Heess, N. & Pascanu, R. 2017. Distral: robust multitask reinforcement learning. arXiv preprint arXiv:1707.04175.

Tieleman, T. & Hinton, G. 2012. Lecture 6.5-RMSProp: divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning.

Tomczak, M. B., Valcarcel Macua, S., de Cote, E. M. & Vrancx, P. 2019. Compatible features for monotonic policy improvement.

Tutunov, R., Bou-Ammar, H. & Jadbabaie, A. 2016. An exact distributed Newton method for reinforcement learning. In IEEE Conference on Decision and Control (CDC), 1003–1008.

Valcarcel Macua, S. 2017. Distributed Optimization, Control and Learning in Multiagent Networks. PhD thesis, Universidad Politécnica de Madrid.

Valcarcel Macua, S., Chen, J., Zazo, S. & Sayed, A. H. 2015. Distributed policy evaluation under multiple behavior strategies. IEEE Transactions on Automatic Control 60(5), 1260–1274.

Valcarcel Macua, S., Tukiainen, A., Hernández, D. G.-O., Baldazo, D., de Cote, E. M. & Zazo, S. 2017. Diff-DAC: distributed actor-critic for average multitask deep reinforcement learning. arXiv preprint arXiv:1710.10363.

van der Meulen, R. 2015. Gartner says 6.4 billion connected ‘things’ will be in use in 2016, up 30 percent from 2015. http://www.gartner.com/newsroom/id/3165317.

Van Hasselt, H. 2012. Reinforcement learning in continuous state and action spaces. In Reinforcement Learning: State-of-the-Art. Springer, 207–251.

Wei, E. & Ozdaglar, A. 2012. Distributed alternating direction method of multipliers. In IEEE Conference on Decision and Control (CDC), 5445–5450.

Weinstein, A. & Littman, M. L. 2012. Bandit-based planning and learning in continuous-action Markov decision processes. In International Conference on Automated Planning and Scheduling (ICAPS).

Wierstra, D., Schaul, T., Glasmachers, T., Sun, Y., Peters, J. & Schmidhuber, J. 2014. Natural evolution strategies. Journal of Machine Learning Research 15(1), 949–980.

Williams, R. J. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8(3–4), 229–256.

Yaji, V. G. & Bhatnagar, S. 2020. Stochastic recursive inclusions in two timescales with nonadditive iterate-dependent Markov noise. Mathematics of Operations Research.

Zhang, K., Yang, Z., Liu, H., Zhang, T. & Basar, T. 2018. Fully decentralized multi-agent reinforcement learning with networked agents. In Proceedings of the International Conference on Machine Learning (ICML), 5872–5881.

Zhao, X. & Sayed, A. H. 2012. Performance limits for distributed estimation over LMS adaptive networks. IEEE Transactions on Signal Processing 60(10), 5107–5124.

Zhao, X. & Sayed, A. H. 2015. Asynchronous adaptation and learning over networks—Part I: modeling and stability analysis. IEEE Transactions on Signal Processing 63(4), 811–826.