Andreas, J., Klein, D. & Levine, S. 2017. Modular multitask reinforcement learning with policy sketches. In Proceedings of the International Conference on Machine Learning (ICML), 166–175.
Arrow, K. J., Hurwicz, L. & Uzawa, H. 1958. Studies in Linear and Non-linear Programming. Stanford University Press.
Assran, M., Romoff, J., Ballas, N., Pineau, J. & Rabbat, M. 2019. Gossip-based actor-learner architectures for deep reinforcement learning. In Advances in Neural Information Processing Systems (NIPS), 13320–13330.
Baird III, L. C. 1993. Advantage Updating. Technical report, Wright Laboratory, Wright-Patterson Air Force Base, OH.
Bertsekas, D. P. 2009. Convex Optimization Theory. Athena Scientific.
Bertsekas, D. P. 2012. Dynamic Programming and Optimal Control, Vol. 2, 4th edition. Athena Scientific.
Bhatnagar, S., Sutton, R. S., Ghavamzadeh, M. & Lee, M. 2009. Natural actor-critic algorithms. Automatica 45(11), 2471–2482.
Bianchi, P. & Jakubowicz, J. 2013. Convergence of a multi-agent projected stochastic gradient algorithm for non-convex optimization. IEEE Transactions on Automatic Control 58(2), 391–405.
Borkar, V. S. 2008. Stochastic Approximation: A Dynamical Systems Viewpoint. Cambridge University Press.
Borkar, V. S. 1997. Stochastic approximation with two time scales. Systems & Control Letters 29(5), 291–294.
Borkar, V. S. & Meyn, S. 1999. The O.D.E. method for convergence of stochastic approximation and reinforcement learning. SIAM Journal on Control and Optimization 38, 447–469.
Bou-Ammar, H., Eaton, E., Ruvolo, P. & Taylor, M. 2014. Online multi-task learning for policy gradient methods. In Proceedings of the International Conference on Machine Learning (ICML), 1206–1214.
Boyd, S. & Vandenberghe, L. 2004. Convex Optimization. Cambridge University Press.
Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J. & Zaremba, W. 2016. OpenAI Gym. arXiv preprint.
Chen, J. & Sayed, A. H. 2013. Distributed Pareto optimization via diffusion strategies. IEEE Journal of Selected Topics in Signal Processing 7(2), 205–220.
El Bsat, S., Bou-Ammar, H. & Taylor, M. E. 2017. Scalable multitask policy gradient reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 1847–1853.
Espeholt, L., Soyer, H., Munos, R., Simonyan, K., Mnih, V., Ward, T., Doron, Y., Firoiu, V., Harley, T., Dunning, I. et al. 2018. IMPALA: scalable distributed deep-RL with importance weighted actor-learner architectures. In Proceedings of the International Conference on Machine Learning (ICML).
Fu, J., Levine, S. & Abbeel, P. 2016. One-shot learning of manipulation skills with online dynamics adaptation and neural network priors. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 4019–4026.
Golub, G. & Van Loan, C. 1996. Matrix Computations. Johns Hopkins University Press.
Grondman, I., Busoniu, L., Lopes, G. A. D. & Babuska, R. 2012. A survey of actor-critic reinforcement learning: standard and natural policy gradients. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 42(6), 1291–1307.
Heess, N., Wayne, G., Silver, D., Lillicrap, T., Erez, T. & Tassa, Y. 2015. Learning continuous control policies by stochastic value gradients. In Advances in Neural Information Processing Systems (NIPS), 2926–2934.
Horn, R. & Johnson, C. 1990. Matrix Analysis. Cambridge University Press.
Kar, S., Moura, J. M. F. & Poor, H. V. 2013. QD-learning: a collaborative distributed strategy for multi-agent reinforcement learning through consensus + innovations. IEEE Transactions on Signal Processing 61(7), 1848–1862.
Karp, R. M. 1972. Reducibility among combinatorial problems. In Complexity of Computer Computations. Springer, 85–103.
Kingma, D. & Ba, J. L. 2015. Adam: a method for stochastic optimization. In Proceedings of the International Conference on Learning Representations (ICLR).
Kober, J. & Peters, J. R. 2009. Policy search for motor primitives in robotics. In Advances in Neural Information Processing Systems (NIPS), 849–856.
Konda, V. R. & Tsitsiklis, J. N. 2003. On actor-critic algorithms. SIAM Journal on Control and Optimization 42(4), 1143–1166.
Lakshminarayanan, C. & Bhatnagar, S. 2017. A stability criterion for two timescale stochastic approximation schemes. Automatica 79, 108–114.
Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D. & Wierstra, D. 2015. Continuous control with deep reinforcement learning. arXiv preprint.
Melo, F. S. & Lopes, M. 2008. Fitted natural actor-critic: a new algorithm for continuous state-action MDPs. In Machine Learning and Knowledge Discovery in Databases, Lecture Notes in Computer Science 5212. Springer, 66–81.
Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D. & Kavukcuoglu, K. 2016. Asynchronous methods for deep reinforcement learning. In Proceedings of the International Conference on Machine Learning (ICML), 1928–1937.
Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D. & Riedmiller, M. 2013. Playing Atari with deep reinforcement learning. arXiv preprint.
Ng, A. Y., Parr, R. & Koller, D. 1999. Policy search via density estimation. In Advances in Neural Information Processing Systems (NIPS), 1022–1028.
Packer, C., Gao, K., Kos, J., Krähenbühl, P., Koltun, V. & Song, D. 2018. Assessing generalization in deep reinforcement learning. arXiv preprint.
Parisotto, E., Ba, J. L. & Salakhutdinov, R. 2016. Actor-mimic: deep multitask and transfer reinforcement learning. In Proceedings of the International Conference on Learning Representations (ICLR).
Powell, W. B. & Ma, J. 2011. A review of stochastic algorithms with continuous value function approximation and some new approximate policy iteration algorithms for multidimensional continuous applications. Journal of Control Theory and Applications 9(3), 336–352.
Puterman, M. L. 2005. Markov Decision Processes: Discrete Stochastic Dynamic Programming, 2nd edition. John Wiley & Sons.
Ramaswamy, A. & Bhatnagar, S. 2017. A generalization of the Borkar-Meyn theorem for stochastic recursive inclusions. Mathematics of Operations Research 42(3), 648–661.
Sayed, A. H. 2014. Adaptation, learning, and optimization over networks. Foundations and Trends in Machine Learning 7(4–5), 311–801.
Scherrer, B. 2010. Should one compute the temporal difference fix point or minimize the Bellman residual? The unified oblique projection view. In Proceedings of the International Conference on Machine Learning (ICML), 959–966.
Schulman, J., Moritz, P., Levine, S., Jordan, M. & Abbeel, P. 2015. High-dimensional continuous control using generalized advantage estimation. arXiv preprint.
Sutton, R. S., Maei, H. R., Precup, D., Bhatnagar, S., Silver, D., Szepesvari, C. & Wiewiora, E. 2009. Fast gradient-descent methods for temporal-difference learning with linear function approximation. In Proceedings of the International Conference on Machine Learning (ICML), 993–1000.
Sutton, R. S., McAllester, D., Singh, S. & Mansour, Y. 1999. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems (NIPS), 1057–1063.
Tadic, V. B. 2004. Almost sure convergence of two time-scale stochastic approximation algorithms. In IEEE American Control Conference, 4, 3802–3807.
Taylor, M. E. & Stone, P. 2009. Transfer learning for reinforcement learning domains: a survey. Journal of Machine Learning Research 10, 1633–1685.
Teh, Y. W., Bapst, V., Czarnecki, W. M., Quan, J., Kirkpatrick, J., Hadsell, R., Heess, N. & Pascanu, R. 2017. Distral: robust multitask reinforcement learning. arXiv preprint.
Tieleman, T. & Hinton, G. 2012. Lecture 6.5 - RMSProp: divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning.
Tomczak, M. B., Valcarcel Macua, S., de Cote, E. M. & Vrancx, P. 2019. Compatible features for monotonic policy improvement. arXiv preprint.
Tutunov, R., Bou-Ammar, H. & Jadbabaie, A. 2016. An exact distributed Newton method for reinforcement learning. In IEEE Conference on Decision and Control (CDC), 1003–1008.
Valcarcel Macua, S. 2017. Distributed Optimization, Control and Learning in Multiagent Networks. PhD thesis, Universidad Politécnica de Madrid.
Valcarcel Macua, S., Chen, J., Zazo, S. & Sayed, A. H. 2015. Distributed policy evaluation under multiple behavior strategies. IEEE Transactions on Automatic Control 60(5), 1260–1274.
Valcarcel Macua, S., Tukiainen, A., Hernández, D. G.-O., Baldazo, D., de Cote, E. M. & Zazo, S. 2017. Diff-DAC: distributed actor-critic for average multitask deep reinforcement learning. arXiv preprint.
van der Meulen, R. 2015. Gartner says 6.4 billion connected ‘things’ will be in use in 2016, up 30 percent from 2015. http://www.gartner.com/newsroom/id/3165317.
Van Hasselt, H. 2012. Reinforcement learning in continuous state and action spaces. In Reinforcement Learning. Springer, 207–251.
Wei, E. & Ozdaglar, A. 2012. Distributed alternating direction method of multipliers. In IEEE Conference on Decision and Control (CDC), 5445–5450.
Weinstein, A. & Littman, M. L. 2012. Bandit-based planning and learning in continuous-action Markov decision processes. In International Conference on Automated Planning and Scheduling (ICAPS).
Wierstra, D., Schaul, T., Glasmachers, T., Sun, Y., Peters, J. & Schmidhuber, J. 2014. Natural evolution strategies. Journal of Machine Learning Research 15(1), 949–980.
Williams, R. J. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8(3–4), 229–256.
Yaji, V. G. & Bhatnagar, S. 2020. Stochastic recursive inclusions in two timescales with nonadditive iterate-dependent Markov noise. Mathematics of Operations Research.
Zhang, K., Yang, Z., Liu, H., Zhang, T. & Basar, T. 2018. Fully decentralized multi-agent reinforcement learning with networked agents. In Proceedings of the International Conference on Machine Learning (ICML), 5872–5881.
Zhao, X. & Sayed, A. H. 2012. Performance limits for distributed estimation over LMS adaptive networks. IEEE Transactions on Signal Processing 60(10), 5107–5124.
Zhao, X. & Sayed, A. H. 2015. Asynchronous adaptation and learning over networks—Part I: modeling and stability analysis. IEEE Transactions on Signal Processing 63(4), 811–826.