An online scalarization multi-objective reinforcement learning algorithm: TOPSIS Q-learning

Mohammad Mirzanejad; Morteza Ebrahimi; Peter Vamplew; Hadi Veisi

doi:10.1017/S0269888921000163

¹Faculty of New Sciences and Technologies, University of Tehran, Tehran, Iran; e-mail: mirzanejad@ut.ac.ir; mo.ebrahimi@ut.ac.ir; h.veisi@ut.ac.ir
²School of Engineering, Information Technology and Physical Sciences, Federation University Australia, Ballarat, Australia; e-mail: p.vamplew@federation.edu.au

2022 Volume 37


RESEARCH ARTICLE Open Access


Received: 25 March 2021
Revised: 14 October 2021
Accepted: 15 December 2021
Published online: 13 June 2022
The Knowledge Engineering Review 37, Article number: e7 (2022)

Abstract

Conventional reinforcement learning focuses on problems with a single objective. However, many problems have multiple objectives or criteria that may be independent, related, or contradictory. In such cases, multi-objective reinforcement learning is used to propose a compromise among the solutions that balances the objectives. TOPSIS is a multi-criteria decision method that selects the alternative with the minimum distance from the positive ideal solution and the maximum distance from the negative ideal solution, so it can be used effectively in the decision-making process to select the next action. In this research, a single-policy algorithm called TOPSIS Q-Learning is presented, with a focus on its performance in online mode. Unlike other single-policy methods, the first version of the algorithm does not require the user to specify the weights of the objectives. Because the user's preferences may not be completely definite, all weight preferences are combined as decision criteria and a solution is generated by considering all of these preferences at once. The user can also model uncertainty and changes in the objective weights around their specified preferences. If the user only wants to apply the algorithm to a specific set of weights, the second version of the algorithm accomplishes that efficiently.
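To make the TOPSIS selection rule described in the abstract concrete, the following is a minimal sketch of generic TOPSIS applied to choosing an action from per-objective value estimates. It is not the authors' TOPSIS Q-Learning algorithm. The function names `topsis_scores` and `select_action`, the equal default weights, the vector normalization, and the assumption that every objective is to be maximized are illustrative choices that do not come from the article.

```python
import math


def topsis_scores(q_values, weights=None):
    """Return the TOPSIS closeness coefficient of each alternative.

    q_values: one row per action, one column per objective
              (assumed here to be "larger is better" for every objective).
    weights:  one weight per objective; equal weights if omitted (assumption).
    """
    n_obj = len(q_values[0])
    if weights is None:
        weights = [1.0 / n_obj] * n_obj

    # Vector-normalize each objective column; an all-zero column keeps norm 1.
    norms = [
        math.sqrt(sum(row[j] ** 2 for row in q_values)) or 1.0
        for j in range(n_obj)
    ]
    weighted = [
        [weights[j] * row[j] / norms[j] for j in range(n_obj)]
        for row in q_values
    ]

    # Positive ideal: best value per objective; negative ideal: worst value.
    ideal = [max(row[j] for row in weighted) for j in range(n_obj)]
    anti_ideal = [min(row[j] for row in weighted) for j in range(n_obj)]

    scores = []
    for row in weighted:
        d_pos = math.dist(row, ideal)       # distance to the positive ideal
        d_neg = math.dist(row, anti_ideal)  # distance to the negative ideal
        total = d_pos + d_neg
        # If all alternatives coincide, every score is a neutral 0.5.
        scores.append(d_neg / total if total > 0 else 0.5)
    return scores


def select_action(q_values, weights=None):
    """Pick the index of the action with the highest closeness coefficient."""
    scores = topsis_scores(q_values, weights)
    return max(range(len(scores)), key=scores.__getitem__)


if __name__ == "__main__":
    # Three hypothetical actions evaluated on two objectives.
    q = [[1.0, 0.0], [0.0, 1.0], [0.8, 0.8]]
    print(topsis_scores(q))
    print(select_action(q))
```

A closeness coefficient near 1 means the action lies close to the positive ideal and far from the negative ideal. In this toy input the balanced third action scores highest under equal weights, while putting all weight on the first objective favors the first action.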
Rights and permissions
© The Author(s), 2022. Published by Cambridge University Press


About this article

Cite this article

Mohammad Mirzanejad, Morteza Ebrahimi, Peter Vamplew, Hadi Veisi. 2022. An online scalarization multi-objective reinforcement learning algorithm: TOPSIS Q-learning. The Knowledge Engineering Review. 37:163 doi: 10.1017/S0269888921000163
