Federation University Australia, Mt Helen, VIC, Australia; Deakin University, Geelong, VIC, Australia
2025 Volume 40
RESEARCH ARTICLE   Open Access    

An empirical investigation of value-based multi-objective reinforcement learning for stochastic environments

Abstract: One common approach to solving multi-objective reinforcement learning (MORL) problems is to extend conventional Q-learning by using vector Q-values in combination with a utility function. However, issues can arise with this approach in stochastic environments, particularly when optimising for the scalarised expected reward (SER) criterion. This paper extends prior research, providing a detailed examination of the factors influencing the frequency with which value-based MORL Q-learning algorithms learn the SER-optimal policy for an environment with stochastic state transitions. We empirically examine several variations of the core multi-objective Q-learning algorithm, as well as reward engineering approaches, and demonstrate the limitations of these methods. In particular, we highlight the critical impact of noisy Q-value estimates on the stability and convergence of these algorithms.
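As a concrete illustration of the approach the abstract describes, here is a minimal sketch of tabular multi-objective Q-learning with vector Q-values and a utility function (hypothetical Python; the environment sizes, utility function and hyperparameters are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

# Tabular Q-learning with *vector* Q-values: one value estimate per
# (state, action, objective). Sizes below are illustrative only.
N_STATES, N_ACTIONS, N_OBJECTIVES = 5, 2, 2
Q = np.zeros((N_STATES, N_ACTIONS, N_OBJECTIVES))

def utility(q_vec):
    # Hypothetical non-linear utility applied to a vector estimate.
    return q_vec[0] - q_vec[1] ** 2

def greedy_action(state):
    # Scalarise each action's vector estimate, then act greedily.
    return int(np.argmax([utility(Q[state, a]) for a in range(N_ACTIONS)]))

def q_update(s, a, reward_vec, s_next, done, alpha=0.1, gamma=1.0):
    # Component-wise Q-learning update, bootstrapping from the vector
    # value of the greedy action in the next state.
    target = np.asarray(reward_vec, dtype=float)
    if not done:
        target = target + gamma * Q[s_next, greedy_action(s_next)]
    Q[s, a] += alpha * (target - Q[s, a])
```

Note that `greedy_action` applies the utility to current (and therefore noisy) value estimates; under the SER criterion, where the utility should ideally be applied to the expected return, this is exactly the kind of step where the stability issues studied in this paper can arise.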
References (45)

Abels, A., Roijers, D., Lenaerts, T., Nowé, A. & Steckelmacher, D. 2019. Dynamic weights in multi-objective deep reinforcement learning. In International Conference on Machine Learning (ICML), 11–20.

Alegre, L. N., Bazzan, A. L., Roijers, D. M., Nowé, A. & da Silva, B. C. 2023. Sample-efficient multi-objective learning via generalized policy improvement prioritization. arXiv preprint arXiv:2301.07784.

Bai, Q., Agarwal, M. & Aggarwal, V. 2021. Joint optimization of multi-objective reinforcement learning with policy gradient based algorithm. arXiv preprint arXiv:2105.14125. doi: 10.1613/jair.1.13981

Basaklar, T., Gumussoy, S. & Ogras, U. Y. 2022. PD-MORL: Preference-driven multi-objective reinforcement learning algorithm. arXiv preprint arXiv:2208.07914.

Bryce, D., Cushing, W. & Kambhampati, S. 2007. Probabilistic Planning Is Multi-Objective. Arizona State University Computer Science and Engineering Technical Report 07-006.

Cai, X. Q., Zhang, P., Zhao, L., Bian, J., Sugiyama, M. & Llorens, A. 2023. Distributional Pareto-optimal multi-objective reinforcement learning. In Advances in Neural Information Processing Systems 36, 15593–15613.

Ding, K. 2022. Addressing the issue of stochastic environments and local decision-making in multi-objective reinforcement learning. arXiv preprint arXiv:2211.08669.

Dornheim, J. 2022. gTLO: A generalized and non-linear multi-objective deep reinforcement learning approach. arXiv preprint arXiv:2204.04988.

Drugan, M. M. & Nowe, A. 2013. Designing multi-objective multi-armed bandits algorithms: A study. In The 2013 International Joint Conference on Neural Networks (IJCNN), 1–8. IEEE. doi: 10.1109/IJCNN.2013.6707036

Fan, Z., Peng, N., Tian, M. & Fain, B. 2022. Welfare and fairness in multi-objective reinforcement learning. arXiv preprint arXiv:2212.01382.

Felten, F., Alegre, L. N., Nowé, A., Bazzan, A. L. C., Talbi, E. G., Danoy, G. & da Silva, B. C. 2023. A toolkit for reliable benchmarking and research in multi-objective reinforcement learning. In Proceedings of the 37th Conference on Neural Information Processing Systems (NeurIPS 2023).

Gábor, Z., Kalmár, Z. & Szepesvári, C. 1998. Multi-criteria reinforcement learning. In ICML '98, 197–205.

Hayes, C. F., Howley, E. & Mannion, P. 2020. Dynamic thresholded lexicographic ordering. In Adaptive and Learning Agents Workshop (AAMAS 2020).

Hayes, C. F., Rădulescu, R., Bargiacchi, E., Källström, J., Macfarlane, M., Reymond, M., Verstraeten, T., Zintgraf, L., Dazeley, R., Heintz, F., Howley, E., Irissappane, A., Mannion, P., Nowé, A., Ramos, G., Restelli, M., Vamplew, P. & Roijers, D. 2022b. A practical guide to multi-objective reinforcement learning and planning. Autonomous Agents and Multi-Agent Systems 36. doi: 10.1007/s10458-022-09552-y

Hayes, C. F., Roijers, D. M., Howley, E. & Mannion, P. 2022a. Multi-objective distributional value iteration. In Adaptive and Learning Agents Workshop (AAMAS 2022).

Huanca-Anquise, C. A., Bazzan, A. L. C. & Tavares, A. R. 2023. Multi-objective, multi-armed bandits: Algorithms for repeated games and application to route choice. Revista de Informática Teórica e Aplicada 30(1), 11–23. doi: 10.22456/2175-2745.122929

Issabekov, R. & Vamplew, P. 2012. An empirical comparison of two common multiobjective reinforcement learning algorithms. In AI 2012: Advances in Artificial Intelligence, Thielscher, M. & Zhang, D. (eds). Lecture Notes in Computer Science, 626–636.

Jin, J. & Ma, X. 2017. A multi-objective multi-agent framework for traffic light control. In 2017 11th Asian Control Conference (ASCC), 1199–1204. IEEE. doi: 10.1109/ASCC.2017.8287341

Kulkarni, T. D., Saeedi, A., Gautam, S. & Gershman, S. J. 2016. Deep successor reinforcement learning. arXiv preprint arXiv:1606.02396.

Lian, Z., Lv, C. & Lu, W. 2023. Inkjet OLED printing planning based on deep reinforcement learning and reward-based TLO. Journal of Physics: Conference Series 2450, 012081. IOP Publishing.

Lu, H., Herman, D. & Yu, Y. 2023. Multi-objective reinforcement learning: Convexity, stationarity and Pareto optimality. In The Eleventh International Conference on Learning Representations.

Machado, M. C., Barreto, A., Precup, D. & Bowling, M. 2023. Temporal abstraction in reinforcement learning with the successor representation. Journal of Machine Learning Research 24(80), 1–69.

Parisi, S., Pirotta, M., Smacchia, N., Bascetta, L. & Restelli, M. 2014. Policy gradient approaches for multi-objective sequential decision making. In 2014 International Joint Conference on Neural Networks (IJCNN), 2323–2330. IEEE. doi: 10.1109/IJCNN.2014.6889738

Reymond, M., Bargiacchi, E. & Nowé, A. 2022. Pareto conditioned networks. arXiv preprint arXiv:2204.05036.

Roijers, D. M., Vamplew, P., Whiteson, S. & Dazeley, R. 2013. A survey of multi-objective sequential decision-making. Journal of Artificial Intelligence Research 48, 67–113. doi: 10.1613/jair.3987

Röpke, W., Reymond, M., Mannion, P., Roijers, D. M., Nowé, A. & Rădulescu, R. 2024. Divide and conquer: Provably unveiling the Pareto front with multi-objective reinforcement learning. arXiv preprint arXiv:2402.07182.

Siddique, U., Weng, P. & Zimmer, M. 2020. Learning fair policies in multi-objective (deep) reinforcement learning with average and discounted rewards. In International Conference on Machine Learning, 8905–8915. PMLR.

Skalse, J., Hammond, L., Griffin, C. & Abate, A. 2022. Lexicographic multi-objective reinforcement learning. arXiv preprint arXiv:2212.13769. doi: 10.24963/ijcai.2022/476

Sutton, R. S. & Barto, A. G. 2018. Reinforcement Learning: An Introduction. MIT Press.

Sutton, R. S., Precup, D. & Singh, S. 1999. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence 112(1–2), 181–211. doi: 10.1016/S0004-3702(99)00052-1

Tercan, A. 2022. Solving MDPs with Thresholded Lexicographic Ordering Using Reinforcement Learning. PhD thesis, Colorado State University.

Tessler, C., Mankowitz, D. J. & Mannor, S. 2018. Reward constrained policy optimization. arXiv preprint arXiv:1805.11074.

Vamplew, P., Dazeley, R., Barker, E. & Kelarev, A. 2009. Constructing stochastic mixture policies for episodic multiobjective reinforcement learning tasks. In AJCAI, 340–349. Springer. doi: 10.1007/978-3-642-10439-8_35

Vamplew, P., Dazeley, R., Berry, A., Issabekov, R. & Dekker, E. 2011. Empirical evaluation methods for multiobjective reinforcement learning algorithms. Machine Learning 84(1–2), 51–80. doi: 10.1007/s10994-010-5232-5

Vamplew, P., Dazeley, R. & Foale, C. 2017. Softmax exploration strategies for multiobjective reinforcement learning. Neurocomputing 263, 74–86. doi: 10.1016/j.neucom.2016.09.141

Vamplew, P., Foale, C., Dazeley, R. & Bignold, A. 2021. Potential-based multiobjective reinforcement learning approaches to low-impact agents for AI safety. Engineering Applications of Artificial Intelligence 100, 104186. doi: 10.1016/j.engappai.2021.104186

Vamplew, P., Foale, C. & Dazeley, R. 2022a. The impact of environmental stochasticity on value-based multiobjective reinforcement learning. Neural Computing and Applications 34(3), 1783–1799. doi: 10.1007/s00521-021-05859-1

Vamplew, P., Foale, C. & Dazeley, R. 2024. Value function interference and greedy action selection in value-based multi-objective reinforcement learning. arXiv preprint arXiv:2402.06266.

Vamplew, P., Smith, B. J., Källström, J., Ramos, G., Rădulescu, R., Roijers, D. M., Hayes, C. F., Heintz, F., Mannion, P., Libin, P. J., et al. 2022b. Scalar reward is not enough: A response to Silver, Singh, Precup and Sutton (2021). Autonomous Agents and Multi-Agent Systems 36(2), 41. doi: 10.1007/s10458-022-09575-5

Vamplew, P., Yearwood, J., Dazeley, R. & Berry, A. 2008. On the Limitations of Scalarisation for Multi-objective Reinforcement Learning of Pareto Fronts. Springer-Verlag. doi: 10.1007/978-3-540-89378-3_37

Van Hasselt, H., Doron, Y., Strub, F., Hessel, M., Sonnerat, N. & Modayil, J. 2018. Deep reinforcement learning and the deadly triad. arXiv preprint arXiv:1812.02648.

Van Moffaert, K., Drugan, M. M. & Nowé, A. 2013. Scalarized multi-objective reinforcement learning: Novel design techniques. In 2013 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL). doi: 10.1109/ADPRL.2013.6615007

Vincent, M. 2024. Nonlinear scalarization in stochastic multi-objective MDPs. Neural Computing and Applications, 1–13. doi: 10.1007/s00521-024-10504-8

Xu, J., Tian, Y., Ma, P., Rus, D., Sueda, S. & Matusik, W. 2020. Prediction-guided multi-objective reinforcement learning for continuous robot control. In International Conference on Machine Learning, 10607–10616. PMLR.

Yang, R., Sun, X. & Narasimhan, K. 2019. A generalized algorithm for multi-objective reinforcement learning and policy adaptation. In Advances in Neural Information Processing Systems 32.

  • Cite this article

Kewen Ding, Peter Vamplew, Cameron Foale, Richard Dazeley. 2025. An empirical investigation of value-based multi-objective reinforcement learning for stochastic environments. The Knowledge Engineering Review 40(1). doi: 10.1017/S0269888925100052

Footnotes

• For applications where stochastic or non-stationary policies are acceptable, linear scalarisation can be used to find a set of policies on the convex hull of the Pareto front, which can then be combined to form an SER-optimal policy (Vamplew et al., 2009; Lu et al., 2023); see the derivation sketched after these footnotes. In this work, we consider only deterministic stationary policies, as in some applications these may be the only acceptable policies.

• Conventional single-objective RL does not use a scalarisation function, and so the ESR and SER criteria are the same in this context. Similarly, the ESR and SER criteria do not result in different policies for an MOMDP when using linear scalarisation, as the derivation after these footnotes shows.

    • This algorithm was named by the second author in honour of IT pioneer Maurice Moss.

• We speculated that this might be addressed using a two-phase variant of MOSS with separate learning and global statistics-gathering phases, the latter based strictly on the agent's current greedy policy. However, this failed to overcome the issues reported here, and so for reasons of space and clarity we have omitted that algorithm from this paper. Full details are available in Ding (2022).

• We note that our implementation of policy-options as described in Algorithm 4 does in fact learn Q-values for all states rather than just the starting state, although only those of the starting state are ever actually used for action-selection. This was done so as to introduce as few changes as possible to the code implementation of MO Q-learning. An alternative, and more efficient, implementation would be to map the MOMDP to a multi-objective multi-armed bandit (MOMAB), where each arm corresponds to a different deterministic policy (a sketch of this mapping follows these footnotes). This would support the use of specialised MOMAB algorithms (Drugan & Nowe, 2013; Huanca-Anquise et al., 2023). However, it is important to note that these approaches would still suffer from the same fundamental scaling issues as our implementation, as the number of arms equals ${|A|}^{|S|}$.

    • This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
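To make the two linearity footnotes above concrete, here is the standard one-line derivation (basic expectation algebra, not reproduced from the paper). For a linear utility $f(\mathbf{v}) = \mathbf{w} \cdot \mathbf{v}$, scalarisation commutes with expectation, so the ESR and SER criteria coincide:

$$\underbrace{\mathbb{E}\left[f\left(\textstyle\sum_t \gamma^t \mathbf{r}_t\right)\right]}_{\text{ESR}} = \mathbf{w} \cdot \mathbb{E}\left[\textstyle\sum_t \gamma^t \mathbf{r}_t\right] = \underbrace{f\left(\mathbb{E}\left[\textstyle\sum_t \gamma^t \mathbf{r}_t\right]\right)}_{\text{SER}}$$

Likewise, a stochastic mixture policy $\pi_{\text{mix}}$ that executes deterministic policy $\pi_i$ with probability $p_i$ has vector value $\mathbf{V}^{\pi_{\text{mix}}} = \sum_i p_i \mathbf{V}^{\pi_i}$, which is why combining policies on the convex hull of the Pareto front can realise an SER-optimal value vector. Neither identity holds for non-linear $f$, which is the setting this paper examines.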
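Finally, a few lines suffice to sketch the MOMAB mapping mentioned in the policy-options footnote (hypothetical Python; the state and action sets are illustrative assumptions):

```python
from itertools import product

# Map a small MOMDP to a multi-objective multi-armed bandit (MOMAB):
# each arm is one deterministic stationary policy, i.e. one complete
# mapping from states to actions.
states = ["s0", "s1", "s2"]
actions = ["a0", "a1"]

# Enumerate every deterministic policy: one action choice per state.
arms = [dict(zip(states, choice))
        for choice in product(actions, repeat=len(states))]

# The arm count is |A| ** |S| -- the scaling problem noted above.
assert len(arms) == len(actions) ** len(states)  # 2 ** 3 == 8
print(f"{len(arms)} arms, e.g. {arms[0]}")
```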