Federation University Australia, Mt Helen, VIC, Australia; Deakin University, Geelong, VIC, Australia
2025 Volume 40
RESEARCH ARTICLE   Open Access    

An empirical investigation of value-based multi-objective reinforcement learning for stochastic environments

Abstract: One common approach to solving multi-objective reinforcement learning (MORL) problems is to extend conventional Q-learning by using vector Q-values in combination with a utility function. However, issues can arise with this approach in stochastic environments, particularly when optimising for the scalarised expected reward (SER) criterion. This paper extends prior research, providing a detailed examination of the factors influencing the frequency with which value-based MORL Q-learning algorithms learn the SER-optimal policy for an environment with stochastic state transitions. We empirically examine several variations of the core multi-objective Q-learning algorithm, as well as reward engineering approaches, and demonstrate the limitations of these methods. In particular, we highlight the critical impact of noisy Q-value estimates on the stability and convergence of these algorithms.
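As a concrete illustration of the approach the abstract describes, here is a minimal sketch of tabular multi-objective Q-learning with vector Q-values and a utility function (hypothetical Python; the environment sizes, utility function and hyperparameters are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

# Tabular Q-learning with *vector* Q-values: one value estimate per
# (state, action, objective). Sizes below are illustrative only.
N_STATES, N_ACTIONS, N_OBJECTIVES = 5, 2, 2
Q = np.zeros((N_STATES, N_ACTIONS, N_OBJECTIVES))

def utility(q_vec):
    # Hypothetical non-linear utility applied to a vector estimate.
    return q_vec[0] - q_vec[1] ** 2

def greedy_action(state):
    # Scalarise each action's vector estimate, then act greedily.
    return int(np.argmax([utility(Q[state, a]) for a in range(N_ACTIONS)]))

def q_update(s, a, reward_vec, s_next, done, alpha=0.1, gamma=1.0):
    # Component-wise Q-learning update, bootstrapping from the vector
    # value of the greedy action in the next state.
    target = np.asarray(reward_vec, dtype=float)
    if not done:
        target = target + gamma * Q[s_next, greedy_action(s_next)]
    Q[s, a] += alpha * (target - Q[s, a])
```

Note that `greedy_action` applies the utility to current (and therefore noisy) value estimates; under the SER criterion, where the utility should ideally be applied to the expected return, this is exactly the kind of step where the stability issues studied in this paper can arise.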
References (45)

Abels, A., Roijers, D., Lenaerts, T., Nowé, A. & Steckelmacher, D. 2019. Dynamic weights in multi-objective deep reinforcement learning. In International Conference on Machine Learning (ICML), 11–20.

Alegre, L. N., Bazzan, A. L., Roijers, D. M., Nowé, A. & da Silva, B. C. 2023. Sample-efficient multi-objective learning via generalized policy improvement prioritization. arXiv preprint arXiv:2301.07784.

Bai, Q., Agarwal, M. & Aggarwal, V. 2021. Joint optimization of multi-objective reinforcement learning with policy gradient based algorithm. arXiv preprint arXiv:2105.14125. doi: 10.1613/jair.1.13981

Basaklar, T., Gumussoy, S. & Ogras, U. Y. 2022. PD-MORL: Preference-driven multi-objective reinforcement learning algorithm. arXiv preprint arXiv:2208.07914.

Bryce, D., Cushing, W. & Kambhampati, S. 2007. Probabilistic Planning Is Multi-Objective. Arizona State University Computer Science and Engineering Technical Report 07-006.

Cai, X. Q., Zhang, P., Zhao, L., Bian, J., Sugiyama, M. & Llorens, A. 2023. Distributional Pareto-optimal multi-objective reinforcement learning. In Advances in Neural Information Processing Systems 36, 15593–15613.

Ding, K. 2022. Addressing the issue of stochastic environments and local decision-making in multi-objective reinforcement learning. arXiv preprint arXiv:2211.08669.

Dornheim, J. 2022. gTLO: A generalized and non-linear multi-objective deep reinforcement learning approach. arXiv preprint arXiv:2204.04988.

Drugan, M. M. & Nowe, A. 2013. Designing multi-objective multi-armed bandits algorithms: A study. In The 2013 International Joint Conference on Neural Networks (IJCNN), 1–8. IEEE. doi: 10.1109/IJCNN.2013.6707036

Fan, Z., Peng, N., Tian, M. & Fain, B. 2022. Welfare and fairness in multi-objective reinforcement learning. arXiv preprint arXiv:2212.01382.

Felten, F., Alegre, L. N., Nowé, A., Bazzan, A. L. C., Talbi, E. G., Danoy, G. & da Silva, B. C. 2023. A toolkit for reliable benchmarking and research in multi-objective reinforcement learning. In Proceedings of the 37th Conference on Neural Information Processing Systems (NeurIPS 2023).

Gábor, Z., Kalmár, Z. & Szepesvári, C. 1998. Multi-criteria reinforcement learning. In ICML '98, 197–205.

Hayes, C. F., Howley, E. & Mannion, P. 2020. Dynamic thresholded lexicographic ordering. In Adaptive and Learning Agents Workshop (AAMAS 2020).

Hayes, C. F., Rădulescu, R., Bargiacchi, E., Källström, J., Macfarlane, M., Reymond, M., Verstraeten, T., Zintgraf, L., Dazeley, R., Heintz, F., Howley, E., Irissappane, A., Mannion, P., Nowé, A., Ramos, G., Restelli, M., Vamplew, P. & Roijers, D. 2022b. A practical guide to multi-objective reinforcement learning and planning. Autonomous Agents and Multi-Agent Systems 36. doi: 10.1007/s10458-022-09552-y

Hayes, C. F., Roijers, D. M., Howley, E. & Mannion, P. 2022a. Multi-objective distributional value iteration. In Adaptive and Learning Agents Workshop (AAMAS 2022).

Huanca-Anquise, C. A., Bazzan, A. L. C. & Tavares, A. R. 2023. Multi-objective, multi-armed bandits: Algorithms for repeated games and application to route choice. Revista de Informática Teórica e Aplicada 30(1), 11–23. doi: 10.22456/2175-2745.122929

Issabekov, R. & Vamplew, P. 2012. An empirical comparison of two common multiobjective reinforcement learning algorithms. In AI 2012: Advances in Artificial Intelligence, Thielscher, M. & Zhang, D. (eds). Lecture Notes in Computer Science, 626–636.

Jin, J. & Ma, X. 2017. A multi-objective multi-agent framework for traffic light control. In 2017 11th Asian Control Conference (ASCC), 1199–1204. IEEE. doi: 10.1109/ASCC.2017.8287341

Kulkarni, T. D., Saeedi, A., Gautam, S. & Gershman, S. J. 2016. Deep successor reinforcement learning. arXiv preprint arXiv:1606.02396.

Lian, Z., Lv, C. & Lu, W. 2023. Inkjet OLED printing planning based on deep reinforcement learning and reward-based TLO. Journal of Physics: Conference Series 2450, 012081. IOP Publishing.

Lu, H., Herman, D. & Yu, Y. 2023. Multi-objective reinforcement learning: Convexity, stationarity and Pareto optimality. In The Eleventh International Conference on Learning Representations.

Machado, M. C., Barreto, A., Precup, D. & Bowling, M. 2023. Temporal abstraction in reinforcement learning with the successor representation. Journal of Machine Learning Research 24(80), 1–69.

Parisi, S., Pirotta, M., Smacchia, N., Bascetta, L. & Restelli, M. 2014. Policy gradient approaches for multi-objective sequential decision making. In 2014 International Joint Conference on Neural Networks (IJCNN), 2323–2330. IEEE. doi: 10.1109/IJCNN.2014.6889738

Reymond, M., Bargiacchi, E. & Nowé, A. 2022. Pareto conditioned networks. arXiv preprint arXiv:2204.05036.

Roijers, D. M., Vamplew, P., Whiteson, S. & Dazeley, R. 2013. A survey of multi-objective sequential decision-making. Journal of Artificial Intelligence Research 48, 67–113. doi: 10.1613/jair.3987

Röpke, W., Reymond, M., Mannion, P., Roijers, D. M., Nowé, A. & Rădulescu, R. 2024. Divide and conquer: Provably unveiling the Pareto front with multi-objective reinforcement learning. arXiv preprint arXiv:2402.07182.

Siddique, U., Weng, P. & Zimmer, M. 2020. Learning fair policies in multi-objective (deep) reinforcement learning with average and discounted rewards. In International Conference on Machine Learning, 8905–8915. PMLR.

Skalse, J., Hammond, L., Griffin, C. & Abate, A. 2022. Lexicographic multi-objective reinforcement learning. arXiv preprint arXiv:2212.13769. doi: 10.24963/ijcai.2022/476

Sutton, R. S. & Barto, A. G. 2018. Reinforcement Learning: An Introduction. MIT Press.

Sutton, R. S., Precup, D. & Singh, S. 1999. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence 112(1–2), 181–211. doi: 10.1016/S0004-3702(99)00052-1

Tercan, A. 2022. Solving MDPs with Thresholded Lexicographic Ordering Using Reinforcement Learning. PhD thesis, Colorado State University.

Tessler, C., Mankowitz, D. J. & Mannor, S. 2018. Reward constrained policy optimization. arXiv preprint arXiv:1805.11074.

Vamplew, P., Dazeley, R., Barker, E. & Kelarev, A. 2009. Constructing stochastic mixture policies for episodic multiobjective reinforcement learning tasks. In AJCAI, 340–349. Springer. doi: 10.1007/978-3-642-10439-8_35

Vamplew, P., Dazeley, R., Berry, A., Issabekov, R. & Dekker, E. 2011. Empirical evaluation methods for multiobjective reinforcement learning algorithms. Machine Learning 84(1–2), 51–80. doi: 10.1007/s10994-010-5232-5

Vamplew, P., Dazeley, R. & Foale, C. 2017. Softmax exploration strategies for multiobjective reinforcement learning. Neurocomputing 263, 74–86. doi: 10.1016/j.neucom.2016.09.141

Vamplew, P., Foale, C., Dazeley, R. & Bignold, A. 2021. Potential-based multiobjective reinforcement learning approaches to low-impact agents for AI safety. Engineering Applications of Artificial Intelligence 100, 104186. doi: 10.1016/j.engappai.2021.104186

Vamplew, P., Foale, C. & Dazeley, R. 2022a. The impact of environmental stochasticity on value-based multiobjective reinforcement learning. Neural Computing and Applications 34(3), 1783–1799. doi: 10.1007/s00521-021-05859-1

Vamplew, P., Foale, C. & Dazeley, R. 2024. Value function interference and greedy action selection in value-based multi-objective reinforcement learning. arXiv preprint arXiv:2402.06266.

Vamplew, P., Smith, B. J., Källström, J., Ramos, G., Rădulescu, R., Roijers, D. M., Hayes, C. F., Heintz, F., Mannion, P., Libin, P. J., et al. 2022b. Scalar reward is not enough: A response to Silver, Singh, Precup and Sutton (2021). Autonomous Agents and Multi-Agent Systems 36(2), 41. doi: 10.1007/s10458-022-09575-5

Vamplew, P., Yearwood, J., Dazeley, R. & Berry, A. 2008. On the Limitations of Scalarisation for Multi-objective Reinforcement Learning of Pareto Fronts. Springer-Verlag. doi: 10.1007/978-3-540-89378-3_37

Van Hasselt, H., Doron, Y., Strub, F., Hessel, M., Sonnerat, N. & Modayil, J. 2018. Deep reinforcement learning and the deadly triad. arXiv preprint arXiv:1812.02648.

Van Moffaert, K., Drugan, M. M. & Nowé, A. 2013. Scalarized multi-objective reinforcement learning: Novel design techniques. In 2013 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL). doi: 10.1109/ADPRL.2013.6615007

Vincent, M. 2024. Nonlinear scalarization in stochastic multi-objective MDPs. Neural Computing and Applications, 1–13. doi: 10.1007/s00521-024-10504-8

Xu, J., Tian, Y., Ma, P., Rus, D., Sueda, S. & Matusik, W. 2020. Prediction-guided multi-objective reinforcement learning for continuous robot control. In International Conference on Machine Learning, 10607–10616. PMLR.

Yang, R., Sun, X. & Narasimhan, K. 2019. A generalized algorithm for multi-objective reinforcement learning and policy adaptation. In Advances in Neural Information Processing Systems 32.

  • Cite this article

Kewen Ding, Peter Vamplew, Cameron Foale, Richard Dazeley. 2025. An empirical investigation of value-based multi-objective reinforcement learning for stochastic environments. The Knowledge Engineering Review 40(1). doi: 10.1017/S0269888925100052

Footnotes

• For applications where stochastic or non-stationary policies are acceptable, linear scalarisation can be used to find a set of policies on the convex hull of the Pareto front, which can then be combined to form an SER-optimal policy (Vamplew et al., 2009; Lu et al., 2023); see the derivation sketched after these footnotes. In this work, we consider only deterministic stationary policies, as in some applications these may be the only acceptable policies.

• Conventional single-objective RL does not use a scalarisation function, and so the ESR and SER criteria are the same in this context. Similarly, the ESR and SER criteria do not result in different policies for an MOMDP when using linear scalarisation, as the derivation after these footnotes shows.

    • This algorithm was named by the second author in honour of IT pioneer Maurice Moss.

• We speculated that this might be addressed using a two-phase variant of MOSS with separate learning and global statistics-gathering phases, the latter based strictly on the agent's current greedy policy. However, this failed to overcome the issues reported here, and so for reasons of space and clarity we have omitted that algorithm from this paper. Full details are available in Ding (2022).

• We note that our implementation of policy-options as described in Algorithm 4 does in fact learn Q-values for all states rather than just the starting state, although only those of the starting state are ever actually used for action-selection. This was done so as to introduce as few changes as possible to the code implementation of MO Q-learning. An alternative, and more efficient, implementation would be to map the MOMDP to a multi-objective multi-armed bandit (MOMAB), where each arm corresponds to a different deterministic policy (a sketch of this mapping follows these footnotes). This would support the use of specialised MOMAB algorithms (Drugan & Nowe, 2013; Huanca-Anquise et al., 2023). However, it is important to note that these approaches would still suffer from the same fundamental scaling issues as our implementation, as the number of arms equals ${|A|}^{|S|}$.

    • This is an Open Access article, distributed under the terms of the Creative Commons Attribution licence (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted re-use, distribution and reproduction, provided the original article is properly cited.
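To make the two linearity footnotes above concrete, here is the standard one-line derivation (basic expectation algebra, not reproduced from the paper). For a linear utility $f(\mathbf{v}) = \mathbf{w} \cdot \mathbf{v}$, scalarisation commutes with expectation, so the ESR and SER criteria coincide:

$$\underbrace{\mathbb{E}\left[f\left(\textstyle\sum_t \gamma^t \mathbf{r}_t\right)\right]}_{\text{ESR}} = \mathbf{w} \cdot \mathbb{E}\left[\textstyle\sum_t \gamma^t \mathbf{r}_t\right] = \underbrace{f\left(\mathbb{E}\left[\textstyle\sum_t \gamma^t \mathbf{r}_t\right]\right)}_{\text{SER}}$$

Likewise, a stochastic mixture policy $\pi_{\text{mix}}$ that executes deterministic policy $\pi_i$ with probability $p_i$ has vector value $\mathbf{V}^{\pi_{\text{mix}}} = \sum_i p_i \mathbf{V}^{\pi_i}$, which is why combining policies on the convex hull of the Pareto front can realise an SER-optimal value vector. Neither identity holds for non-linear $f$, which is the setting this paper examines.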
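Finally, a few lines suffice to sketch the MOMAB mapping mentioned in the policy-options footnote (hypothetical Python; the state and action sets are illustrative assumptions):

```python
from itertools import product

# Map a small MOMDP to a multi-objective multi-armed bandit (MOMAB):
# each arm is one deterministic stationary policy, i.e. one complete
# mapping from states to actions.
states = ["s0", "s1", "s2"]
actions = ["a0", "a1"]

# Enumerate every deterministic policy: one action choice per state.
arms = [dict(zip(states, choice))
        for choice in product(actions, repeat=len(states))]

# The arm count is |A| ** |S| -- the scaling problem noted above.
assert len(arms) == len(actions) ** len(states)  # 2 ** 3 == 8
print(f"{len(arms)} arms, e.g. {arms[0]}")
```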