2013 Volume 28
RESEARCH ARTICLE   Open Access    

A survey on metrics for the evaluation of user simulations

  • Abstract: User simulation is an important research area in the field of spoken dialogue systems (SDSs) because collecting and annotating real human–machine interactions is often expensive and time-consuming. However, such data are generally required for designing, training and assessing dialogue systems. User simulations are especially needed when using machine learning methods for optimizing dialogue management strategies, such as Reinforcement Learning, where the amount of data necessary for training is larger than existing corpora. The quality of the user simulation is therefore of crucial importance because it dramatically influences the results in terms of SDS performance analysis and the learnt strategy. Assessment of the quality of simulated dialogues and user simulation methods is an open issue and, although assessment metrics are required, there is no commonly adopted metric. In this paper, we give a survey of user simulation metrics in the literature, propose some extensions and discuss these metrics in terms of a list of desired features.
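  • Among the metrics the survey discusses are divergence-based measures that compare the distribution of dialogue acts produced by real users against that produced by the simulator, e.g. the Kullback–Leibler divergence (Kullback & Leibler 1951). As a rough illustration only — the corpora, act labels and smoothing constant below are hypothetical, not taken from the paper — such a measure can be sketched as:

    ```python
    from collections import Counter
    import math

    def act_distribution(dialogues, vocab, eps=1e-3):
        """Empirical dialogue-act distribution with additive smoothing, so the
        KL divergence stays finite when an act never occurs in one corpus."""
        counts = Counter(act for dialogue in dialogues for act in dialogue)
        total = sum(counts.values()) + eps * len(vocab)
        return {act: (counts[act] + eps) / total for act in vocab}

    def kl_divergence(p, q):
        """D_KL(p || q) in nats, for distributions over the same act vocabulary."""
        return sum(p[a] * math.log(p[a] / q[a]) for a in p)

    # Hypothetical corpora: each dialogue is a sequence of user dialogue-act labels.
    real_corpus = [["inform", "inform", "confirm", "bye"],
                   ["inform", "negate", "bye"]]
    simulated_corpus = [["inform", "confirm", "confirm", "bye"],
                        ["inform", "inform", "bye"]]

    vocab = {"inform", "confirm", "negate", "bye"}
    p = act_distribution(real_corpus, vocab)
    q = act_distribution(simulated_corpus, vocab)
    print(f"D_KL(real || simulated) = {kl_divergence(p, q):.4f} nats")
    ```

    Note that the KL divergence is asymmetric, which is one reason the survey also considers alternatives such as the Cramér–von Mises divergence (Williams 2008).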
  • Ai H., Litman D. 2008. Assessing dialog system user simulation evaluation measures using human judges. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics, Columbus, OH, USA, 622–629.

    Ai H., Litman D. 2009. Setting up user action probabilities in user simulations for dialog system development. In Proceedings of the 47th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL), Singapore.

    Anderson T. 1962. On the distribution of the two-sample Cramér–von Mises criterion. Annals of Mathematical Statistics 33(3), 1148–1159.

    Carletta J. 1996. Assessing agreement on classification tasks: the kappa statistic. Computational Linguistics 22(2), 249–254.

    Chandramohan S., Geist M., Lefèvre F., Pietquin O. 2011. User simulation in dialogue systems using inverse reinforcement learning. In Proceedings of Interspeech 2011, Florence, Italy.

    Cramér H. 1928. On the composition of elementary errors. Second paper: statistical applications. Skandinavisk Aktuarietidskrift 11, 171–180.

    Cuayahuitl H., Renals S., Lemon O., Shimodaira H. 2005. Human–computer dialogue simulation using hidden Markov models. In Proceedings of ASRU, Cancun, Mexico, 290–295.

    Cuayahuitl H. 2009. Hierarchical Reinforcement Learning for Spoken Dialogue Systems. PhD thesis, University of Edinburgh, UK.

    Doddington G. 2002. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In Proceedings of the Human Language Technology Conference (HLT), San Diego, CA, USA, 128–132.

    Eckert W., Levin E., Pieraccini R. 1997. User modeling for spoken dialogue system evaluation. In Proceedings of ASRU'97, Santa Barbara, USA.

    Frampton M., Lemon O. 2010. Recent research advances in reinforcement learning in spoken dialogue systems. The Knowledge Engineering Review 24(4), 375–408.

    Georgila K., Henderson J., Lemon O. 2005. Learning user simulations for information state update dialogue systems. In Proceedings of Interspeech 2005, Lisbon, Portugal.

    Georgila K., Henderson J., Lemon O. 2006. User simulation for spoken dialogue systems: learning and evaluation. In Proceedings of Interspeech'06, Pittsburgh, USA.

    Janarthanam S., Lemon O. 2009a. A data-driven method for adaptive referring expression generation in automated dialogue systems: maximising expected utility. In Proceedings of PRE-COGSCI 09, Boston, USA.

    Janarthanam S., Lemon O. 2009b. A two-tier user simulation model for reinforcement learning of adaptive referring expression generation policies. In Proceedings of SIGDIAL, London, UK.

    Janarthanam S., Lemon O. 2009c. Learning adaptive referring expression generation policies for spoken dialogue systems using reinforcement learning. In Proceedings of SEMDIAL, Stockholm, Sweden.

    Janarthanam S., Lemon O. 2009d. A Wizard-of-Oz environment to study referring expression generation in a situated spoken dialogue task. In Proceedings of ENLG 2009, Athens, Greece.

    Jung S., Lee C., Kim K., Jeong M., Lee G. G. 2009. Data-driven user simulation for automated evaluation of spoken dialog systems. Computer Speech & Language 23(4), 479–509.

    Kullback S., Leibler R. 1951. On information and sufficiency. Annals of Mathematical Statistics 22, 79–86.

    Levin E., Pieraccini R., Eckert W. 1997. Learning dialogue strategies within the Markov decision process framework. In Proceedings of ASRU'97, Santa Barbara, USA.

    Levin E., Pieraccini R., Eckert W. 2000. A stochastic model of human–machine interaction for learning dialog strategies. IEEE Transactions on Speech and Audio Processing 8(1), 11–23.

    López-Cózar R., de la Torre A., Segura J., Rubio A. 2003. Assessment of dialogue systems by means of a new simulation technique. Speech Communication 40(3), 387–407.

    Ng A. Y., Russell S. 2000. Algorithms for inverse reinforcement learning. In Proceedings of the 17th International Conference on Machine Learning, Morgan Kaufmann, 663–670.

    Paek T., Pieraccini R. 2008. Automating spoken dialogue management design using machine learning: an industry perspective. Speech Communication 50, 716–729.

    Papineni K., Roukos S., Ward T., Zhu W. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 311–318.

    Pietquin O., Dutoit T. 2006. A probabilistic framework for dialog simulation and optimal strategy learning. IEEE Transactions on Audio, Speech and Language Processing 14(2), 589–599.

    Pietquin O., Rossignol S., Ianotto M. 2009. Training Bayesian networks for realistic man–machine spoken dialogue simulation. In Proceedings of the 1st International Workshop on Spoken Dialogue Systems Technology, Irsee, Germany.

    Pietquin O. 2004. A Framework for Unsupervised Learning of Dialogue Strategies. PhD thesis, Faculté Polytechnique de Mons (FPMs), Belgium.

    Pietquin O. 2006. Consistent goal-directed user model for realistic man–machine task-oriented spoken dialogue simulation. In Proceedings of ICME'06, Toronto, Canada.

    Rieser V. 2008. Bootstrapping Reinforcement Learning-based Dialogue Strategies from Wizard-of-Oz Data. PhD thesis, Saarland University, Department of Computational Linguistics.

    Rieser V., Lemon O. 2006. Simulations for learning dialogue strategies. In Proceedings of Interspeech 2006, Pittsburgh, USA.

    Rieser V., Lemon O. 2008. Learning effective multimodal dialogue strategies from Wizard-of-Oz data: bootstrapping and evaluation. In Proceedings of ACL 2008, Columbus, Ohio.

    Russell S. 1998. Learning agents for uncertain environments (extended abstract). In COLT'98: Proceedings of the 11th Annual Conference on Computational Learning Theory, ACM, Madison, USA, 101–103.

    Schatzmann J., Georgila K., Young S. 2005a. Quantitative evaluation of user simulation techniques for spoken dialogue systems. In Proceedings of SIGdial'05, Lisbon, Portugal.

    Schatzmann J., Stuttle M. N., Weilhammer K., Young S. 2005b. Effects of the user model on simulation-based learning of dialogue strategies. In Proceedings of ASRU'05, Cancun, Mexico.

    Schatzmann J., Thomson B., Weilhammer K., Ye H., Young S. 2007a. Agenda-based user simulation for bootstrapping a POMDP dialogue system. In Proceedings of ICASSP'07, Honolulu, USA.

    Schatzmann J., Thomson B., Young S. 2007b. Statistical user simulation with a hidden agenda. In Proceedings of SIGdial'07, Antwerp, Belgium.

    Schatzmann J., Weilhammer K., Stuttle M., Young S. 2006. A survey of statistical user simulation techniques for reinforcement-learning of dialogue management strategies. The Knowledge Engineering Review 21(2), 97–126.

    Scheffler K., Young S. 2001. Corpus-based dialogue simulation for automatic strategy learning and evaluation. In Proceedings of the NAACL Workshop on Adaptation in Dialogue Systems, Pittsburgh, PA, USA.

    Singh S., Kearns M., Litman D., Walker M. 1999. Reinforcement learning for spoken dialogue systems. In Proceedings of NIPS'99, Vancouver, Canada.

    Sutton R., Barto A. 1998. Reinforcement Learning: An Introduction. MIT Press.

    van Rijsbergen C. J. 1979. Information Retrieval, 2nd edn. Butterworths.

    Walker M., Hindle D., Fromer J., Fabbrizio G. D., Mestel C. 1997a. Evaluating competing agent strategies for a voice email agent. In Proceedings of the 5th European Conference on Speech Communication and Technology (Eurospeech'97), Rhodes, Greece.

    Walker M., Litman D., Kamm C., Abella A. 1997b. PARADISE: a framework for evaluating spoken dialogue agents. In Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics, Madrid, Spain, 271–280.

    Williams J. D., Young S. 2007. Partially observable Markov decision processes for spoken dialog systems. Computer Speech and Language 21(2), 393–422.

    Williams J., Poupart P., Young S. 2005. Partially observable Markov decision processes with continuous observations for dialogue management. In Proceedings of the SIGdial Workshop (SIGdial'06), Sydney, Australia.

    Williams J. 2008. Evaluating user simulations with the Cramér–von Mises divergence. Speech Communication 50, 829–846.

    Zukerman I., Albrecht D. 2001. Predictive statistical models for user modeling. User Modeling and User-Adapted Interaction 11, 5–18.

  • Cite this article

    Olivier Pietquin, Helen Hastie. 2013. A survey on metrics for the evaluation of user simulations. The Knowledge Engineering Review 28(1), 59–73. doi: 10.1017/S0269888912000343


    • This research has received funding from the European Community's Seventh Framework Programme (FP7/2007–2013) under grant agreement no. 216594 (CLASSIC project: www.classic-project.org). The authors thank the referees, as well as Oliver Lemon and Paul Crook for their help and comments.

    • Notice that this naming is generally misleading, since user modelling is more about inferring the user's mental state than about producing consistent behaviour, which is the real aim of user simulation.

    • Notice that Zukerman and Albrecht (2001) is more about user modelling than user simulation, but the distinction is similar to that made by Schatzmann et al. (2006).

    • Copyright © 2012 Cambridge University Press
References (48)