2013 Volume 28
RESEARCH ARTICLE   Open Access    

A survey on metrics for the evaluation of user simulations

  • Abstract: User simulation is an important research area in the field of spoken dialogue systems (SDSs) because collecting and annotating real human–machine interactions is often expensive and time-consuming. However, such data are generally required for designing, training and assessing dialogue systems. User simulations are especially needed when using machine learning methods for optimizing dialogue management strategies, such as Reinforcement Learning, where the amount of data necessary for training is larger than existing corpora. The quality of the user simulation is therefore of crucial importance because it dramatically influences the results in terms of SDS performance analysis and the learnt strategy. Assessment of the quality of simulated dialogues and user simulation methods is an open issue and, although assessment metrics are required, there is no commonly adopted metric. In this paper, we give a survey of user simulation metrics in the literature, propose some extensions and discuss these metrics in terms of a list of desired features.
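  • Among the metrics the survey discusses are divergence-based measures that compare the distribution of dialogue acts produced by real users against that produced by the simulator, e.g. the Kullback–Leibler divergence (Kullback & Leibler 1951). As a rough illustration only — the corpora, act labels and smoothing constant below are hypothetical, not taken from the paper — such a measure can be sketched as:

    ```python
    from collections import Counter
    import math

    def act_distribution(dialogues, vocab, eps=1e-3):
        """Empirical dialogue-act distribution with additive smoothing, so the
        KL divergence stays finite when an act never occurs in one corpus."""
        counts = Counter(act for dialogue in dialogues for act in dialogue)
        total = sum(counts.values()) + eps * len(vocab)
        return {act: (counts[act] + eps) / total for act in vocab}

    def kl_divergence(p, q):
        """D_KL(p || q) in nats, for distributions over the same act vocabulary."""
        return sum(p[a] * math.log(p[a] / q[a]) for a in p)

    # Hypothetical corpora: each dialogue is a sequence of user dialogue-act labels.
    real_corpus = [["inform", "inform", "confirm", "bye"],
                   ["inform", "negate", "bye"]]
    simulated_corpus = [["inform", "confirm", "confirm", "bye"],
                        ["inform", "inform", "bye"]]

    vocab = {"inform", "confirm", "negate", "bye"}
    p = act_distribution(real_corpus, vocab)
    q = act_distribution(simulated_corpus, vocab)
    print(f"D_KL(real || simulated) = {kl_divergence(p, q):.4f} nats")
    ```

    Note that the KL divergence is asymmetric, which is one reason the survey also considers alternatives such as the Cramér–von Mises divergence (Williams 2008).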
  • Ai H., Litman D. 2008. Assessing dialog system user simulation evaluation measures using human judges. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics, Columbus, OH, USA, 622–629.

    Ai H., Litman D. 2009. Setting up user action probabilities in user simulations for dialog system development. In Proceedings of the 47th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL), Singapore.

    Anderson T. 1962. On the distribution of the two-sample Cramér–von Mises criterion. Annals of Mathematical Statistics 33(3), 1148–1159.

    Carletta J. 1996. Assessing agreement on classification tasks: the kappa statistic. Computational Linguistics 22(2), 249–254.

    Chandramohan S., Geist M., Lefèvre F., Pietquin O. 2011. User simulation in dialogue systems using inverse reinforcement learning. In Proceedings of Interspeech 2011, Florence, Italy.

    Cramér H. 1928. On the composition of elementary errors. Second paper: statistical applications. Skandinavisk Aktuarietidskrift 11, 171–180.

    Cuayahuitl H., Renals S., Lemon O., Shimodaira H. 2005. Human–computer dialogue simulation using hidden Markov models. In Proceedings of ASRU, Cancun, Mexico, 290–295.

    Cuayahuitl H. 2009. Hierarchical Reinforcement Learning for Spoken Dialogue Systems. PhD thesis, University of Edinburgh, UK.

    Doddington G. 2002. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In Proceedings of the Human Language Technology Conference (HLT), San Diego, CA, USA, 128–132.

    Eckert W., Levin E., Pieraccini R. 1997. User modeling for spoken dialogue system evaluation. In Proceedings of ASRU'97, Santa Barbara, USA.

    Frampton M., Lemon O. 2010. Recent research advances in reinforcement learning in spoken dialogue systems. The Knowledge Engineering Review 24(4), 375–408.

    Georgila K., Henderson J., Lemon O. 2005. Learning user simulations for information state update dialogue systems. In Proceedings of Interspeech 2005, Lisbon, Portugal.

    Georgila K., Henderson J., Lemon O. 2006. User simulation for spoken dialogue systems: learning and evaluation. In Proceedings of Interspeech'06, Pittsburgh, USA.

    Janarthanam S., Lemon O. 2009a. A data-driven method for adaptive referring expression generation in automated dialogue systems: maximising expected utility. In Proceedings of PRE-COGSCI 09, Boston, USA.

    Janarthanam S., Lemon O. 2009b. A two-tier user simulation model for reinforcement learning of adaptive referring expression generation policies. In Proceedings of SIGDIAL, London, UK.

    Janarthanam S., Lemon O. 2009c. Learning adaptive referring expression generation policies for spoken dialogue systems using reinforcement learning. In Proceedings of SEMDIAL, Stockholm, Sweden.

    Janarthanam S., Lemon O. 2009d. A Wizard-of-Oz environment to study referring expression generation in a situated spoken dialogue task. In Proceedings of ENLG 2009, Athens, Greece.

    Jung S., Lee C., Kim K., Jeong M., Lee G. G. 2009. Data-driven user simulation for automated evaluation of spoken dialog systems. Computer Speech & Language 23(4), 479–509.

    Kullback S., Leibler R. 1951. On information and sufficiency. Annals of Mathematical Statistics 22, 79–86.

    Levin E., Pieraccini R., Eckert W. 1997. Learning dialogue strategies within the Markov decision process framework. In Proceedings of ASRU'97, Santa Barbara, USA.

    Levin E., Pieraccini R., Eckert W. 2000. A stochastic model of human–machine interaction for learning dialog strategies. IEEE Transactions on Speech and Audio Processing 8(1), 11–23.

    López-Cózar R., de la Torre A., Segura J., Rubio A. 2003. Assessment of dialogue systems by means of a new simulation technique. Speech Communication 40(3), 387–407.

    Ng A. Y., Russell S. 2000. Algorithms for inverse reinforcement learning. In Proceedings of the 17th International Conference on Machine Learning, Morgan Kaufmann, 663–670.

    Paek T., Pieraccini R. 2008. Automating spoken dialogue management design using machine learning: an industry perspective. Speech Communication 50, 716–729.

    Papineni K., Roukos S., Ward T., Zhu W. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 311–318.

    Pietquin O., Dutoit T. 2006. A probabilistic framework for dialog simulation and optimal strategy learning. IEEE Transactions on Audio, Speech and Language Processing 14(2), 589–599.

    Pietquin O., Rossignol S., Ianotto M. 2009. Training Bayesian networks for realistic man–machine spoken dialogue simulation. In Proceedings of the 1st International Workshop on Spoken Dialogue Systems Technology, Irsee, Germany.

    Pietquin O. 2004. A Framework for Unsupervised Learning of Dialogue Strategies. PhD thesis, Faculté Polytechnique de Mons (FPMs), Belgium.

    Pietquin O. 2006. Consistent goal-directed user model for realistic man–machine task-oriented spoken dialogue simulation. In Proceedings of ICME'06, Toronto, Canada.

    Rieser V. 2008. Bootstrapping Reinforcement Learning-based Dialogue Strategies from Wizard-of-Oz Data. PhD thesis, Saarland University, Department of Computational Linguistics.

    Rieser V., Lemon O. 2006. Simulations for learning dialogue strategies. In Proceedings of Interspeech 2006, Pittsburgh, USA.

    Rieser V., Lemon O. 2008. Learning effective multimodal dialogue strategies from Wizard-of-Oz data: bootstrapping and evaluation. In Proceedings of ACL 2008, Columbus, Ohio.

    Russell S. 1998. Learning agents for uncertain environments (extended abstract). In COLT'98: Proceedings of the 11th Annual Conference on Computational Learning Theory, ACM, Madison, USA, 101–103.

    Schatzmann J., Georgila K., Young S. 2005a. Quantitative evaluation of user simulation techniques for spoken dialogue systems. In Proceedings of SIGdial'05, Lisbon, Portugal.

    Schatzmann J., Stuttle M. N., Weilhammer K., Young S. 2005b. Effects of the user model on simulation-based learning of dialogue strategies. In Proceedings of ASRU'05, Cancun, Mexico.

    Schatzmann J., Thomson B., Weilhammer K., Ye H., Young S. 2007a. Agenda-based user simulation for bootstrapping a POMDP dialogue system. In Proceedings of ICASSP'07, Honolulu, USA.

    Schatzmann J., Thomson B., Young S. 2007b. Statistical user simulation with a hidden agenda. In Proceedings of SIGdial'07, Antwerp, Belgium.

    Schatzmann J., Weilhammer K., Stuttle M., Young S. 2006. A survey of statistical user simulation techniques for reinforcement-learning of dialogue management strategies. The Knowledge Engineering Review 21(2), 97–126.

    Scheffler K., Young S. 2001. Corpus-based dialogue simulation for automatic strategy learning and evaluation. In Proceedings of the NAACL Workshop on Adaptation in Dialogue Systems, Pittsburgh, PA, USA.

    Singh S., Kearns M., Litman D., Walker M. 1999. Reinforcement learning for spoken dialogue systems. In Proceedings of NIPS'99, Vancouver, Canada.

    Sutton R., Barto A. 1998. Reinforcement Learning: An Introduction. MIT Press.

    van Rijsbergen C. J. 1979. Information Retrieval, 2nd edn. Butterworths.

    Walker M., Hindle D., Fromer J., Fabbrizio G. D., Mestel C. 1997a. Evaluating competing agent strategies for a voice email agent. In Proceedings of the 5th European Conference on Speech Communication and Technology (Eurospeech'97), Rhodes, Greece.

    Walker M., Litman D., Kamm C., Abella A. 1997b. PARADISE: a framework for evaluating spoken dialogue agents. In Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics, Madrid, Spain, 271–280.

    Williams J. D., Young S. 2007. Partially observable Markov decision processes for spoken dialog systems. Computer Speech and Language 21(2), 393–422.

    Williams J., Poupart P., Young S. 2005. Partially observable Markov decision processes with continuous observations for dialogue management. In Proceedings of the SIGdial Workshop (SIGdial'06), Sydney, Australia.

    Williams J. 2008. Evaluating user simulations with the Cramér–von Mises divergence. Speech Communication 50, 829–846.

    Zukerman I., Albrecht D. 2001. Predictive statistical models for user modeling. User Modeling and User-Adapted Interaction 11, 5–18.

  • Cite this article

    Olivier Pietquin, Helen Hastie. 2013. A survey on metrics for the evaluation of user simulations. The Knowledge Engineering Review 28(1), 59–73. doi: 10.1017/S0269888912000343


    • This research has received funding from the European Community's Seventh Framework Programme (FP7/2007–2013) under grant agreement no. 216594 (CLASSIC project: www.classic-project.org). The authors thank the referees, as well as Oliver Lemon and Paul Crook for their help and comments.

    • Notice that this naming is generally misleading, since user modelling is more about inferring the user's mental state than about producing consistent behaviour, which is the real aim of user simulation.

    • Notice that Zukerman and Albrecht (2001) is more about user modelling than user simulation, but the distinction is similar to that made by Schatzmann et al. (2006).

    • Copyright © 2012 Cambridge University Press
References (48)