Instituto de Computação, Universidade Federal Fluminense, Niterói, RJ, Brazil (e-mails: nadinemelloni@id.uff.br, plastino@ic.uff.br)
Departamento de Estatística, Universidade Federal Fluminense, Niterói, RJ, Brazil (e-mail: jarrais@id.uff.br)
School of Computing, University of Kent, Canterbury, Kent, UK (e-mail: a.a.freitas@kent.ac.uk)
2021, Volume 36
RESEARCH ARTICLE   Open Access    

Is p-value < 0.05 enough? A study on the statistical evaluation of classifiers

Abstract: Statistical significance analysis, based on hypothesis tests, is a common approach for comparing classifiers. However, many studies oversimplify this analysis by merely checking the condition p-value < 0.05, ignoring important concepts such as the effect size and the statistical power of the test. This problem is so worrying that the American Statistical Association has taken a strong stand on the subject, noting that although the p-value is a useful statistical measure, it has often been misused and misinterpreted. This work highlights problems caused by the misuse of hypothesis tests and shows how the effect size and the power of the test can provide important information for better decision-making. To investigate these issues, we perform empirical studies with different classifiers and 50 datasets, using the Student's t-test and the Wilcoxon test to compare classifiers. The results show that an isolated p-value analysis can lead to wrong conclusions and that evaluating the effect size and the power of the test contributes to more principled decision-making.
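
To make the abstract's point concrete, the sketch below (not the authors' code; the accuracy values are illustrative assumptions) shows the kind of analysis the paper advocates: comparing two classifiers over several datasets and reporting not only the p-values of a paired Student's t-test and a Wilcoxon signed-rank test, but also the effect size (Cohen's d for paired samples) and the post-hoc power of the t-test.

```python
# A minimal sketch of a classifier comparison that goes beyond "p < 0.05":
# paired t-test and Wilcoxon signed-rank test, plus effect size and power.
import numpy as np
from scipy import stats
from statsmodels.stats.power import TTestPower

# Hypothetical per-dataset accuracies of two classifiers (paired samples).
acc_a = np.array([0.81, 0.76, 0.88, 0.79, 0.84, 0.90, 0.73, 0.85, 0.80, 0.77])
acc_b = np.array([0.80, 0.77, 0.86, 0.80, 0.81, 0.88, 0.74, 0.83, 0.78, 0.76])
diff = acc_a - acc_b

# Paired Student's t-test: the p-value alone says nothing about magnitude.
t_stat, p_t = stats.ttest_rel(acc_a, acc_b)

# Cohen's d for paired samples: mean difference / sd of the differences.
d = diff.mean() / diff.std(ddof=1)

# Post-hoc power of the paired t-test at alpha = 0.05 for this effect size.
power = TTestPower().power(effect_size=d, nobs=len(diff), alpha=0.05)

# Wilcoxon signed-rank test: non-parametric alternative for paired samples.
w_stat, p_w = stats.wilcoxon(acc_a, acc_b)

print(f"t-test:    t = {t_stat:.3f}, p = {p_t:.4f}")
print(f"Cohen's d: {d:.3f} (approx. 0.2 small, 0.5 medium, 0.8 large)")
print(f"power:     {power:.3f} at alpha = 0.05, n = {len(diff)}")
print(f"Wilcoxon:  W = {w_stat:.3f}, p = {p_w:.4f}")
```

Under this reading, a small p-value paired with a small effect size or low power is flagged for further scrutiny rather than taken as evidence of a practically meaningful difference.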

Cite this article

Nadine M. Neumann, Alexandre Plastino, Jony A. Pinto Junior, Alex A. Freitas. 2021. Is p-value < 0.05 enough? A study on the statistical evaluation of classifiers. The Knowledge Engineering Review 36(1). doi: 10.1017/S0269888920000417



© The Author(s), 2020. Published by Cambridge University Press.