Instituto de Computação, Universidade Federal Fluminense, Niterói, RJ, Brazil (e-mails: nadinemelloni@id.uff.br, plastino@ic.uff.br)
Departamento de Estatística, Universidade Federal Fluminense, Niterói, RJ, Brazil (e-mail: jarrais@id.uff.br)
School of Computing, University of Kent, Canterbury, Kent, UK (e-mail: a.a.freitas@kent.ac.uk)
2021, Volume 36
RESEARCH ARTICLE   Open Access    

Is p-value < 0.05 enough? A study on the statistical evaluation of classifiers

Abstract: Statistical significance analysis, based on hypothesis tests, is a common approach for comparing classifiers. However, many studies oversimplify this analysis by merely checking the condition p-value < 0.05, ignoring important concepts such as the effect size and the statistical power of the test. This problem is so worrying that the American Statistical Association has taken a strong stand on the subject, noting that although the p-value is a useful statistical measure, it has often been misused and misinterpreted. This work highlights problems caused by the misuse of hypothesis tests and shows how the effect size and the power of the test can provide important information for better decision-making. To investigate these issues, we perform empirical studies with different classifiers and 50 datasets, using the Student's t-test and the Wilcoxon test to compare classifiers. The results show that an isolated p-value analysis can lead to wrong conclusions and that evaluating the effect size and the power of the test contributes to more principled decision-making.
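
To make the abstract's point concrete, the sketch below (not the authors' code; the accuracy values are illustrative assumptions) shows the kind of analysis the paper advocates: comparing two classifiers over several datasets and reporting not only the p-values of a paired Student's t-test and a Wilcoxon signed-rank test, but also the effect size (Cohen's d for paired samples) and the post-hoc power of the t-test.

```python
# A minimal sketch of a classifier comparison that goes beyond "p < 0.05":
# paired t-test and Wilcoxon signed-rank test, plus effect size and power.
import numpy as np
from scipy import stats
from statsmodels.stats.power import TTestPower

# Hypothetical per-dataset accuracies of two classifiers (paired samples).
acc_a = np.array([0.81, 0.76, 0.88, 0.79, 0.84, 0.90, 0.73, 0.85, 0.80, 0.77])
acc_b = np.array([0.80, 0.77, 0.86, 0.80, 0.81, 0.88, 0.74, 0.83, 0.78, 0.76])
diff = acc_a - acc_b

# Paired Student's t-test: the p-value alone says nothing about magnitude.
t_stat, p_t = stats.ttest_rel(acc_a, acc_b)

# Cohen's d for paired samples: mean difference / sd of the differences.
d = diff.mean() / diff.std(ddof=1)

# Post-hoc power of the paired t-test at alpha = 0.05 for this effect size.
power = TTestPower().power(effect_size=d, nobs=len(diff), alpha=0.05)

# Wilcoxon signed-rank test: non-parametric alternative for paired samples.
w_stat, p_w = stats.wilcoxon(acc_a, acc_b)

print(f"t-test:    t = {t_stat:.3f}, p = {p_t:.4f}")
print(f"Cohen's d: {d:.3f} (approx. 0.2 small, 0.5 medium, 0.8 large)")
print(f"power:     {power:.3f} at alpha = 0.05, n = {len(diff)}")
print(f"Wilcoxon:  W = {w_stat:.3f}, p = {p_w:.4f}")
```

Under this reading, a small p-value paired with a small effect size or low power is flagged for further scrutiny rather than taken as evidence of a practically meaningful difference.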

Cite this article

Nadine M. Neumann, Alexandre Plastino, Jony A. Pinto Junior, Alex A. Freitas. 2021. Is p-value < 0.05 enough? A study on the statistical evaluation of classifiers. The Knowledge Engineering Review 36(1). doi: 10.1017/S0269888920000417



© The Author(s), 2020. Published by Cambridge University Press.