RESEARCH ARTICLE   Open Access    

Discretization as the enabling technique for the Naïve Bayes and semi-Naïve Bayes-based classification

More Information
  • Corresponding authors: Marcin J. Mizianty ;  Lukasz A. Kurgan ;  Marek R. Ogiela
  • Abstract: Current classification problems that concern data sets of large and increasing size require scalable classification algorithms. In this study, we concentrate on several scalable, linear-complexity classifiers that include one of the top 10 voted data mining methods, Naïve Bayes (NB), and several recently proposed semi-NB classifiers. These algorithms perform front-end discretization of the continuous features since, by design, they work only with nominal or discrete features. We address the lack of studies that investigate the benefits and drawbacks of discretization in the context of the subsequent classification. Our comprehensive empirical study considers 12 discretizers (two unsupervised and 10 supervised), seven classifiers (two classical NB and five semi-NB), and 16 data sets. We investigate the scalability of the discretizers and show that the fastest supervised discretizers, fast class-attribute interdependency maximization (FCAIM), class-attribute interdependency maximization (CAIM), and information entropy maximization (IEM), provide discretization schemes with the highest overall quality. We show that discretization improves the classification accuracy when compared against the two classical methods, NB and Flexible Naïve Bayes (FNB), executed on the raw data. The choice of the discretization algorithm impacts the significance of the improvements: the MODL, FCAIM, and CAIM methods provide statistically significant improvements, while the IEM, class-attribute contingency coefficient (CACC), and Khiops discretizers provide moderate improvements. The most accurate classification models are generated by the averaged one-dependence estimators with subsumption resolution (AODEsr) classifier, followed by AODE and Hidden Naïve Bayes (HNB). AODEsr run on data discretized with MODL, FCAIM, and CAIM provides statistically significantly better accuracies than both of the classical NB methods. The worst results are obtained with the NB, FNB, and lazy Bayesian rules (LBR) classifiers.
We show that although the time to build the discretization scheme can be longer than the time to train the classifier, the completion of the entire process (discretizing the data, computing the classifier, and predicting the test instances) is often faster than the NB-based classification of the continuous instances. This is because the time to classify test instances is an important factor that is positively influenced by discretization. The biggest positive influence, both on the accuracy and the classification time, is associated with the MODL, FCAIM, and CAIM algorithms.
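The pipeline the abstract evaluates, discretizing continuous features up front and then training a naive Bayes classifier on the resulting nominal bins, can be sketched as follows. This is a minimal illustration, not the paper's implementation: equal-width binning stands in for the unsupervised discretizers (the supervised methods such as CAIM, IEM, and MODL instead choose cut points using the class labels), and all function names are illustrative.

```python
# Sketch of the discretize-then-classify pipeline: equal-width binning as a
# stand-in discretizer, followed by naive Bayes with Laplace smoothing on the
# resulting nominal bins.
from collections import Counter, defaultdict
import math

def equal_width_bins(values, k):
    """Return the k-1 equal-width cut points for one continuous feature."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k or 1.0  # guard against a constant feature
    return [lo + i * width for i in range(1, k)]

def discretize(value, cuts):
    """Map a continuous value to the index of its bin."""
    return sum(value > c for c in cuts)

def train_nb(X, y, k=3):
    """Discretize each feature, then estimate NB counts from the bins."""
    n_feat = len(X[0])
    cuts = [equal_width_bins([row[j] for row in X], k) for j in range(n_feat)]
    Xd = [[discretize(row[j], cuts[j]) for j in range(n_feat)] for row in X]
    class_counts = Counter(y)
    feat_counts = defaultdict(Counter)  # (class, feature) -> bin counts
    for row, label in zip(Xd, y):
        for j, b in enumerate(row):
            feat_counts[(label, j)][b] += 1
    return cuts, class_counts, feat_counts, len(y), k

def predict(x, model):
    """Classify one instance by maximizing the NB log-posterior."""
    cuts, class_counts, feat_counts, n, k = model
    best, best_lp = None, -math.inf
    for c, cc in class_counts.items():
        lp = math.log(cc / n)  # log prior
        for j, v in enumerate(x):
            b = discretize(v, cuts[j])
            # Laplace-smoothed conditional probability of the bin given c
            lp += math.log((feat_counts[(c, j)][b] + 1) / (cc + k))
        if lp > best_lp:
            best, best_lp = c, lp
    return best
```

The study's timing observation corresponds to the split above: `train_nb` pays the discretization cost once, after which `predict` only performs table lookups over bin indices, which is why the end-to-end discretized pipeline can beat NB variants that evaluate density estimates on raw continuous values at test time.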
  • Cite this article

    Marcin J. Mizianty, Lukasz A. Kurgan, Marek R. Ogiela. 2010. Discretization as the enabling technique for the Naïve Bayes and semi-Naïve Bayes-based classification. The Knowledge Engineering Review 25. doi: 10.1017/S0269888910000329



    • This work was supported in part by the NSERC Discovery grant to L. Kurgan and by the Killam Memorial Scholarship to M. J. Mizianty.

    • Copyright © Cambridge University Press 2010