RESEARCH ARTICLE   Open Access    

Discretization as the enabling technique for the Naïve Bayes and semi-Naïve Bayes-based classification

More Information
  • Corresponding authors: Marcin J. Mizianty ;  Lukasz A. Kurgan ;  Marek R. Ogiela
  • Abstract: Current classification problems that concern data sets of large and increasing size require scalable classification algorithms. In this study, we concentrate on several scalable, linear-complexity classifiers that include one of the top 10 voted data mining methods, Naïve Bayes (NB), and several recently proposed semi-NB classifiers. These algorithms perform front-end discretization of the continuous features since, by design, they work only with nominal or discrete features. We address the lack of studies that investigate the benefits and drawbacks of discretization in the context of the subsequent classification. Our comprehensive empirical study considers 12 discretizers (two unsupervised and 10 supervised), seven classifiers (two classical NB and five semi-NB), and 16 data sets. We investigate the scalability of the discretizers and show that the fastest supervised discretizers, fast class-attribute interdependency maximization (FCAIM), class-attribute interdependency maximization (CAIM), and information entropy maximization (IEM), provide discretization schemes with the highest overall quality. We show that discretization improves the classification accuracy when compared against the two classical methods, NB and Flexible Naïve Bayes (FNB), executed on the raw data. The choice of the discretization algorithm impacts the significance of the improvements: the MODL, FCAIM, and CAIM methods provide statistically significant improvements, while the IEM, class-attribute contingency coefficient (CACC), and Khiops discretizers provide moderate improvements. The most accurate classification models are generated by the averaged one-dependence estimators with subsumption resolution (AODEsr) classifier, followed by AODE and Hidden Naïve Bayes (HNB). AODEsr run on data discretized with MODL, FCAIM, and CAIM provides statistically significantly better accuracies than both of the classical NB methods. The worst results are obtained with the NB, FNB, and lazy Bayesian rules (LBR) classifiers.
We show that although the time to build the discretization scheme can be longer than the time to train the classifier, the completion of the entire process (discretizing the data, computing the classifier, and predicting the test instances) is often faster than the NB-based classification of the continuous instances. This is because the time to classify test instances is an important factor that is positively influenced by discretization. The biggest positive influence, both on the accuracy and the classification time, is associated with the MODL, FCAIM, and CAIM algorithms.
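The pipeline the abstract evaluates, discretizing continuous features up front and then training a naive Bayes classifier on the resulting nominal bins, can be sketched as follows. This is a minimal illustration, not the paper's implementation: equal-width binning stands in for the unsupervised discretizers (the supervised methods such as CAIM, IEM, and MODL instead choose cut points using the class labels), and all function names are illustrative.

```python
# Sketch of the discretize-then-classify pipeline: equal-width binning as a
# stand-in discretizer, followed by naive Bayes with Laplace smoothing on the
# resulting nominal bins.
from collections import Counter, defaultdict
import math

def equal_width_bins(values, k):
    """Return the k-1 equal-width cut points for one continuous feature."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k or 1.0  # guard against a constant feature
    return [lo + i * width for i in range(1, k)]

def discretize(value, cuts):
    """Map a continuous value to the index of its bin."""
    return sum(value > c for c in cuts)

def train_nb(X, y, k=3):
    """Discretize each feature, then estimate NB counts from the bins."""
    n_feat = len(X[0])
    cuts = [equal_width_bins([row[j] for row in X], k) for j in range(n_feat)]
    Xd = [[discretize(row[j], cuts[j]) for j in range(n_feat)] for row in X]
    class_counts = Counter(y)
    feat_counts = defaultdict(Counter)  # (class, feature) -> bin counts
    for row, label in zip(Xd, y):
        for j, b in enumerate(row):
            feat_counts[(label, j)][b] += 1
    return cuts, class_counts, feat_counts, len(y), k

def predict(x, model):
    """Classify one instance by maximizing the NB log-posterior."""
    cuts, class_counts, feat_counts, n, k = model
    best, best_lp = None, -math.inf
    for c, cc in class_counts.items():
        lp = math.log(cc / n)  # log prior
        for j, v in enumerate(x):
            b = discretize(v, cuts[j])
            # Laplace-smoothed conditional probability of the bin given c
            lp += math.log((feat_counts[(c, j)][b] + 1) / (cc + k))
        if lp > best_lp:
            best, best_lp = c, lp
    return best
```

The study's timing observation corresponds to the split above: `train_nb` pays the discretization cost once, after which `predict` only performs table lookups over bin indices, which is why the end-to-end discretized pipeline can beat NB variants that evaluate density estimates on raw continuous values at test time.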
  • Cite this article

    Marcin J. Mizianty, Lukasz A. Kurgan, Marek R. Ogiela. 2010. Discretization as the enabling technique for the Naïve Bayes and semi-Naïve Bayes-based classification. The Knowledge Engineering Review 25. doi: 10.1017/S0269888910000329



    • This work was supported in part by the NSERC Discovery grant to L. Kurgan and by the Killam Memorial Scholarship to M. J. Mizianty.

    • Copyright © Cambridge University Press 2010