2013 Volume 28
RESEARCH ARTICLE   Open Access    

Scaling up classification rule induction through parallel processing

  • Abstract: The fast increase in the size and number of databases demands data mining approaches that are scalable to large amounts of data. This has led to the exploration of parallel computing technologies in order to perform data mining tasks concurrently using several processors. Parallelization seems to be a natural and cost-effective way to scale up data mining technologies. One of the most important of these data mining technologies is the classification of newly recorded data. This paper surveys advances in parallelization in the field of classification rule induction.

  • Cite this article

    Frederic Stahl, Max Bramer. 2013. Scaling up classification rule induction through parallel processing. The Knowledge Engineering Review 28(4), 451–478. doi: 10.1017/S0269888912000355


    • We would like to thank the University of Portsmouth and in particular the School of Computing for providing a PhD student bursary that supported this research.

    • Copyright © Cambridge University Press 2012