Scaling up classification rule induction through parallel processing

Frederic Stahl; Max Bramer

doi:10.1017/S0269888912000355

2013 Volume 28

RESEARCH ARTICLE Open Access

Frederic Stahl¹, Max Bramer²

1. School of Systems Engineering
2. School of Computing

Published online: 26 November 2012
The Knowledge Engineering Review 28, Article number: 10.1017/S0269888912000355 (2013)

Abstract

The rapid increase in the size and number of databases demands data mining approaches that scale to large amounts of data. This has led to the exploration of parallel computing technologies that perform data mining tasks concurrently on several processors. Parallelization seems to be a natural and cost-effective way to scale up data mining technologies. One of the most important of these technologies is the classification of newly recorded data. This paper surveys advances in parallelization in the field of classification rule induction.

Rights and permissions

Copyright © Cambridge University Press 2012

About this article

Cite this article

Frederic Stahl, Max Bramer. 2013. Scaling up classification rule induction through parallel processing. The Knowledge Engineering Review. 28:355 doi: 10.1017/S0269888912000355