Review and comparison of Apriori algorithm implementations on Hadoop-MapReduce and Spark

Eduardo P. S. Castro; Thiago D. Maia; Marluce R. Pereira; Ahmed A. A. Esmin; Denilson A. Pereira

doi:10.1017/S0269888918000127

2018 Volume 33

RESEARCH ARTICLE Open Access

Department of Computer Science

Published online: 11 July 2018
The Knowledge Engineering Review 33, Article number: e9 (2018) | Cite this article

Abstract

Several implementations of the Apriori algorithm for mining association rules have been proposed in the literature using the Hadoop-MapReduce framework and, more recently, Spark. However, none of these works has assessed performance in detail, for example by comparing an implementation with others across data sets with different characteristics. In this work, we review the main algorithms proposed for Hadoop-MapReduce and compare their implementations in a single environment under several different situations. Moreover, we adapted these implementations to Spark and compared them under the same circumstances. Based on the experimental results, we present a framework for recommending the most appropriate Apriori implementation for a given problem, according to the data set characteristics and the minimum required support. The results show that the Spark implementations outperform the Hadoop-MapReduce implementations in runtime in most experiments. However, no single implementation is the best in all the evaluated situations.
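
For readers unfamiliar with the terms used in the abstract, the following is a minimal single-process sketch of level-wise Apriori with a minimum support threshold. It is not any of the implementations reviewed in the article and it uses neither the Hadoop nor the Spark API. The function names (`count_candidates`, `apriori`), the use of an absolute support count, and the modelling of data partitions as plain Python lists are assumptions made for illustration only.

```python
from collections import Counter
from itertools import combinations


def count_candidates(partition, candidates):
    """Map-like step (hypothetical helper): count candidate itemsets in one partition."""
    counts = Counter()
    for transaction in partition:
        for candidate in candidates:
            if candidate <= transaction:  # candidate is a subset of the transaction
                counts[candidate] += 1
    return counts


def apriori(partitions, min_support):
    """Return {itemset: count} for itemsets occurring in at least min_support transactions.

    Assumption: min_support is an absolute count, not a fraction.
    """
    partitions = [[frozenset(t) for t in p] for p in partitions]
    items = {item for p in partitions for t in p for item in t}
    candidates = {frozenset([item]) for item in items}
    frequent = {}
    k = 1
    while candidates:
        # Reduce-like step: sum the per-partition counts for each candidate.
        totals = Counter()
        for p in partitions:
            totals.update(count_candidates(p, candidates))
        level = {s: c for s, c in totals.items() if c >= min_support}
        frequent.update(level)
        # Join frequent k-itemsets into (k+1)-itemset candidates.
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune candidates that have an infrequent (k-1)-subset (the Apriori property).
        candidates = {
            c for c in joined
            if all(frozenset(sub) in level for sub in combinations(c, k - 1))
        }
    return frequent


# Two partitions of a toy transaction data set.
partitions = [
    [{"bread", "milk"}, {"bread", "butter"}],
    [{"bread", "milk", "butter"}, {"milk"}],
]
result = apriori(partitions, min_support=2)
```

In this sketch each pass over the partitions stands in for one counting job. How many such passes an implementation needs, and how it generates candidates, are the kinds of design choices on which distributed Apriori variants can differ.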
Rights and permissions
© Cambridge University Press, 2018

About this article

Cite this article

Eduardo P. S. Castro, Thiago D. Maia, Marluce R. Pereira, Ahmed A. A. Esmin, Denilson A. Pereira. 2018. Review and comparison of Apriori algorithm implementations on Hadoop-MapReduce and Spark. The Knowledge Engineering Review. 33:127 doi: 10.1017/S0269888918000127

Acknowledgments

This work was partially supported by CNPq, FAPEMIG grant CEX-APQ-01834-14 and an individual scholarship from CAPES.

Footnote: LISP, a language created in 1960 (http://history.siam.org/sup/Fox_1960_LISP.pdf).
