2018 Volume 33
RESEARCH ARTICLE   Open Access    

Review and comparison of Apriori algorithm implementations on Hadoop-MapReduce and Spark

  • Abstract: Several implementations of the Apriori algorithm for mining association rules have been proposed in the literature using the Hadoop-MapReduce framework and, more recently, Spark. However, none of these works has assessed performance in detail, for example by comparing an implementation with others across data sets with different characteristics. In this work, we review the main algorithms proposed for Hadoop-MapReduce and compare their implementations in a single environment under several different situations. Moreover, we adapted these implementations to Spark and compared them under the same circumstances. Based on the experimental results, we present a framework for recommending the most appropriate Apriori implementation for a given problem, according to the data set characteristics and the minimum required support. The results show that the Spark implementations outperform the Hadoop-MapReduce implementations in runtime in most experiments. However, no single implementation is the best in all the evaluated situations.
  • Cite this article

    Eduardo P. S. Castro, Thiago D. Maia, Marluce R. Pereira, Ahmed A. A. Esmin, Denilson A. Pereira. 2018. Review and comparison of Apriori algorithm implementations on Hadoop-MapReduce and Spark. The Knowledge Engineering Review 33(1), doi: 10.1017/S0269888918000127

    • This work was partially supported by CNPq, FAPEMIG grant CEX-APQ-01834-14 and an individual scholarship from CAPES.

    • Language created in 1960 (http://history.siam.org/sup/Fox_1960_LISP.pdf).

    • © Cambridge University Press, 2018