2016 Volume 31
RESEARCH ARTICLE   Open Access    

Data mining for building knowledge bases: techniques, architectures and applications

  • Abstract: Data mining techniques for extracting knowledge from text have been applied extensively to applications including question answering, document summarisation, event extraction and trend monitoring. However, current methods have mainly been tested on small-scale customised data sets for specific purposes. The availability of large volumes of data and high-velocity data streams (such as social media feeds) motivates the need to automatically extract knowledge from such data sources and to generalise existing approaches to more practical applications. Recently, several architectures have been proposed for what we call knowledge mining: integrating data mining for knowledge extraction from unstructured text (possibly making use of a knowledge base), and at the same time, consistently incorporating this new information into the knowledge base. After describing a number of existing knowledge mining systems, we review the state-of-the-art literature on both current text mining methods (emphasising stream mining) and techniques for the construction and maintenance of knowledge bases. In particular, we focus on mining entities and relations from unstructured text data sources, entity disambiguation, entity linking and question answering. We conclude by highlighting general trends in knowledge mining research and identifying problems that require further research to enable more extensive use of knowledge bases.
  • Cite this article

    Alfred Krzywicki, Wayne Wobcke, Michael Bain, John Calvo Martinez, Paul Compton. 2016. Data mining for building knowledge bases: techniques, architectures and applications. The Knowledge Engineering Review 31(2): 97−123. doi: 10.1017/S0269888916000047

The Knowledge Engineering Review, 2016, 31(2): 97−123

    • This work was supported by the Data to Decisions Cooperative Research Centre.

    • http://googleblog.blogspot.com.au/2012/05/introducing-knowledge-graph-things-not.html

    • http://www.cis.upenn.edu/~ccb/ppdb/

    • Muppet has been released as open-source software under the name of Mupd8.

    • http://www.cmu.edu/homepage/computing/2010/fall/nell-computer-that-learns.shtml

    • Severe Acute Respiratory Syndrome, a viral disease with flu-like symptoms, an outbreak of which occurred in southern China between November 2002 and July 2003, resulting in numerous deaths.

    • http://google-opensource.blogspot.com.au/2014/06/cayley-graphs-in-go.html

    • Comparisons of results for TAC 2012 and 2014 have not been published.

    • A semantic query language for retrieving and manipulating data in RDF format.

    • http://sourceforge.net/projects/zpar/

    • © Cambridge University Press, 2016