• DocumentCode
    1791781
  • Title
    Integrating Data Mining and Data Management Technologies for Scholarly Inquiry
  • Author
    Larson, Ray R. ; Marciano, Richard ; Chien-Yi Hou ; Shreyas ; Watry, Paul ; Harrison, Jonathan ; Aguilar, Luis ; Fuselier, Jerome
  • Author_Institution
    Sch. of Inf., Univ. of California, Berkeley, Berkeley, CA, USA
  • fYear
    2014
  • fDate
    27-30 Oct. 2014
  • Firstpage
    67
  • Lastpage
    71
  • Abstract
    This short paper discusses the “Integrating Data Mining and Data Management Technologies for Scholarly Inquiry” project. In this “Round Two” Digging Into Data Challenge award, we explored uses and approaches for large-scale data analysis and processing for the Humanities and Social Sciences through the integration of several infrastructure frameworks: Cheshire, iRODS, and Amazon Web Services (EC2 computing and S3 storage). Our “big data” consisted of the entire texts collection of the Internet Archive (approximately 3.6 million volumes) and the entire JSTOR database. We performed surface-level natural language processing on this data to identify noun phrases and further refinements to identify personal, corporate, and geographic names. We then used resources including library and archival authority records to identify variants and merge names. The goal is to create an integrated index of persons, places, and organizations referenced in our collections.
  • Keywords
    Big Data; Web services; data mining; merging; natural language processing; text analysis; Amazon Web Services; Cheshire; Data Challenge award; EC2 computing; Internet Archive; JSTOR database; S3 storage; archival authority records; corporate name identification; data management technology integration; data mining technology integration; geographic name identification; humanities; iRODS; integrated person-place-organization index; large-scale data analysis; large-scale data processing; library records; name merging; noun phrases; personal name identification; scholarly inquiry; social sciences; surface-level natural language processing; text collection; Educational institutions; Indexing; Internet; Libraries; Prototypes; XML; Cheshire3; JSTOR; data management
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    2014 IEEE International Conference on Big Data (Big Data)
  • Conference_Location
    Washington, DC
  • Type
    conf
  • DOI
    10.1109/BigData.2014.7004455
  • Filename
    7004455