• DocumentCode
    3673634
  • Title

    Extending Spark Analytics through Tika-Based Information Extraction and Retrieval

  • Author

    Rishi Verma;Chris Mattmann

  • Author_Institution
    Comput. Sci. Dept., Univ. of Southern California, Los Angeles, CA, USA
  • fYear
    2015
  • Firstpage
    215
  • Lastpage
    218
  • Abstract
    In this paper, we focus on techniques to merge the parallelized data processing (i.e. map-reduce) capabilities of Apache Spark with the extensive file-type parsing support of Apache Tika. These two frameworks each have unique appeal for data scientists. Where Spark makes highly efficient the parallelized processing of very large, often text-based data sets, Tika makes consistent the information extraction of over 1,200 text and binary file types on a sequential file basis. The technical integration of these two frameworks is the subject of our investigation, and is relevant for data scientists pursuing two types of use cases: (1) analysis of numerous (1000x) un-partitioned small to medium sized Tika parse-able files, and (2) analysis of very large partition-able Tika parse-able files. Given Tika's niche specialization of file extraction and Spark's specialization of parallelized computing, there is a need to explore the benefits of integration. Thus, we investigate best practices so as to empower data scientists with tools to gain insight into a greater portion of data formats commonly in use.
  • Keywords
    "Sparks","Data mining","Information retrieval","Antarctica","Portable document format","Libraries","Meteorology"
  • Publisher
    ieee
  • Conference_Titel
    Information Reuse and Integration (IRI), 2015 IEEE International Conference on
  • Type

    conf

  • DOI
    10.1109/IRI.2015.43
  • Filename
    7300979