Extending Spark Analytics through Tika-Based Information Extraction and Retrieval

Author

Rishi Verma; Chris Mattmann

Author_Institution

Comput. Sci. Dept., Univ. of Southern California, Los Angeles, CA, USA

fYear

2015

Firstpage

215

Lastpage

218

Abstract

In this paper, we focus on techniques for merging the parallelized data processing (i.e., map-reduce) capabilities of Apache Spark with the extensive file-type parsing support of Apache Tika. Each framework has its own appeal for data scientists: Spark makes parallelized processing of very large, often text-based data sets highly efficient, while Tika provides consistent information extraction across more than 1,200 text and binary file types, processing files sequentially. The technical integration of these two frameworks is the subject of our investigation and is relevant to data scientists pursuing two types of use cases: (1) analysis of numerous (1000x) unpartitioned small to medium-sized Tika-parseable files, and (2) analysis of very large, partitionable Tika-parseable files. Given Tika's specialization in file extraction and Spark's specialization in parallelized computing, there is a need to explore the benefits of integrating them. We therefore investigate best practices that give data scientists tools for gaining insight into a greater portion of the data formats commonly in use.

Keywords

"Sparks","Data mining","Information retrieval","Antarctica","Portable document format","Libraries","Meteorology"

Publisher

IEEE

Conference_Title

Information Reuse and Integration (IRI), 2015 IEEE International Conference on

Type

conf

DOI

10.1109/IRI.2015.43

Filename

7300979