DocumentCode
3673634
Title
Extending Spark Analytics through Tika-Based Information Extraction and Retrieval
Author
Rishi Verma;Chris Mattmann
Author_Institution
Comput. Sci. Dept., Univ. of Southern California, Los Angeles, CA, USA
fYear
2015
Firstpage
215
Lastpage
218
Abstract
In this paper, we focus on techniques to merge the parallelized data processing (i.e. map-reduce) capabilities of Apache Spark with the extensive file-type parsing support of Apache Tika. These two frameworks each have unique appeal for data scientists. Where Spark makes highly efficient the parallelized processing of very large, often text-based data sets, Tika makes consistent the information extraction of over 1,200 text and binary file types on a sequential file basis. The technical integration of these two frameworks is the subject of our investigation, and is relevant for data scientists pursuing two types of use cases: (1) analysis of numerous (1000x) un-partitioned small to medium sized Tika parse-able files, and (2) analysis of very large partition-able Tika parse-able files. Given Tika's niche specialization of file extraction and Spark's specialization of parallelized computing, there is a need to explore the benefits of integration. Thus, we investigate best practices so as to empower data scientists with tools to gain insight into a greater portion of data formats commonly in use.
Keywords
"Sparks","Data mining","Information retrieval","Antarctica","Portable document format","Libraries","Meteorology"
Publisher
ieee
Conference_Titel
Information Reuse and Integration (IRI), 2015 IEEE International Conference on
Type
conf
DOI
10.1109/IRI.2015.43
Filename
7300979