DocumentCode :
3673634
Title :
Extending Spark Analytics through Tika-Based Information Extraction and Retrieval
Author :
Rishi Verma;Chris Mattmann
Author_Institution :
Comput. Sci. Dept., Univ. of Southern California, Los Angeles, CA, USA
fYear :
2015
Firstpage :
215
Lastpage :
218
Abstract :
In this paper, we focus on techniques to merge the parallelized data processing (i.e. map-reduce) capabilities of Apache Spark with the extensive file-type parsing support of Apache Tika. These two frameworks each have unique appeal for data scientists. Where Spark makes highly efficient the parallelized processing of very large, often text-based data sets, Tika makes consistent the information extraction of over 1,200 text and binary file types on a sequential file basis. The technical integration of these two frameworks is the subject of our investigation, and is relevant for data scientists pursuing two types of use cases: (1) analysis of numerous (1000x) un-partitioned small to medium sized Tika parse-able files, and (2) analysis of very large partition-able Tika parse-able files. Given Tika´s niche specialization of file extraction and Spark´s specialization of parallelized computing, there is a need to explore the benefits of integration. Thus, we investigate best practices so as to empower data scientists with tools to gain insight into a greater portion of data formats commonly in use.
Keywords :
"Sparks","Data mining","Information retrieval","Antarctica","Portable document format","Libraries","Meteorology"
Publisher :
ieee
Conference_Titel :
Information Reuse and Integration (IRI), 2015 IEEE International Conference on
Type :
conf
DOI :
10.1109/IRI.2015.43
Filename :
7300979
Link To Document :
بازگشت