Title :
Storing, Indexing and Querying Large Provenance Data Sets as RDF Graphs in Apache HBase
Author :
Chebotko, Artem ; Abraham, Jibi ; Brazier, Pearl ; Piazza, A. ; Kashlev, Andrey ; Lu, Shiyong
Author_Institution :
Dept. of Comput. Sci., Univ. of Texas - Pan American, Edinburg, TX, USA
fDate :
June 28 2013-July 3 2013
Abstract :
Provenance, which records the history of an in-silico experiment, has been identified as an important requirement for scientific workflows to support scientific discovery reproducibility, result interpretation, and problem diagnosis. Large provenance datasets are composed of many smaller provenance graphs, each of which corresponds to a single workflow execution. In this work, we explore and address the challenge of efficient and scalable storage and querying of large collections of provenance graphs serialized as RDF graphs in an Apache HBase database. Specifically, we propose: (i) novel storage and indexing techniques for RDF data in HBase that are better suited for provenance datasets than for generic RDF graphs and (ii) novel SPARQL query evaluation algorithms that rely solely on indices to compute expensive join operations, make use of numeric values that represent triple positions rather than actual triples, and eliminate the need for intermediate data transfers over a network. The empirical evaluation of our algorithms using provenance datasets and queries of the University of Texas Provenance Benchmark confirms that our approach is efficient and scalable.
Keywords :
distributed databases; indexing; natural sciences computing; query processing; storage management; Apache HBase database; RDF graph; Resource Description Framework; SPARQL query evaluation algorithms; University of Texas provenance benchmark; join operations; large provenance data set indexing; large provenance data set querying; large provenance data set storage; provenance graphs; scientific discovery problem diagnosis; scientific discovery reproducibility; scientific discovery result interpretation; scientific workflows; Educational institutions; Indexing; Pattern matching; Query processing; Resource description framework; Vectors; HBase; RDF; SPARQL; big data; distributed database; provenance; query; scalability; scientific workflow;
Conference_Titel :
Services (SERVICES), 2013 IEEE Ninth World Congress on
Conference_Location :
Santa Clara, CA
Print_ISBN :
978-0-7695-5024-4
DOI :
10.1109/SERVICES.2013.32