Title :
Scan-Sharing for Optimizing RDF Graph Pattern Matching on MapReduce
Author :
Kim, HyeongSik ; Ravindra, Padmashree ; Anyanwu, Kemafor
Author_Institution :
North Carolina State Univ., Raleigh, NC, USA
Abstract :
Recently, the number and size of RDF data collections have increased rapidly, making scalable processing techniques crucial. The MapReduce model has become a de facto standard for large-scale data processing on clusters of machines in the cloud. RDF query processing generally creates join-intensive workloads, resulting in lengthy MapReduce workflows with expensive I/O, data transfer, and sorting costs. However, the MapReduce computation model offers only limited support for the static optimization techniques used in relational databases (e.g., indexing and cost-based optimization). Consequently, dynamic optimization techniques for such join-intensive tasks on MapReduce need to be investigated. In previous work, we proposed a Nested Triple Group data model and Algebra (NTGA) for efficient graph pattern query processing in the cloud. Here, we extend that work with a scan-sharing technique that optimizes the processing of graph patterns with repeated properties. Specifically, our scan-sharing technique eliminates the need to scan input relations repeatedly when a property is used more than once in a graph pattern. We discuss the formal foundation underlying this scan-sharing technique and present an implementation strategy that has been integrated into the Apache Pig framework. We also present a comprehensive evaluation demonstrating the performance benefits of our NTGA plus scan-sharing approach.
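The idea of sharing one scan across repeated properties can be illustrated with a small, single-machine sketch. This is not the authors' NTGA or Apache Pig implementation; the toy triples, the two-hop pattern, and the function names (scan_per_pattern, shared_scan, join_two_hop) are all assumptions made for illustration. It only contrasts scanning the input once per triple pattern with scanning it once in total and routing each triple to every pattern that uses its property.

```python
from collections import defaultdict

# Toy triples (subject, property, object); illustrative data only.
TRIPLES = [
    ("p1", "knows", "p2"),
    ("p2", "knows", "p3"),
    ("p1", "name", "Ann"),
    ("p3", "name", "Cy"),
]

# Graph pattern: ?a knows ?b . ?b knows ?c  (the property "knows" repeats)
PATTERN_PROPERTIES = ["knows", "knows"]


def scan_per_pattern(triples, properties):
    """Naive plan: one full scan of the input per triple pattern."""
    scans = 0
    relations = []
    for prop in properties:
        scans += 1
        relations.append([t for t in triples if t[1] == prop])
    return relations, scans


def shared_scan(triples, properties):
    """Shared plan: a single scan; each triple is routed to every
    triple pattern that uses its property."""
    wanted = defaultdict(list)  # property -> indexes of patterns using it
    for i, prop in enumerate(properties):
        wanted[prop].append(i)
    relations = [[] for _ in properties]
    for t in triples:  # the only pass over the input
        for i in wanted.get(t[1], ()):
            relations[i].append(t)
    return relations, 1


def join_two_hop(left, right):
    """Join (?a knows ?b) with (?b knows ?c) on the shared variable ?b."""
    return [(a, b, c) for (a, _, b) in left for (b2, _, c) in right if b == b2]


if __name__ == "__main__":
    naive, naive_scans = scan_per_pattern(TRIPLES, PATTERN_PROPERTIES)
    shared, shared_scans = shared_scan(TRIPLES, PATTERN_PROPERTIES)
    print(naive_scans, shared_scans)
    print(join_two_hop(*shared))
```

Both plans produce the same per-pattern relations, so the join result is unchanged; only the number of passes over the input differs. In a MapReduce setting each such pass corresponds to reading the input relation again, which is the cost the abstract says scan-sharing avoids.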
Keywords :
cloud computing; data models; graph theory; pattern matching; program diagnostics; query processing; relational databases; software performance evaluation; Apache Pig framework; MapReduce computation model; MapReduce workflows; NTGA; RDF data collections; RDF graph pattern matching optimization; RDF query processing; data processing; data transfer; dynamic optimization techniques; graph pattern query processing; join-intensive workloads; machine cluster; nested triple group data model-and-algebra; resource description framework; scan-sharing technique; sorting costs; static optimization techniques; Algebra; Cloning; Context; Data models; Optimization; Pattern matching; Resource description framework; MapReduce; Optimization Techniques; RDF Graph Pattern Matching; SPARQL
Conference_Title :
Cloud Computing (CLOUD), 2012 IEEE 5th International Conference on
Conference_Location :
Honolulu, HI
Print_ISBN :
978-1-4673-2892-0
DOI :
10.1109/CLOUD.2012.14