Parallel and distributed approach for processing large-scale XML datasets

Author

Fadika, Zacharia ; Head, Michael R. ; Govindaraju, Madhusudhan

Author_Institution

Comput. Sci. Dept., Binghamton Univ., Binghamton, NY, USA

fYear

2009

fDate

13-15 Oct. 2009

Firstpage

105

Lastpage

112

Abstract

An emerging trend is the use of XML as the data format for many distributed scientific applications, with document sizes ranging from tens to hundreds of megabytes. Our earlier benchmarking results revealed that most widely available XML processing toolkits do not scale well for large XML documents. A significant change in the design of XML processing for scientific applications is necessary so that overall application turnaround time is not negatively affected. We present both a parallel and a distributed approach, and analyze how the scalability and performance requirements of large-scale XML-based data processing can be met. We have adapted the Hadoop implementation to determine the threshold data sizes and the computation work required per node for a distributed solution to be effective. We also present an analysis of parallelism using our Piximal toolkit for processing large-scale XML datasets, which exploits the parallelism available in emerging multi-core architectures. Multi-core processors are expected to be widely available in research clusters and scientific desktops, and it is critical to harness the opportunities for parallelism in the middleware instead of passing the task on to application programmers. Our parallelization approach for a multi-core node is to employ a DFA-based parser that recognizes a useful subset of the XML specification, and to convert the DFA into an NFA that can be applied to an arbitrary subset of the input. Speculative NFAs are scheduled on the available cores in a node to utilize the processing capabilities effectively and achieve overall performance gains. We evaluate the efficacy of this approach in terms of the potential speedup that can be achieved for representative XML data sets.
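The DFA-to-NFA idea in the abstract can be illustrated with a minimal sketch. It is not Piximal's code: the three-state automaton, the names `step`, `speculate`, and `parallel_final_state`, and the chunking scheme are all assumptions made for illustration. Each chunk after the first has an unknown start state, so the worker runs the DFA from every possible state (the NFA-like speculation) and returns a start-to-end table; a cheap sequential pass then chains the tables from the true start state.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical toy automaton: tracks only whether we are inside a tag.
TEXT, TAG, ERROR = "TEXT", "TAG", "ERROR"
STATES = (TEXT, TAG, ERROR)

def step(state, ch):
    """One DFA transition for a single input character."""
    if state == TEXT:
        return TAG if ch == "<" else TEXT
    if state == TAG:
        if ch == ">":
            return TEXT
        return ERROR if ch == "<" else TAG
    return ERROR  # ERROR is absorbing

def run_dfa(start, chunk):
    """Sequential reference: run the DFA over chunk from a known state."""
    state = start
    for ch in chunk:
        state = step(state, ch)
    return state

def speculate(chunk):
    """Run the chunk from every possible start state (speculative NFA)."""
    return {s: run_dfa(s, chunk) for s in STATES}

def split(text, n):
    """Cut text into at most n contiguous chunks of near-equal size."""
    size = max(1, -(-len(text) // n))  # ceiling division
    return [text[i:i + size] for i in range(0, len(text), size)]

def parallel_final_state(text, workers=4):
    """Speculate on all chunks concurrently, then chain the tables."""
    chunks = split(text, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        tables = list(pool.map(speculate, chunks))
    state = TEXT  # the true start state is known only for the first chunk
    for table in tables:
        state = table[state]
    return state
```

The sketch only shows the structure of the computation. In CPython, threads do not run this CPU-bound loop in parallel, so a real speedup would need processes or a language with native threads, and a realistic parser would carry more state than a single inside-tag flag.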

Keywords

XML; finite automata; grammars; parallel processing; DFA-based parser; Hadoop implementation; Piximal toolkit; XML data processing toolkit; XML specification; data format; deterministic finite automata; distributed scientific application; eXtensible Markup Language; large-scale XML dataset; middleware; multicore architectures; multicore node; multicore processors; Computer architecture; Data processing; Distributed computing; Large-scale systems; Multicore processing; Performance analysis; Process design; Scalability

fLanguage

English

Publisher

IEEE

Conference_Title

Grid Computing, 2009 10th IEEE/ACM International Conference on

Conference_Location

Banff, AB

Print_ISBN

978-1-4244-5148-7

Electronic_ISBN

978-1-4244-5149-4

Type

conf

DOI

10.1109/GRID.2009.5353070

Filename

5353070