• DocumentCode
    2673588
  • Title
    Parallel and distributed approach for processing large-scale XML datasets
  • Author
    Fadika, Zacharia ; Head, Michael R. ; Govindaraju, Madhusudhan
  • Author_Institution
    Comput. Sci. Dept., Binghamton Univ., Binghamton, NY, USA
  • fYear
    2009
  • fDate
    13-15 Oct. 2009
  • Firstpage
    105
  • Lastpage
    112
  • Abstract
    An emerging trend is the use of XML as the data format for many distributed scientific applications, with the size of these documents ranging from tens of megabytes to hundreds of megabytes. Our earlier benchmarking results revealed that most of the widely available XML processing toolkits do not scale well for large XML data. A significant transformation is necessary in the design of XML processing for scientific applications so that the overall application turn-around time is not negatively affected. We present both a parallel and a distributed approach to analyze how the scalability and performance requirements of large-scale XML-based data processing can be achieved. We have adapted the Hadoop implementation to determine the threshold data sizes and computation work required per node for a distributed solution to be effective. We also present an analysis of parallelism using our Piximal toolkit for processing large-scale XML datasets, which utilizes the capabilities for parallelism that are available in the emerging multi-core architectures. Multi-core processors are expected to be widely available in research clusters and scientific desktops, and it is critical to harness the opportunities for parallelism in the middleware, instead of passing on the task to application programmers. Our parallelization approach for a multi-core node is to employ a DFA-based parser that recognizes a useful subset of the XML specification, and convert the DFA into an NFA that can be applied to an arbitrary subset of the input. Speculative NFAs are scheduled on available cores in a node to effectively utilize the processing capabilities and achieve overall performance gains. We evaluate the efficacy of this approach in terms of the potential speedup that can be achieved for representative XML datasets.
  • Keywords
    XML; finite automata; grammars; parallel processing; DFA-based parser; Hadoop implementation; Piximal toolkit; XML data processing toolkit; XML specification; data format; deterministic finite automata; distributed scientific application; eXtensible Markup Language; large-scale XML dataset; middleware; multicore architectures; multicore node; multicore processors; parallel processing; Computer architecture; Data processing; Distributed computing; Large-scale systems; Multicore processing; Parallel processing; Performance analysis; Process design; Scalability; XML;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Grid Computing, 2009 10th IEEE/ACM International Conference on
  • Conference_Location
    Banff, AB
  • Print_ISBN
    978-1-4244-5148-7
  • Electronic_ISBN
    978-1-4244-5149-4
  • Type
    conf
  • DOI
    10.1109/GRID.2009.5353070
  • Filename
    5353070