• DocumentCode
    3206533
  • Title

    A Fast Algorithm for Constructing Inverted Files on Heterogeneous Platforms

  • Author

    Wei, Zheng ; Jaja, Joseph

  • Author_Institution
    Dept. of Electr. & Comput. Eng., Univ. of Maryland, College Park, MD, USA
  • fYear
    2011
  • fDate
    16-20 May 2011
  • Firstpage
    1124
  • Lastpage
    1134
  • Abstract
    Given a collection of documents residing on a disk, we develop a new strategy for processing these documents and building the inverted files extremely fast. Our approach is tailored for a heterogeneous platform consisting of a multicore CPU and a highly multithreaded GPU. Our algorithm is based on a number of novel techniques including: (i) a high-throughput pipelined strategy that produces parallel parsed streams that are consumed at the same rate by parallel indexers, (ii) a hybrid trie and B-tree dictionary data structure in which the trie is represented by a table for fast look-up and each B-tree node contains string caches, (iii) allocation of parsed streams with frequent terms to CPU threads and the rest to GPU threads so as to match the throughput of parsed streams, and (iv) optimized CUDA indexer implementation that ensures coalesced memory accesses and effective use of shared memory. We have performed extensive tests of our algorithm on a single node (two Intel Xeon X5560 Quad-core) with two NVIDIA Tesla C1060 attached to it, and were able to achieve a throughput of more than 262 MB/s on the ClueWeb09 dataset. Similar results were obtained for widely different datasets. The throughput of our algorithm is superior to the best known algorithms reported in the literature even when compared to those run on large clusters.
  • Keywords
    computer graphic equipment; coprocessors; data structures; multiprocessing systems; B-tree dictionary data structure; CUDA indexer; Intel Xeon X5560 Quad-core; NVIDIA Tesla C1060; central processing unit; computer unified device architecture; graphics processing unit; heterogeneous platform; high-throughput pipelined strategy; hybrid trie data structure; inverted files construction; multicore CPU; multithreaded GPU; Data structures; Dictionaries; Graphics processing unit; Indexing; Instruction sets; Throughput;
  • fLanguage
    English
  • Publisher
    ieee
  • Conference_Titel
    Parallel & Distributed Processing Symposium (IPDPS), 2011 IEEE International
  • Conference_Location
    Anchorage, AK
  • ISSN
    1530-2075
  • Print_ISBN
    978-1-61284-372-8
  • Electronic_ISBN
    1530-2075
  • Type

    conf

  • DOI
    10.1109/IPDPS.2011.107
  • Filename
    6012919