• DocumentCode
    3143541
  • Title

    Answering approximate string queries on large data sets using external memory

  • Author

    Behm, Alexander ; Li, Chen ; Carey, Michael J.

  • Author_Institution
    Deptarment of Comput. Sci., Univ. of California, Irvine, CA, USA
  • fYear
    2011
  • fDate
    11-16 April 2011
  • Firstpage
    888
  • Lastpage
    899
  • Abstract
    An approximate string query is to find from a collection of strings those that are similar to a given query string. Answering such queries is important in many applications such as data cleaning and record linkage, where errors could occur in queries as well as the data. Many existing algorithms have focused on in-memory indexes. In this paper we investigate how to efficiently answer such queries in a disk-based setting, by systematically studying the effects of storing data and indexes on disk. We devise a novel physical layout for an inverted index to answer queries and we study how to construct it with limited buffer space. To answer queries, we develop a cost-based, adaptive algorithm that balances the I/O costs of retrieving candidate matches and accessing inverted lists. Experiments on large, real datasets verify that simply adapting existing algorithms to a disk-based setting does not work well and that our new techniques answer queries efficiently. Further, our solutions significantly outperform a recent tree-based index, BED-tree.
  • Keywords
    query processing; question answering (information retrieval); trees (mathematics); BED-tree; adaptive algorithm; approximate string query answering; data cleaning; disk-based setting; external memory; in-memory indexes; large data sets; record linkage; Adaptive algorithms; Approximation algorithms; Indexes; Layout; Maintenance engineering; Organizations; Standards organizations;
  • fLanguage
    English
  • Publisher
    ieee
  • Conference_Titel
    Data Engineering (ICDE), 2011 IEEE 27th International Conference on
  • Conference_Location
    Hannover
  • ISSN
    1063-6382
  • Print_ISBN
    978-1-4244-8959-6
  • Electronic_ISBN
    1063-6382
  • Type

    conf

  • DOI
    10.1109/ICDE.2011.5767856
  • Filename
    5767856