Title :
Answering approximate string queries on large data sets using external memory
Author :
Behm, Alexander ; Li, Chen ; Carey, Michael J.
Author_Institution :
Deptarment of Comput. Sci., Univ. of California, Irvine, CA, USA
Abstract :
An approximate string query is to find from a collection of strings those that are similar to a given query string. Answering such queries is important in many applications such as data cleaning and record linkage, where errors could occur in queries as well as the data. Many existing algorithms have focused on in-memory indexes. In this paper we investigate how to efficiently answer such queries in a disk-based setting, by systematically studying the effects of storing data and indexes on disk. We devise a novel physical layout for an inverted index to answer queries and we study how to construct it with limited buffer space. To answer queries, we develop a cost-based, adaptive algorithm that balances the I/O costs of retrieving candidate matches and accessing inverted lists. Experiments on large, real datasets verify that simply adapting existing algorithms to a disk-based setting does not work well and that our new techniques answer queries efficiently. Further, our solutions significantly outperform a recent tree-based index, BED-tree.
Keywords :
query processing; question answering (information retrieval); trees (mathematics); BED-tree; adaptive algorithm; approximate string query answering; data cleaning; disk-based setting; external memory; in-memory indexes; large data sets; record linkage; Adaptive algorithms; Approximation algorithms; Indexes; Layout; Maintenance engineering; Organizations; Standards organizations;
Conference_Titel :
Data Engineering (ICDE), 2011 IEEE 27th International Conference on
Conference_Location :
Hannover
Print_ISBN :
978-1-4244-8959-6
Electronic_ISBN :
1063-6382
DOI :
10.1109/ICDE.2011.5767856