DocumentCode
3143541
Title
Answering approximate string queries on large data sets using external memory
Author
Behm, Alexander ; Li, Chen ; Carey, Michael J.
Author_Institution
Deptarment of Comput. Sci., Univ. of California, Irvine, CA, USA
fYear
2011
fDate
11-16 April 2011
Firstpage
888
Lastpage
899
Abstract
An approximate string query is to find from a collection of strings those that are similar to a given query string. Answering such queries is important in many applications such as data cleaning and record linkage, where errors could occur in queries as well as the data. Many existing algorithms have focused on in-memory indexes. In this paper we investigate how to efficiently answer such queries in a disk-based setting, by systematically studying the effects of storing data and indexes on disk. We devise a novel physical layout for an inverted index to answer queries and we study how to construct it with limited buffer space. To answer queries, we develop a cost-based, adaptive algorithm that balances the I/O costs of retrieving candidate matches and accessing inverted lists. Experiments on large, real datasets verify that simply adapting existing algorithms to a disk-based setting does not work well and that our new techniques answer queries efficiently. Further, our solutions significantly outperform a recent tree-based index, BED-tree.
Keywords
query processing; question answering (information retrieval); trees (mathematics); BED-tree; adaptive algorithm; approximate string query answering; data cleaning; disk-based setting; external memory; in-memory indexes; large data sets; record linkage; Adaptive algorithms; Approximation algorithms; Indexes; Layout; Maintenance engineering; Organizations; Standards organizations;
fLanguage
English
Publisher
ieee
Conference_Titel
Data Engineering (ICDE), 2011 IEEE 27th International Conference on
Conference_Location
Hannover
ISSN
1063-6382
Print_ISBN
978-1-4244-8959-6
Electronic_ISBN
1063-6382
Type
conf
DOI
10.1109/ICDE.2011.5767856
Filename
5767856
Link To Document