• DocumentCode
    9162
  • Title

    VChunkJoin: An Efficient Algorithm for Edit Similarity Joins

  • Author

    Wei Wang ; Jianbin Qin ; Chuan Xiao ; Xuemin Lin ; Shen, Heng Tao

  • Author_Institution
    Sch. of Comput. Sci. & Eng., Univ. of New South Wales (UNSW), Sydney, NSW, Australia
  • Volume
    25
  • Issue
    8
  • fYear
    2013
  • fDate
    Aug. 2013
  • Firstpage
    1916
  • Lastpage
    1929
  • Abstract
    Similarity joins play an important role in many application areas, such as data integration and cleaning, record linkage, and pattern recognition. In this paper, we study efficient algorithms for similarity joins with an edit distance constraint. Currently, the most prevalent approach is based on extracting overlapping grams from strings and considering only strings that share a certain number of grams as candidates. Unlike these existing approaches, we propose a novel approach to edit similarity join based on extracting nonoverlapping substrings, or chunks, from strings. We propose a class of chunking schemes based on the notion of tail-restricted chunk boundary dictionary. A new algorithm, VChunkJoin, is designed by integrating existing filtering methods and several new filters unique to our chunk-based method. We also design a greedy algorithm to automatically select a good chunking scheme for a given data set. We demonstrate experimentally that the new algorithm is faster than alternative methods yet occupies less space.
  • Keywords
    constraint handling; data handling; data integration; pattern recognition; VChunkJoin; data cleaning; data integration; edit distance constraint; edit similarity joins; filtering methods; nonoverlapping substrings; overlapping grams; pattern recognition; record linkage; tail-restricted chunk boundary dictionary; Algorithm design and analysis; Approximation algorithms; Arrays; Dictionaries; Filtering; Indexes; Robustness; Edit similarity joins; approximate string matching; edit distance; prefix filtering; q-gram;
  • fLanguage
    English
  • Journal_Title
    Knowledge and Data Engineering, IEEE Transactions on
  • Publisher
    ieee
  • ISSN
    1041-4347
  • Type

    jour

  • DOI
    10.1109/TKDE.2012.79
  • Filename
    6185552