• DocumentCode
    49541
  • Title

    Efficient and Scalable Processing of String Similarity Join

  • Author

    Chuitian Rong ; Wei Lu ; Xiaoli Wang ; Xiaoyong Du ; Yueguo Chen ; Tung, A.K.H.

  • Author_Institution
    Key Lab. of Data Eng. & Knowledge Eng., Renmin Univ. of China, Beijing, China
  • Volume
    25
  • Issue
    10
  • fYear
    2013
  • fDate
    Oct. 2013
  • Firstpage
    2217
  • Lastpage
    2230
  • Abstract
    The string similarity join is a basic operation of many applications that need to find all string pairs from a collection given a similarity function and a user-specified threshold. Recently, there has been considerable interest in designing new algorithms with the assistant of an inverted index to support efficient string similarity joins. These algorithms typically adopt a two-step filter-and-refine approach in identifying similar string pairs: 1) generating candidate pairs by traversing the inverted index; and 2) verifying the candidate pairs by computing the similarity. However, these algorithms either suffer from poor filtering power (which results in high verification cost), or incur too much computational cost to guarantee the filtering power. In this paper, we propose a multiple prefix filtering method based on different global orderings such that the number of candidate pairs can be reduced significantly. We also propose a parallel extension of the algorithm that is efficient and scalable in a MapReduce framework. We conduct extensive experiments on both centralized and Hadoop systems using both real and synthetic data sets, and the results show that our proposed approach outperforms existing approaches in both efficiency and scalability.
  • Keywords
    data handling; indexing; parallel processing; Hadoop system; MapReduce framework; candidate pair generation; computational cost; filtering power; global ordering; inverted index; multiple prefix filtering method; parallel extension; scalable processing; similar string pair identification; similarity function; string similarity join; two-step filter-and-refine approach; verification cost; Algorithm design and analysis; Educational institutions; Filtering; Indexes; Pipeline processing; Transforms; XML; Algorithm design and analysis; Educational institutions; Filtering; Indexes; MapReduce; Pipeline processing; Similarity join; Transforms; XML; multiple filtering;
  • fLanguage
    English
  • Journal_Title
    Knowledge and Data Engineering, IEEE Transactions on
  • Publisher
    ieee
  • ISSN
    1041-4347
  • Type

    jour

  • DOI
    10.1109/TKDE.2012.195
  • Filename
    6319300