• DocumentCode
    3143741
  • Title
    Fast-join: An efficient method for fuzzy token matching based string similarity join
  • Author
    Wang, Jiannan ; Li, Guoliang ; Feng, Jianhua
  • Author_Institution
    Dept. of Comput. Sci. & Technol., Tsinghua Univ., Beijing, China
  • fYear
    2011
  • fDate
    11-16 April 2011
  • Firstpage
    458
  • Lastpage
    469
  • Abstract
    String similarity join, which finds similar string pairs between two string sets, is an essential operation in many applications and has attracted significant attention recently in the database community. A significant challenge in similarity join is to implement an effective fuzzy match operation that finds all similar string pairs, including pairs that do not match exactly. In this paper, we propose a new similarity metric, called “fuzzy token matching based similarity”, which extends token-based similarity functions (e.g., Jaccard similarity and Cosine similarity) by allowing fuzzy matches between tokens. We study the problem of similarity join using this new similarity metric and present a signature-based method to address this problem. We propose new signature schemes and develop effective pruning techniques to improve the performance. Experimental results show that our approach achieves high efficiency and result quality, and significantly outperforms state-of-the-art methods.
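    The abstract names the metric but does not define it. The following is a minimal sketch of one way a fuzzy-token Jaccard similarity could work, under stated assumptions: tokens are whitespace-separated, two tokens may be matched when their normalized edit similarity is at least a threshold `delta`, and the "fuzzy overlap" is the maximum total similarity over one-to-one token matchings. All function names and the default `delta` are hypothetical and chosen for illustration. The brute-force matching is only suitable for short strings, and the paper's signature schemes and pruning techniques are not shown.

```python
from functools import lru_cache


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def edit_similarity(a: str, b: str) -> float:
    """Normalized edit similarity in [0, 1]; 1.0 means identical tokens."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def fuzzy_overlap(tokens_a, tokens_b, delta: float) -> float:
    """Maximum-weight one-to-one matching between two token lists.

    Assumption: a token pair contributes its edit similarity only when
    that similarity is at least `delta`; otherwise it cannot be matched.
    """
    a, b = list(tokens_a), list(tokens_b)
    weights = []
    for x in a:
        row = []
        for y in b:
            sim = edit_similarity(x, y)
            row.append(sim if sim >= delta else 0.0)
        weights.append(row)

    @lru_cache(maxsize=None)
    def best(i: int, used: int) -> float:
        # `used` is a bitmask of tokens in b that are already matched.
        if i == len(a):
            return 0.0
        result = best(i + 1, used)  # option: leave a[i] unmatched
        for j in range(len(b)):
            if weights[i][j] > 0.0 and not used & (1 << j):
                result = max(result, weights[i][j] + best(i + 1, used | (1 << j)))
        return result

    return best(0, 0)


def fuzzy_jaccard(s1: str, s2: str, delta: float = 0.8) -> float:
    """Jaccard-style ratio with the exact overlap replaced by the fuzzy overlap."""
    t1, t2 = s1.split(), s2.split()
    if not t1 and not t2:
        return 1.0
    overlap = fuzzy_overlap(t1, t2, delta)
    return overlap / (len(t1) + len(t2) - overlap)
```

    With `delta = 1.0` only identical tokens can match, so the sketch falls back to ordinary token overlap. Lowering `delta` lets near-identical tokens such as `mcgrady` and `macgrady` contribute a partial score.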
  • Keywords
    fuzzy set theory; string matching; very large databases; database community; fast-join; fuzzy token matching; signature-based method; similarity metrics; string similarity join; Cleaning; Collaboration; Filtering; Iron; Measurement; Transforms; Upper bound;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Data Engineering (ICDE), 2011 IEEE 27th International Conference on
  • Conference_Location
    Hannover
  • ISSN
    1063-6382
  • Print_ISBN
    978-1-4244-8959-6
  • Electronic_ISBN
    1063-6382
  • Type
    conf
  • DOI
    10.1109/ICDE.2011.5767865
  • Filename
    5767865