• DocumentCode
    1656426
  • Title

    Intelligent Similarity Joins for Big Data Integration

  • Author

    Mian Wang ; Tiezheng Nie ; Derong Shen ; Yue Kou ; Ge Yu

  • Author_Institution
    Coll. of Inf. Sci. & Eng., Northeastern Univ., Shenyang, China
  • fYear
    2013
  • Firstpage
    383
  • Lastpage
    388
  • Abstract
    With the growing volume of data, record linkage has become a challenge for big data integration. Similarity join is an efficient approach to record linkage, but it is difficult to perform in a single-node environment. In this paper, we propose a MapReduce-based framework for set similarity join. The framework improves efficiency in two ways: reducing candidate pairs and balancing load. To reduce candidate pairs, we propose algorithms that combine multiple filtering principles: a length filter, a prefix filter, and a position filter. The load-balancing techniques address skewed data and decrease the replication transfer volume. Experimental results on a real dataset show that our approach can achieve a speed-up over previous algorithms on big data.
  • Keywords
    Big Data; data integration; information filtering; resource allocation; Big Data integration; MapReduce; candidate pairs reduction; filtering principles; intelligent similarity join; length filter; load balance; position filter; prefix filter; real dataset; record linkage; replication transfer volume; set similarity join; skew data; Algorithm design and analysis; Data models; Filtering algorithms; Information filters; Information management; similarity join
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Web Information System and Application Conference (WISA), 2013 10th
  • Conference_Location
    Yangzhou
  • Print_ISBN
    978-1-4799-3218-4
  • Type

    conf

  • DOI
    10.1109/WISA.2013.79
  • Filename
    6778670