• DocumentCode
    1864079
  • Title

    A Unified Record Linkage Strategy for Web Service Data

  • Author

    Kan Qin ; Yang, Yujiu ; Zhen, Shiqiang ; Liu, Wenhuang

  • Author_Institution
    Div. of Inf., Tsinghua Univ., Shenzhen, China
  • fYear
    2010
  • fDate
    9-10 Jan. 2010
  • Firstpage
    253
  • Lastpage
    256
  • Abstract
    Record linkage, also known as duplicate detection, is a key process for ensuring the quality of stored Web service data. Given two lists of records, record linkage consists of determining all pairs that are similar to each other, where the overall similarity between two records is defined based on domain-specific similarities over the individual attributes constituting the record. In this paper, we present a unified framework for recognizing clusters of near-duplicate records in multi-language data, especially Chinese/English mixed Web data. The key ideas are: (1) pre-processing multi-language data using Chinese word segmentation and Chinese named entity recognition techniques; (2) a pair-wise comparison method based on domain-specific similarities, especially the string kernel method; (3) a priority queue of duplicate clusters and a representative-records strategy to respond adaptively to the data scale. Experiments on real databases show that the proposed record linkage strategy is efficient and effective.
  • Keywords
    Web services; data analysis; natural language processing; Chinese named entity recognition; Chinese words segmentation; Chinese/English mixed web data; clusters recognition; data quality; domain-specific similarities; duplicate detection; pairwise comparison method; preprocessing multilanguage data; string kernel method; unified framework; unified record linkage strategy; web service data; Clustering methods; Couplings; Data mining; Databases; Informatics; Kernel; Natural languages; Resumes; Storage automation; Web services;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Knowledge Discovery and Data Mining, 2010. WKDD '10. Third International Conference on
  • Conference_Location
    Phuket
  • Print_ISBN
    978-1-4244-5397-9
  • Electronic_ISBN
    978-1-4244-5398-6
  • Type

    conf

  • DOI
    10.1109/WKDD.2010.134
  • Filename
    5432640
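
The third idea in the abstract, a priority queue of duplicate clusters with representative records, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the threshold, the queue size, and the use of `difflib.SequenceMatcher` in place of the string kernel similarity are all assumptions made for illustration. Each incoming record is compared only against the representative of each recently matched cluster, which is how such a strategy keeps the number of comparisons bounded as the data grows.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; the paper uses domain-specific
    measures such as a string kernel (not reproduced here)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def link_records(records, threshold=0.8, max_clusters=4):
    """Group near-duplicate strings into clusters.

    `threshold` and `max_clusters` are hypothetical parameters.
    Each cluster is a (representative, members) pair; the
    representative is simply the first record seen for that cluster.
    """
    queue = []      # most recently matched cluster first
    finished = []   # clusters evicted from the bounded queue
    for rec in records:
        for i, (rep, members) in enumerate(queue):
            # Compare against the representative only, not every member.
            if similarity(rec, rep) >= threshold:
                members.append(rec)
                queue.insert(0, queue.pop(i))  # promote to the front
                break
        else:
            # No cluster matched: start a new one at the front.
            queue.insert(0, (rec, [rec]))
            if len(queue) > max_clusters:
                finished.append(queue.pop())   # evict the stalest cluster
    finished.extend(queue)
    return [members for _, members in finished]


clusters = link_records(
    ["Web Service Data", "Record Linkage", "web service data", "record linkage"]
)
```

A small `max_clusters` trades recall for speed: a duplicate that arrives after its cluster has been evicted starts a new cluster instead of joining the old one.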