Title :
A Unified Record Linkage Strategy for Web Service Data
Author :
Kan Qin ; Yang, Yujiu ; Zhen, Shiqiang ; Liu, Wenhuang
Author_Institution :
Div. of Inf., Tsinghua Univ., Shenzhen, China
Abstract :
Record linkage, also known as duplicate detection, is a key process for ensuring the quality of stored Web service data. Given two lists of records, record linkage consists of determining all pairs that are similar to each other, where the overall similarity between two records is defined by domain-specific similarities over the individual attributes that constitute a record. In this paper, we present a unified framework for recognizing clusters of near-duplicate records in multi-language data, especially Chinese/English mixed Web data. The key ideas are: (1) pre-processing multi-language data using Chinese word segmentation and Chinese named entity recognition techniques; (2) a pair-wise comparison method based on domain-specific similarities, in particular the string kernel method; (3) a priority queue of duplicate clusters and a representative-records strategy that adapts to the scale of the data. Experiments on real databases show that the proposed record linkage strategy is efficient and effective.
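The abstract does not give the details of the method, so the following is only a minimal sketch of how ideas (2) and (3) might fit together, under stated assumptions. It swaps in a character n-gram spectrum kernel as a simple stand-in for the paper's string kernel, and a bounded, most-recently-matched-first list as a stand-in for the priority queue of clusters. All names and parameters (`spectrum_similarity`, `link_records`, `threshold`, `queue_size`, `max_reps`) are hypothetical, and the pre-processing step (word segmentation, named entity recognition) is omitted.

```python
from collections import Counter
from math import sqrt


def ngram_profile(text, n=2):
    """Count the character n-grams of a string (works for Chinese and English)."""
    text = text.lower().strip()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def spectrum_similarity(a, b, n=2):
    """Normalized n-gram spectrum kernel: k(a, b) / sqrt(k(a, a) * k(b, b)).

    Assumption: a stand-in for the string kernel named in the abstract.
    """
    pa, pb = ngram_profile(a, n), ngram_profile(b, n)
    dot = sum(count * pb[gram] for gram, count in pa.items())
    norm = sqrt(sum(c * c for c in pa.values()) * sum(c * c for c in pb.values()))
    return dot / norm if norm else 0.0


def record_similarity(r1, r2, weights):
    """Weighted average of per-attribute similarities (weights are hypothetical)."""
    total = sum(weights.values())
    return sum(w * spectrum_similarity(r1[f], r2[f]) for f, w in weights.items()) / total


def link_records(records, weights, threshold=0.6, queue_size=4, max_reps=2):
    """Single pass over the records, comparing each one only against the
    representative records of the clusters currently held in a bounded queue."""
    queue = []     # active clusters, most recently matched first
    clusters = []  # every cluster ever created, in creation order
    for idx, rec in enumerate(records):
        match = None
        for cluster in queue:
            if any(record_similarity(rec, records[rep], weights) >= threshold
                   for rep in cluster["reps"]):
                match = cluster
                break
        if match is None:
            # No active cluster is similar enough: start a new one.
            match = {"members": [], "reps": []}
            clusters.append(match)
        else:
            # Take the matched cluster out so it can be moved to the front.
            queue = [c for c in queue if c is not match]
        match["members"].append(idx)
        if len(match["reps"]) < max_reps:
            match["reps"].append(idx)
        queue.insert(0, match)
        del queue[queue_size:]  # evict the least recently matched clusters
    return [c["members"] for c in clusters]
```

Calling `link_records(records, weights)` on a list of dictionaries with string-valued fields returns clusters as lists of record indices. Bounding the queue and the number of representatives keeps the number of comparisons per record constant, which is one plausible reading of "adapts to the scale of the data"; the trade-off is that a record can miss a duplicate whose cluster has already been evicted.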
Keywords :
Web services; data analysis; natural language processing; Chinese named entity recognition; Chinese words segmentation; Chinese/English mixed web data; clusters recognition; data quality; domain-specific similarities; duplicate detection; pairwise comparison method; preprocessing multilanguage data; string kernel method; unified framework; unified record linkage strategy; web service data; Clustering methods; Couplings; Data mining; Databases; Informatics; Kernel; Natural languages; Resumes; Storage automation
Conference_Title :
Knowledge Discovery and Data Mining, 2010. WKDD '10. Third International Conference on
Conference_Location :
Phuket
Print_ISBN :
978-1-4244-5397-9
Electronic_ISBN :
978-1-4244-5398-6
DOI :
10.1109/WKDD.2010.134