DocumentCode
1864079
Title
A Unified Record Linkage Strategy for Web Service Data
Author
Kan Qin ; Yang, Yujiu ; Zhen, Shiqiang ; Liu, Wenhuang
Author_Institution
Div. of Inf., Tsinghua Univ., Shenzhen, China
fYear
2010
fDate
9-10 Jan. 2010
Firstpage
253
Lastpage
256
Abstract
Record linkage, also known as duplicate detection, is a key process for ensuring the quality of Web service data. Given two lists of records, record linkage determines all pairs that are similar to each other, where the overall similarity between two records is defined by domain-specific similarities over the individual attributes that make up each record. In this paper, we present a unified framework for recognizing clusters of near-duplicate records in multi-language data, especially Chinese/English mixed Web data. The key ideas are: (1) pre-processing multi-language data using Chinese word segmentation and Chinese named entity recognition techniques; (2) a pair-wise comparison method based on domain-specific similarities, in particular the string kernel method; (3) a priority queue of duplicate clusters and a representative-records strategy that adapts to the data scale. Experiments on real databases show that the proposed record linkage strategy is efficient and effective.
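The abstract does not give implementation details, so the following is only a minimal sketch of how ideas (2) and (3) might fit together, under stated assumptions: a normalized p-spectrum kernel over character p-grams stands in for "the string kernel method", the first member of each cluster serves as its representative record, and the priority queue is a bounded move-to-front list. The names `spectrum_kernel` and `link_records`, the threshold, and the queue size are all hypothetical and not taken from the paper. Word segmentation and named entity recognition (idea 1) are left out.

```python
from collections import Counter, deque
from math import sqrt


def spectrum_kernel(a: str, b: str, p: int = 2) -> float:
    """Normalized p-spectrum kernel: cosine similarity of character p-gram counts.

    Assumption: this is one common string kernel, chosen for illustration only.
    It works on any Unicode text, so mixed Chinese/English strings are accepted.
    """
    grams_a = Counter(a[i:i + p] for i in range(len(a) - p + 1))
    grams_b = Counter(b[i:i + p] for i in range(len(b) - p + 1))
    dot = sum(count * grams_b[gram] for gram, count in grams_a.items())
    sq_a = sum(count * count for count in grams_a.values())
    sq_b = sum(count * count for count in grams_b.values())
    norm = sqrt(sq_a * sq_b)
    return dot / norm if norm else 0.0


def link_records(records, threshold=0.6, queue_size=4):
    """Group near-duplicate strings using a bounded priority queue of clusters.

    Each incoming record is compared only with the representative (first member)
    of the clusters currently in the queue. A matched or newly created cluster
    moves to the front; when the queue is full, the cluster at the back is
    dropped from the queue but kept in the result.
    Assumes queue_size >= 1.
    """
    clusters = []      # every cluster ever created, each a list of records
    queue = deque()    # indices into clusters, highest priority first
    for record in records:
        match = None
        for idx in queue:
            representative = clusters[idx][0]
            if spectrum_kernel(record, representative) >= threshold:
                match = idx
                break
        if match is None:
            clusters.append([record])
            match = len(clusters) - 1
            if len(queue) == queue_size:
                queue.pop()          # evict the lowest-priority cluster
        else:
            clusters[match].append(record)
            queue.remove(match)      # re-inserted at the front below
        queue.appendleft(match)
    return clusters
```

Bounding the queue keeps the number of comparisons per record constant regardless of how many clusters exist, which is one plausible reading of "respond adaptively to the data scale"; the trade-off is that a duplicate of an evicted cluster starts a new cluster instead of joining the old one.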
Keywords
Web services; data analysis; natural language processing; Chinese named entity recognition; Chinese words segmentation; Chinese/English mixed web data; clusters recognition; data quality; domain-specific similarities; duplicate detection; pairwise comparison method; preprocessing multilanguage data; string kernel method; unified framework; unified record linkage strategy; web service data; Clustering methods; Couplings; Data mining; Databases; Informatics; Kernel; Natural languages; Resumes; Storage automation; Web services;
fLanguage
English
Publisher
IEEE
Conference_Titel
Knowledge Discovery and Data Mining, 2010. WKDD '10. Third International Conference on
Conference_Location
Phuket
Print_ISBN
978-1-4244-5397-9
Electronic_ISBN
978-1-4244-5398-6
Type
conf
DOI
10.1109/WKDD.2010.134
Filename
5432640