DocumentCode :
3105982
Title :
An Algorithm of Detection Duplicate Information Based on Segment
Author :
Liu Zhe ; Zhao Zhi-gang
Author_Institution :
Comput. Center, Shenyang Normal Univ., Shenyang, China
fYear :
2010
fDate :
26-28 Sept. 2010
Firstpage :
156
Lastpage :
159
Abstract :
Detecting and eliminating approximately duplicate records is a hot issue in data cleansing. To recognize duplicate records in multi-language data, a segmentation strategy based on character features is proposed, and an edit-distance algorithm with variable weights is presented. The experimental results indicate that the segmentation time changes little as the data scale grows, meaning that at large data scales the total running time of duplicate-record detection is not affected by segmentation. The results also indicate that the algorithm improves both running efficiency and detection precision.
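The abstract names an edit distance with variable weights as the core matching step. The following is a minimal Python sketch of that general idea, not the authors' implementation: the per-character weighting rule (heavier weight for non-ASCII characters, as one might do for multi-language data) and the duplicate threshold are assumptions for illustration.

# Hypothetical sketch of a variable-weight edit distance (not the paper's exact algorithm).
# Insert/delete/substitute costs depend on a per-character weight, e.g. weighting
# CJK characters differently from ASCII characters in multi-language records.

def char_weight(ch):
    # Assumed weighting rule: non-ASCII (e.g. Chinese) characters get a higher weight.
    return 2.0 if ord(ch) > 127 else 1.0

def weighted_edit_distance(a, b):
    m, n = len(a), len(b)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + char_weight(a[i - 1])          # deletion cost
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + char_weight(b[j - 1])          # insertion cost
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0.0 if a[i - 1] == b[j - 1] else max(char_weight(a[i - 1]), char_weight(b[j - 1]))
            d[i][j] = min(d[i - 1][j] + char_weight(a[i - 1]), # delete
                          d[i][j - 1] + char_weight(b[j - 1]), # insert
                          d[i - 1][j - 1] + sub)               # substitute or match
    return d[m][n]

# Record pairs whose weighted distance falls below a chosen threshold would be
# flagged as approximate duplicates.
if __name__ == "__main__":
    print(weighted_edit_distance("Shenyang Normal Univ.", "Shenyang Normal University"))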
Keywords :
data handling; data warehouses; algorithm running efficiency; character features; data cleansing; data scale growing; data warehouse; detect precision; duplicate information detection; duplicate records; edit distance; multilanguage data; segment strategy; Algorithm design and analysis; Approximation algorithms; Data models; Data warehouses; Databases; Feature extraction; Merging; Approximately duplicate records; Segment; Variable weight; algorithm of edit distance;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
2010 International Conference on Computational Aspects of Social Networks (CASoN)
Conference_Location :
Taiyuan
Print_ISBN :
978-1-4244-8785-1
Type :
conf
DOI :
10.1109/CASoN.2010.42
Filename :
5636818