Title :
An Algorithm of Detection Duplicate Information Based on Segment
Author :
Liu Zhe ; Zhao Zhi-gang
Author_Institution :
Comput. Center, Shenyang Normal Univ., Shenyang, China
Abstract :
Detecting and eliminating approximately duplicate records is a hot issue in data cleansing. To recognize duplicate records in multi-language data, a segment strategy based on character features is proposed, and an edit distance algorithm with variable weights is presented. The experimental results indicate that the segmentation time changes little as the data scale grows, which means that the total running time of duplicate record detection is not influenced by segmentation at large data scales. The experimental results also indicate that the algorithm's running efficiency and detection precision can be improved.
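The abstract does not give the details of the segment strategy or the weighting scheme, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes that "character features" can be approximated by coarse character classes (CJK ideograph, digit, letter) and that "variable weight" means per-operation costs in a standard dynamic-programming edit distance. The names `char_class`, `segment`, and `weighted_edit_distance`, and all cost values, are hypothetical.

```python
from itertools import groupby


def char_class(ch):
    """Coarse character feature (hypothetical classes, not from the paper)."""
    if "\u4e00" <= ch <= "\u9fff":  # CJK Unified Ideographs block
        return "cjk"
    if ch.isdigit():
        return "digit"
    if ch.isalpha():
        return "alpha"
    return "other"


def segment(text):
    """Split text into runs of same-class characters, dropping 'other' runs."""
    return [
        (cls, "".join(run))
        for cls, run in groupby(text, key=char_class)
        if cls != "other"
    ]


def weighted_edit_distance(a, b, ins=1.0, dele=1.0, sub=1.0):
    """Edit distance where each operation type has its own (assumed) cost."""
    # prev[j] holds the cost of turning the processed prefix of a into b[:j].
    prev = [j * ins for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [i * dele]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + dele,                           # delete ca
                curr[j - 1] + ins,                        # insert cb
                prev[j - 1] + (0.0 if ca == cb else sub),  # match or substitute
            ))
        prev = curr
    return prev[-1]
```

Under these assumptions, two records could be segmented first and then compared segment by segment, with different costs chosen per segment class; how the paper actually assigns weights is not stated in the abstract.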
Keywords :
data handling; data warehouses; algorithm running efficiency; character features; data cleansing; data scale growing; data warehouse; detect precision; duplicate information detection; duplicate records; edit distance; multilanguage data; segment strategy; Algorithm design and analysis; Approximation algorithms; Data models; Data warehouses; Databases; Feature extraction; Merging; Approximately duplicate records; Segment; Variable weight; algorithm of edit distance;
Conference_Titel :
Computational Aspects of Social Networks (CASoN), 2010 International Conference on
Conference_Location :
Taiyuan
Print_ISBN :
978-1-4244-8785-1
DOI :
10.1109/CASoN.2010.42