Regional Information Center for Science and Technology (RICeST) - An Algorithm of Detection Duplicate Information Based on Segment

DocumentCode :

3105982

Title :

An Algorithm of Detection Duplicate Information Based on Segment

Author :

Liu Zhe ; Zhao Zhi-gang

Author_Institution :

Comput. Center, Shenyang Normal Univ., Shenyang, China

fYear :

2010

fDate :

26-28 Sept. 2010

Firstpage :

156

Lastpage :

159

Abstract :

Detecting and eliminating approximately duplicate records is a hot issue in data cleansing. To recognize duplicate records in multi-language data, a segment strategy based on character features is proposed, and an edit distance algorithm with variable weight is presented. The experimental results indicate that segment time changes little as the data scale grows, which means that segmentation does not influence the total running time of duplicate record detection at large data scales. The experimental results also indicate that the running efficiency and detection precision of the algorithm can be improved.
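
The abstract names two ingredients, segmentation by character features and an edit distance with variable weight, without giving their details. The following Python sketch illustrates the general idea only and is not the authors' algorithm. The helper names (`char_class`, `segment_by_char_class`, `weighted_edit_distance`, `case_cost`), the choice of character classes, and all cost values are assumptions made for illustration.

```python
import unicodedata
from itertools import groupby


def char_class(ch):
    """Coarse character class (assumed feature set, not from the paper)."""
    if ch.isdigit():
        return "digit"
    if ch.isspace():
        return "space"
    if ch.isalpha():
        # First word of the Unicode name, e.g. "LATIN" or "CJK".
        return unicodedata.name(ch, "UNKNOWN").split()[0]
    return "other"


def segment_by_char_class(text):
    """Split text into runs of same-class characters, dropping whitespace."""
    return [
        (cls, "".join(chars))
        for cls, chars in groupby(text, key=char_class)
        if cls != "space"
    ]


def weighted_edit_distance(a, b, sub_cost=None, ins_cost=1.0, del_cost=1.0):
    """Edit distance whose operation costs can vary (hypothetical weights)."""
    if sub_cost is None:
        sub_cost = lambda x, y: 0.0 if x == y else 1.0
    # prev[j] holds the distance between a[:i-1] and b[:j].
    prev = [j * ins_cost for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [i * del_cost]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + del_cost,            # delete ca
                curr[j - 1] + ins_cost,        # insert cb
                prev[j - 1] + sub_cost(ca, cb) # substitute or match
            ))
        prev = curr
    return prev[-1]


def case_cost(x, y):
    """Example weight: a case-only difference costs less than a real change."""
    if x == y:
        return 0.0
    return 0.5 if x.lower() == y.lower() else 1.0
```

In a sketch like this, two records could be segmented first and then compared segment by segment, so that each comparison runs on short strings of a single character class; how the paper actually combines the two steps is not stated in the abstract.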

Keywords :

data handling; data warehouses; algorithm running efficiency; character features; data cleansing; data scale growing; data warehouse; detect precision; duplicate information detection; duplicate records; edit distance; multilanguage data; segment strategy; Algorithm design and analysis; Approximation algorithms; Data models; Data warehouses; Databases; Feature extraction; Merging; Approximately duplicate records; Segment; Variable weight; algorithm of edit distance;

fLanguage :

English

Publisher :

IEEE

Conference_Titel :

Computational Aspects of Social Networks (CASoN), 2010 International Conference on

Conference_Location :

Taiyuan

Print_ISBN :

978-1-4244-8785-1

Type :

conf

DOI :

10.1109/CASoN.2010.42

Filename :

5636818

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=3105982