Title :
An Efficient Duplicate Detection System for XML Documents
Author :
Lwin, Thandar ; Nyunt, Thi Thi Soe
Author_Institution :
Univ. of Comput. Studies, Yangon, Myanmar
Abstract :
Duplicate detection, an important subtask of data cleaning, is the task of identifying multiple representations of the same real-world object and is necessary to improve data quality. Numerous approaches exist for both relational and XML data. As XML becomes increasingly popular for data exchange and data publishing on the Web, algorithms to detect duplicates in XML documents are required. Previous domain-independent solutions to this problem relied on standard textual similarity functions (e.g., edit distance, cosine metric) between objects. However, such approaches produce large numbers of false positives when domain-specific abbreviations and conventions must be identified. In this paper, we present a duplicate detection process comprising three modules: a selector, a preprocessor, and a duplicate identifier. The process takes XML documents and a candidate definition as input and produces duplicate objects as output. The aim of this research is to develop an efficient algorithm for detecting duplicates in complex XML documents and to reduce the number of false positives by using the MD5 algorithm. We illustrate the efficiency of this approach on several real-world datasets.
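The abstract does not give implementation details, so the following is only a minimal sketch of how a selector, preprocessor, and MD5-based duplicate identifier could fit together. It assumes the candidate definition is a plain element tag name and that two objects count as duplicates when their normalized content hashes to the same MD5 digest. All function names, the normalization rules, and the sample data are hypothetical and are not taken from the paper.

```python
import hashlib
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sample input: entries 0 and 2 describe the same object
# but differ in child order, letter case, and whitespace.
SAMPLE = """<movies>
  <movie><title>The Matrix</title><year>1999</year></movie>
  <movie><title>Heat</title><year>1995</year></movie>
  <movie><year>1999</year><title>the  matrix </title></movie>
</movies>"""


def select_candidates(root, candidate_tag):
    """Selector: collect every element matching the candidate definition
    (assumed here to be a tag name)."""
    return list(root.iter(candidate_tag))


def preprocess(element):
    """Preprocessor: flatten an element into a normalized string that
    ignores case, extra whitespace, and child order."""
    parts = []
    for node in element.iter():
        text = " ".join((node.text or "").split()).lower()
        if text:
            parts.append(f"{node.tag}={text}")
    return "|".join(sorted(parts))


def find_duplicates(xml_text, candidate_tag):
    """Duplicate identifier: group candidate positions whose normalized
    form yields the same MD5 digest."""
    root = ET.fromstring(xml_text)
    groups = defaultdict(list)
    for index, element in enumerate(select_candidates(root, candidate_tag)):
        digest = hashlib.md5(preprocess(element).encode("utf-8")).hexdigest()
        groups[digest].append(index)
    # Only digests shared by two or more candidates indicate duplicates.
    return [indexes for indexes in groups.values() if len(indexes) > 1]


if __name__ == "__main__":
    print(find_duplicates(SAMPLE, "movie"))  # [[0, 2]]
```

In this sketch MD5 serves only as a content fingerprint: comparing digests replaces pairwise string comparison, but it matches only objects that are identical after normalization, so any tolerance for abbreviations or conventions would have to come from the preprocessing step.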
Keywords :
Internet; XML; document handling; electronic data interchange; Web; XML Documents; XML data; data cleaning; data exchange; data publishing; data quality; duplicate detection system; duplicate identifier; preprocessor; selector; Application software; Cleaning; Computer applications; Couplings; Data engineering; Data preprocessing; Databases; Object detection; Publishing; Data Cleaning; Duplicate Detection; MD5 Algorithm
Conference_Title :
Computer Engineering and Applications (ICCEA), 2010 Second International Conference on
Conference_Location :
Bali Island
Print_ISBN :
978-1-4244-6079-3
Electronic_ISBN :
978-1-4244-6080-9
DOI :
10.1109/ICCEA.2010.189