Title :
Improving similarity join algorithms using vertical clustering techniques
Author :
Tan, Lisa ; Fotouhi, Farshad ; Grosky, William
Author_Institution :
Dept. of Comput. Sci., Wayne State Univ., Detroit, MI, USA
Abstract :
Strings are a primary data format in the majority of applications. With the rapid growth of diverse data-driven applications in the current information era, retrieving string data from heterogeneous structured sources is becoming increasingly significant and challenging. The main concern is that duplicate records are created when data is integrated from heterogeneous sources. These duplicate records represent the same real-world entity but differ because of inconsistent values and naming conventions, incorrect or missing data values, or incomplete information. Existing approaches assume that a given group of related attributes will participate in the similarity join operation. In this paper, we propose a pre-processing technique to improve existing similarity join techniques. Assuming relational data sources, our approach identifies groups of related attributes such that, when a similarity join is applied to them, false positives and false negatives are reduced and precision and F-measure increase.
Keywords :
data handling; data mining; pattern clustering; duplicate records; relational data sources; similarity join algorithm; vertical clustering technique; Application software; Clustering algorithms; Computer science; Histograms; Hospitals; Information retrieval; Information science; Query processing; Relational databases; Telephony;
Conference_Title :
Applications of Digital Information and Web Technologies, 2009. ICADIWT '09. Second International Conference on the
Conference_Location :
London
Print_ISBN :
978-1-4244-4456-4
Electronic_ISBN :
978-1-4244-4457-1
DOI :
10.1109/ICADIWT.2009.5273906