DocumentCode :
3305771
Title :
Reducing inconsistency in integrating data from different sources
Author :
Luján-Mora, Sergio ; Palomar, Manuel
Author_Institution :
Dept. de Lenguajes y Sistemas Inf., Alicante Univ., Spain
fYear :
2001
fDate :
2001
Firstpage :
209
Lastpage :
218
Abstract :
One of the main problems in integrating databases into a common repository is the possible inconsistency of the values stored in them, i.e., the very same term may have different values due to misspelling, a permuted word order, spelling variants and so on. We present an automatic method for reducing the inconsistency found in existing databases and thus improving data quality. All values that refer to the same term are clustered by measuring their degree of similarity. The clustered values can then be assigned a common value that, in principle, could be substituted for the original values. We evaluate four different similarity measures for clustering, with and without expansion of abbreviations. The proposed method may work well in practice, but it is time-consuming. To mitigate this cost, we remove stop words to speed up the clustering.
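The clustering idea summarized in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not name the four similarity measures, so `difflib.SequenceMatcher` is used here as a stand-in, and the stop-word list, the `0.8` threshold, the greedy single-pass clustering, and the helper names (`normalize`, `similarity`, `cluster`) are all assumptions made for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical stop-word list; the abstract does not say which words were removed.
STOP_WORDS = {"of", "the", "and", "de", "for"}


def normalize(value: str) -> str:
    """Lowercase, drop stop words, and sort tokens so word order is ignored."""
    cleaned = value.lower().replace(".", " ").replace(",", " ")
    tokens = [t for t in cleaned.split() if t not in STOP_WORDS]
    return " ".join(sorted(tokens))


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two normalized values (stand-in measure)."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def cluster(values, threshold=0.8):
    """Greedy single-pass clustering (an assumed strategy): a value joins the
    first cluster whose first member is similar enough, else starts a new one."""
    clusters = []
    for value in values:
        for members in clusters:
            if similarity(value, members[0]) >= threshold:
                members.append(value)
                break
        else:
            clusters.append([value])
    return clusters


names = [
    "University of Alicante",
    "Alicante University",            # permuted word order
    "Univeristy of Alicante",         # misspelling
    "Grenoble Institute of Technology",
]
clusters = cluster(names)
```

With these sample values, the three Alicante variants fall into one cluster and the Grenoble entry forms its own. Each cluster could then be mapped to a single common value, for example its first or most frequent member.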
Keywords :
data integrity; database management systems; pattern clustering; string matching; word processing; abbreviation expansion; automatic method; clustered values; common repository; common value; data clustering; data integration; data quality; databases; inconsistency reduction; misspelling; permuted word order; similarity degree; similarity measures; spelling variants; stop words; Cleaning; Data warehouses; Decision making; Information systems; Proposals; Relational databases;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Database Engineering and Applications, 2001 International Symposium on.
Conference_Location :
Grenoble
Print_ISBN :
0-7695-1140-6
Type :
conf
DOI :
10.1109/IDEAS.2001.938087
Filename :
938087