DocumentCode
610407
Title
Automating pattern discovery for rule based data standardization systems
Author
Chaturvedi, Sushil ; Prasad, K.H. ; Faruquie, T.A. ; Chawda, B.S. ; Subramaniam, L. Venkata ; Krishnapuram, R.
Author_Institution
IBM Res. - India, New Delhi, India
fYear
2013
fDate
8-12 April 2013
Firstpage
1231
Lastpage
1241
Abstract
Data quality is a perennial problem for many enterprise data assets. To improve data quality, businesses often employ rule-based data standardization systems in which domain experts code rules for handling important and prevalent patterns. Finding these patterns is laborious and time-consuming, particularly for noisy or highly specialized data sets. It is also subjective, since the result depends on the people who determine the patterns. In this paper we present a tool that automatically mines patterns to help improve the efficiency and effectiveness of these data standardization systems. Domain and knowledge experts then use the automatically extracted patterns for rule writing. We use a greedy algorithm to extract patterns that result in maximal coverage of the data. We further group the extracted patterns so that each group represents patterns that capture similar domain knowledge, and we propose a similarity measure that uses input pattern semantics to form these groups. We demonstrate the effectiveness of our method for standardization tasks on three real-world datasets.
Keywords
business data processing; data mining; greedy algorithms; knowledge based systems; standardisation; automatically pattern mining; data quality; domain experts; domain knowledge; enterprise data assets; greedy algorithm; knowledge experts; pattern discovery automation; pattern extraction; pattern handling; perennial problem; real world datasets; rule based data standardization systems; rule writing; similarity measure; Buildings; Data mining; Noise measurement; Semantics; Standards; Writing;
fLanguage
English
Publisher
IEEE
Conference_Titel
Data Engineering (ICDE), 2013 IEEE 29th International Conference on
Conference_Location
Brisbane, QLD
ISSN
1063-6382
Print_ISBN
978-1-4673-4909-3
Electronic_ISBN
1063-6382
Type
conf
DOI
10.1109/ICDE.2013.6544912
Filename
6544912