DocumentCode :
1126008
Title :
Toward unsupervised correlation preserving discretization
Author :
Mehta, Sameep ; Parthasarathy, Srinivasan ; Yang, Hui
Author_Institution :
Dept. of Comput. Sci. & Eng., Ohio State Univ., Columbus, OH, USA
Volume :
17
Issue :
9
fYear :
2005
Firstpage :
1174
Lastpage :
1185
Abstract :
Discretization is a crucial preprocessing technique used for a variety of data warehousing and mining tasks. In this paper, we present a novel PCA-based unsupervised algorithm for the discretization of continuous attributes in multivariate data sets. The algorithm leverages the underlying correlation structure in the data set to obtain the discrete intervals and ensures that the inherent correlations are preserved. Previous efforts on this problem are largely supervised and consider only piecewise correlation among attributes. We consider the correlation among continuous attributes and, at the same time, also take into account the interactions between continuous and categorical attributes. Our approach also extends easily to data sets containing missing values. We demonstrate the efficacy of the approach on real data sets and as a preprocessing step for both classification and frequent itemset mining tasks. We show that the intervals are meaningful and can uncover hidden patterns in data. We also show that large compression factors can be obtained on the discretized data sets. The approach is task independent, i.e., the same discretized data set can be used for different data mining tasks. Thus, the data sets can be discretized, compressed, and stored once and can be used again and again.
Keywords :
data compression; data mining; data warehouses; pattern classification; principal component analysis; unsupervised learning; PCA-based unsupervised algorithm; categorical attributes; continuous attributes discretization; data preprocessing technique; frequent itemset mining tasks; multivariate data sets; piecewise correlation; unsupervised correlation preserving discretization; classification tree analysis; data preprocessing; databases; decision trees; discrete transforms; itemsets; warehousing; data mining/summarization; missing data
fLanguage :
English
Journal_Title :
Knowledge and Data Engineering, IEEE Transactions on
Publisher :
IEEE
ISSN :
1041-4347
Type :
jour
DOI :
10.1109/TKDE.2005.153
Filename :
1490525