Title :
Scoring levels of categorical variables with heterogeneous data
Author :
Tuv, Eugene ; Runger, George C.
Author_Institution :
Anal. & Control Technol., Intel Corp., Chandler, AZ, USA
Abstract :
Heterogeneous (mixed-type) data present significant challenges in both supervised and unsupervised learning. The situation is even more complicated when nominal variables have several levels (values) that make using indicator variables (for every categorical level) infeasible. With unsupervised learning, several fairly involved, computationally intensive, nonlinear multivariate techniques iteratively alternate data transformations with optimal scoring. These seek to optimize an objective on the basis of a covariance matrix. Our goal is to find a computationally efficient and flexible method for mapping categorical variables to numeric scores in mixed-type data. We attempt to go beyond optimizing second-order statistics (such as covariance) and enable distance-based methods by exploring mutual relationships or bumps of dependencies between variables. This is a new objective for a scoring method that's based on patterns learned from all the available variables.
Keywords :
distributed databases; optimisation; regression analysis; statistics; unsupervised learning; categorical variable; distance-based method; heterogeneous mixed-type data; nonlinear multivariate technique; scoring level; second-order statistics optimization; supervised learning; Classification tree analysis; Covariance matrix; Density functional theory; Function approximation; Independent component analysis; Multidimensional systems; Optimization methods; Regression tree analysis
Journal_Title :
Intelligent Systems, IEEE
DOI :
10.1109/MIS.2004.1274906