Title :
Classifying non-Gaussian and mixed data sets in their natural parameter space
Author :
Levasseur, Cécile ; Mayer, Uwe F. ; Kreutz-Delgado, Ken
Author_Institution :
Jacobs Sch. of Eng., Univ. of California, San Diego, La Jolla, CA, USA
Abstract :
We consider the problem of both supervised and unsupervised classification for multidimensional data that are non-Gaussian and of mixed types (continuous and/or discrete). An important subclass of graphical model techniques called generalized linear statistics (GLS) is used to capture the underlying statistical structure of these complex data. GLS exploits the properties of exponential family distributions, which are assumed to describe the data components, and constrains latent variables to a lower dimensional parameter subspace. Based on the latent variable information, classification is performed in the natural parameter subspace with classical statistical techniques. The benefits of decision making in parameter space are illustrated with examples of text categorization on categorical data and of mixed-type data classification. As a text document preprocessing tool, an extension of the conditional mutual information maximization based feature selection algorithm from binary to categorical data is presented.
Keywords :
belief networks; decision making; pattern classification; statistical analysis; text analysis; categorical data text categorization; classical statistical techniques; conditional mutual information maximization; directed graph; exponential family distribution; feature selection algorithm; generalized linear statistics; graphical model techniques; lower dimensional parameter subspace; mixed data sets; mixed-type data classification; multidimensional data; natural parameter space; non-Gaussian classification; statistical structure; supervised classification; text document preprocessing tool; unsupervised classification; Cities and towns; Data engineering; Graphical models; Jacobian matrices; Mathematics; Multidimensional systems; Principal component analysis; Statistics; Subspace constraints; Text categorization
Conference_Title :
Machine Learning for Signal Processing, 2009. MLSP 2009. IEEE International Workshop on
Conference_Location :
Grenoble
Print_ISBN :
978-1-4244-4947-7
Electronic_ISBN :
978-1-4244-4948-4
DOI :
10.1109/MLSP.2009.5306227