Title :
Classification using pattern probability estimators
Author :
Acharya, Jayadev ; Das, Hirakendu ; Orlitsky, Alon ; Pan, Shengjun ; Santhanam, Narayana P.
Author_Institution :
ECE, UCSD, La Jolla, CA, USA
Abstract :
We consider the problem of classification, where the data of the classes are generated i.i.d. according to unknown probability distributions. The goal is to classify test data with minimum error probability, based on the training data available for the classes. The Likelihood Ratio Test (LRT) is the optimal decision rule when the distributions are known. Hence, a popular approach for classification is to estimate the likelihoods using well-known probability estimators, e.g., the Laplace and Good-Turing estimators, and use them in an LRT. We are primarily interested in situations where the alphabet of the underlying distributions is large compared to the training data available, which is indeed the case in most practical applications. We motivate and propose LRTs based on pattern probability estimators that are known to achieve low redundancy for universal compression of large alphabet sources. While a complete proof for optimality of these decision rules is warranted, we demonstrate their performance and compare it with other well-known classifiers by various experiments on synthetic data and real data for text classification.
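The baseline approach mentioned in the abstract (plugging estimated likelihoods into an LRT) can be illustrated with a minimal sketch. It uses the Laplace (add-one) estimator, not the pattern probability estimators the paper proposes. The function names, the known `alphabet_size`, the `beta` smoothing constant, and the zero threshold (equal priors) are assumptions made for illustration and do not come from the paper.

```python
import math
from collections import Counter


def laplace_log_likelihood(test, train, alphabet_size, beta=1.0):
    """Log-probability of `test` under an add-beta estimate fit on `train`.

    Assumes the alphabet size is known; beta=1 gives the Laplace estimator.
    """
    counts = Counter(train)
    denom = len(train) + beta * alphabet_size
    # Counter returns 0 for unseen symbols, so they get probability beta/denom.
    return sum(math.log((counts[x] + beta) / denom) for x in test)


def lrt_classify(test, train_a, train_b, alphabet_size, threshold=0.0):
    """Return "A" if the estimated log-likelihood ratio exceeds `threshold`, else "B"."""
    llr = (laplace_log_likelihood(test, train_a, alphabet_size)
           - laplace_log_likelihood(test, train_b, alphabet_size))
    return "A" if llr > threshold else "B"


# Toy data over the hypothetical alphabet {a, b, c, d}.
train_a = list("aaaabbc")
train_b = list("cccddab")
label = lrt_classify(list("aab"), train_a, train_b, alphabet_size=4)
```

With a large alphabet and little training data, most symbols are unseen and the `beta / denom` term dominates the estimate, which is the regime the abstract singles out as problematic for such estimators.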
Keywords :
data compression; error statistics; pattern classification; statistical distributions; text analysis; Laplace estimators; Good-Turing estimators; large alphabet source universal compression; likelihood ratio test; minimum error probability; optimal decision rule; pattern probability estimators; probability distributions; text classification; Error probability; Information theory; Light rail systems; Machine learning; Optical character recognition software; Probability distribution; Redundancy; Testing; Text categorization; Training data;
Conference_Title :
Information Theory Proceedings (ISIT), 2010 IEEE International Symposium on
Conference_Location :
Austin, TX
Print_ISBN :
978-1-4244-7890-3
Electronic_ISBN :
978-1-4244-7891-0
DOI :
10.1109/ISIT.2010.5513570