• DocumentCode
    3066499
  • Title

    Classification using pattern probability estimators

  • Author

    Acharya, Jayadev ; Das, Hirakendu ; Orlitsky, Alon ; Pan, Shengjun ; Santhanam, Narayana P.

  • Author_Institution
    ECE, UCSD, La Jolla, CA, USA
  • fYear
    2010
  • fDate
    13-18 June 2010
  • Firstpage
    1493
  • Lastpage
    1497
  • Abstract
    We consider the problem of classification, where the data of the classes are generated i.i.d. according to unknown probability distributions. The goal is to classify test data with minimum error probability, based on the training data available for the classes. The Likelihood Ratio Test (LRT) is the optimal decision rule when the distributions are known. Hence, a popular approach for classification is to estimate the likelihoods using well-known probability estimators, e.g., the Laplace and Good-Turing estimators, and use them in an LRT. We are primarily interested in situations where the alphabet of the underlying distributions is large compared to the training data available, which is indeed the case in most practical applications. We motivate and propose LRTs based on pattern probability estimators that are known to achieve low redundancy for universal compression of large-alphabet sources. While a complete proof for optimality of these decision rules is warranted, we demonstrate their performance and compare it with other well-known classifiers by various experiments on synthetic data and real data for text classification.
  • Keywords
    data compression; error statistics; pattern classification; statistical distributions; text analysis; Laplace estimators; Good-Turing estimators; large alphabet source universal compression; likelihood ratio test; minimum error probability; optimal decision rule; pattern probability estimators; probability distributions; text classification; Error probability; Information theory; Light rail systems; Machine learning; Optical character recognition software; Probability distribution; Redundancy; Testing; Text categorization; Training data
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Information Theory Proceedings (ISIT), 2010 IEEE International Symposium on
  • Conference_Location
    Austin, TX
  • Print_ISBN
    978-1-4244-7890-3
  • Electronic_ISBN
    978-1-4244-7891-0
  • Type

    conf

  • DOI
    10.1109/ISIT.2010.5513570
  • Filename
    5513570
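
The baseline approach mentioned in the abstract, estimating each class likelihood with a standard estimator and plugging the estimates into an LRT, can be illustrated with a minimal sketch. This is not the paper's proposed pattern-based rule. It assumes a two-class problem, equal priors (so the log-likelihood ratio is compared to zero), a known alphabet size, and the add-one (Laplace) estimator; the function names `laplace_log_likelihood` and `classify_lrt` are hypothetical.

```python
import math
from collections import Counter


def laplace_log_likelihood(test, train, alphabet_size):
    """Log-probability of `test` under the add-one (Laplace) estimate from `train`.

    Assumption: every symbol comes from a known alphabet of `alphabet_size` symbols.
    """
    counts = Counter(train)
    total = len(train) + alphabet_size
    # Symbols unseen in training still get probability 1 / total.
    return sum(math.log((counts[x] + 1) / total) for x in test)


def classify_lrt(test, train_a, train_b, alphabet_size):
    """Plug-in likelihood ratio test with equal priors; ties go to class 'a'."""
    ll_a = laplace_log_likelihood(test, train_a, alphabet_size)
    ll_b = laplace_log_likelihood(test, train_b, alphabet_size)
    return "a" if ll_a >= ll_b else "b"


# Toy usage on a binary alphabet {"a", "b"}.
print(classify_lrt("aa", "aaab", "bbba", alphabet_size=2))
print(classify_lrt("bb", "aaab", "bbba", alphabet_size=2))
```

When the alphabet is large relative to the training data, most test symbols are unseen in training and the add-one term dominates the estimate, which is the regime the abstract singles out as motivating pattern probability estimators.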