• DocumentCode
    1881590
  • Title

    To be or not to be IID: Can Zipf's Law help?

  • Author

    Behe, Leo ; Wheeler, Zachary ; Nelson, Christie ; Knopp, Brian ; Pottenger, William M.

  • Author_Institution
    Lehigh Univ., Bethlehem, PA, USA
  • fYear
    2015
  • fDate
    14-16 April 2015
  • Firstpage
    1
  • Lastpage
    6
  • Abstract
    Classification is a popular problem within machine learning, and increasing the effectiveness of classification algorithms has many significant applications within industry and academia. This paper focuses on Higher-Order Naive Bayes (HONB), a relational variant of the famed Naive Bayes (NB) statistical classification algorithm that has been shown to outperform NB in many cases [1,10]. Specifically, HONB has outperformed NB on character n-gram based feature spaces when the available training data is small [2]. In this paper, a correlation is hypothesized between the performance of HONB on character n-gram feature spaces and how closely the feature space distribution follows Zipf's Law. This hypothesis stems from the overarching goal of understanding HONB and knowing when it will outperform NB. Textual datasets ranging from several thousand to nearly 20,000 instances, some containing microtext, were used to generate character n-gram feature spaces. HONB and NB were both used to model these datasets, using varying character n-gram sizes (2-7) and dictionary sizes up to 5000 features. The performances of HONB and NB were then compared, and the results show potential support for the hypothesis: they support the hypothesized correlation for the Accuracy and Precision metrics. Additionally, a solution is provided for an open problem presented in [1]: an explicit formula for the number of SDRs from k given sets. This formula is connected to counting higher-order paths of arbitrary length, which are important in the learning stage of HONB.
  • Keywords
    Bayes methods; learning (artificial intelligence); natural language processing; pattern classification; text analysis; HONB; IID; Zipf's law; accuracy metrics; character n-gram based feature spaces; character n-gram feature spaces; classification algorithms; feature space distribution; higher-order naive Bayes; independent and identically distributed; machine learning; naive Bayes statistical classification algorithm; precision metrics; textual datasets; Accuracy; Classification algorithms; Correlation; Earthquakes; Measurement; Niobium; Prediction algorithms;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Technologies for Homeland Security (HST), 2015 IEEE International Symposium on
  • Conference_Location
    Waltham, MA
  • Print_ISBN
    978-1-4799-1736-5
  • Type

    conf

  • DOI
    10.1109/THS.2015.7225274
  • Filename
    7225274