Title :
To be or not to be IID: Can Zipf's Law help?
Author :
Behe, Leo ; Wheeler, Zachary ; Nelson, Christie ; Knopp, Brian ; Pottenger, William M.
Author_Institution :
Lehigh Univ., Bethlehem, PA, USA
Abstract :
Classification is a popular problem within machine learning, and increasing the effectiveness of classification algorithms has many significant applications in industry and academia. In particular, focus is given to Higher-Order Naive Bayes (HONB), a relational variant of the well-known Naive Bayes (NB) statistical classification algorithm that has been shown to outperform NB in many cases [1, 10]. Specifically, HONB has outperformed NB on character n-gram based feature spaces when the available training data is small [2]. In this paper, a correlation is hypothesized between the performance of HONB on character n-gram feature spaces and how closely the feature space distribution follows Zipf's Law. This hypothesis stems from the overarching goal of understanding HONB and predicting when it will outperform NB. Textual datasets ranging from several thousand instances to nearly 20,000 instances, some containing microtext, were used to generate character n-gram feature spaces. HONB and NB were both used to model these datasets, using varying character n-gram sizes (2-7) and dictionary sizes up to 5000 features. The performance of HONB and NB was then compared, and the results show potential support for our hypothesis; in particular, the hypothesized correlation is supported for the Accuracy and Precision metrics. Additionally, a solution is provided for an open problem posed in [1]: an explicit formula for the number of systems of distinct representatives (SDRs) drawn from k given sets. This formula is connected to counting higher-order paths of arbitrary length, which are important in the learning stage of HONB.
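The abstract does not specify how the closeness of a feature space distribution to Zipf's Law was quantified. The sketch below is one minimal, hypothetical way to do so, not the authors' procedure: count overlapping character n-grams over a corpus, then fit a least-squares line to the log-log rank-frequency curve of the top dictionary-size features. A slope near -1 with a high R-squared suggests a near-Zipfian distribution. The corpus, the n-gram size n=3, and the 5000-feature dictionary cap are illustrative placeholders only.

```python
# Illustrative sketch only (not the paper's exact measure): estimate how closely a
# character n-gram feature space follows Zipf's law via a log-log rank-frequency fit.
from collections import Counter
import math

def char_ngrams(text, n):
    """Yield overlapping character n-grams of length n."""
    return (text[i:i + n] for i in range(len(text) - n + 1))

def zipf_fit(docs, n=3, dict_size=5000):
    """Return (slope, r_squared) of a least-squares line through
    log(rank) vs. log(frequency) for the top dict_size n-grams.
    A slope near -1 with high r_squared indicates a close Zipfian fit."""
    counts = Counter()
    for doc in docs:
        counts.update(char_ngrams(doc.lower(), n))
    freqs = [f for _, f in counts.most_common(dict_size)]
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    m = len(xs)
    mean_x, mean_y = sum(xs) / m, sum(ys) / m
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    syy = sum((y - mean_y) ** 2 for y in ys)
    slope = sxy / sxx
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, r_squared

# Example usage with a hypothetical toy corpus:
# slope, r2 = zipf_fit(["the quick brown fox jumps over the lazy dog"] * 100, n=3)
# print(slope, r2)
```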
Keywords :
Bayes methods; learning (artificial intelligence); natural language processing; pattern classification; text analysis; HONB; IID; Zipf's law; accuracy metrics; character n-gram based feature spaces; character n-gram feature spaces; classification algorithms; feature space distribution; higher-order naive Bayes; independent and identically distributed; machine learning; naive Bayes statistical classification algorithm; precision metrics; textual datasets; Accuracy; Classification algorithms; Correlation; Earthquakes; Measurement; Niobium; Prediction algorithms
Conference_Title :
2015 IEEE International Symposium on Technologies for Homeland Security (HST)
Conference_Location :
Waltham, MA
Print_ISBN :
978-1-4799-1736-5
DOI :
10.1109/THS.2015.7225274