Author/Authors :
Aubakirov, S.S. al-Farabi Kazakh National University, Almaty, Kazakhstan , Trigo, P. Instituto Superior de Engenharia de Lisboa Biosystems and Integrative Sciences - Institute Agent and Systems Modeling, Lisbon, Portugal , Ahmed-Zaki, D. Zh. al-Farabi Kazakh National University, Almaty, Kazakhstan
Abstract :
In this paper, we propose an optimization workflow for predicting classifier accuracy
by exploring the space formed by different data features and the configurations of the
classification algorithms. The overall process is described for the text classification
problem. We consider three main features that affect text classification and therefore
the accuracy of classifiers. The first feature concerns the words that comprise the input
text; here we use the N-gram concept with different values of N. The second feature
concerns the adoption of textual pre-processing steps such as stop-word filtering and
stemming. The third feature concerns the hyperparameters of the classification algorithms.
We take the well-known classifiers K-Nearest Neighbors (KNN) and Naive Bayes (NB), where
K (for KNN) and the a priori probabilities (for NB) are hyperparameters that influence
accuracy. As a result, we explore the feature space (the correlation among textual and
classifier aspects) and present an approximation model that is able to predict classifier
accuracy.
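
To make the explored space concrete, the following is a minimal sketch, not the authors' implementation, of how word N-grams with optional stop-word filtering might be extracted and how the combinations of N, pre-processing switches, and the KNN parameter K could be enumerated. The names `STOP_WORDS`, `word_ngrams`, and `configuration_space`, the tiny stop-word list, and the chosen value ranges are all assumptions made for illustration; stemming appears only as a flag and is not implemented here.

```python
from itertools import product

# Hypothetical, deliberately tiny stop-word list (illustration only).
STOP_WORDS = {"the", "a", "of", "and", "to", "in"}


def word_ngrams(text, n, remove_stop_words=False):
    """Return the word N-grams of a text as a list of tuples."""
    tokens = text.lower().split()
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    # Slide a window of length n over the token sequence.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def configuration_space(n_values, k_values):
    """Enumerate every (N, stop-word flag, stemming flag, K) combination."""
    return [
        {"n": n, "stop_words": sw, "stemming": st, "k": k}
        for n, sw, st, k in product(n_values, (False, True), (False, True), k_values)
    ]


# Assumed example ranges; the paper does not state these values here.
configs = configuration_space(n_values=[1, 2, 3], k_values=[1, 3, 5])
bigrams = word_ngrams("the accuracy of the classifier", 2, remove_stop_words=True)
```

Each dictionary in `configs` describes one point of the space; in a workflow of the kind the abstract outlines, a classifier would be trained and scored at such points, and the resulting accuracies would feed the approximation model.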
Keywords :
text classification , learning algorithms , genetic algorithm , distributed computing