Title :
Dimensionality reduction in automated evaluation of descriptive answers through zero variance, near zero variance and non frequent words techniques - a comparison
Author :
Sunil Kumar C;R.J. RamaSree
Author_Institution :
Research and Development Center, Bharathiar University, Coimbatore, India
Abstract :
In this paper, we study how models perform when uncommon features are excluded from the data used to train models for automated evaluation of descriptive answers, a task that can be viewed as a text classification problem. Two techniques, nonZeroVar and findFreqTerms, were applied independently to the text corpus to identify uncommon features and eliminate them, thereby achieving dimensionality reduction. The implementation details of both techniques are discussed. Models were built from the reduced datasets, and 10-fold cross-validation was repeated 10 times to obtain the performance measurements. The usefulness of these feature selection techniques was analyzed quantitatively using the following criteria: ease of implementation, number of features retained after dimensionality reduction, accuracy, kappa, mean absolute error, F-score, and area under the curve. Based on the measurements, we conclude that both nonZeroVar and findFreqTerms help eliminate uncommon features from the training set, and that doing so improves the model's performance in automated evaluation of descriptive answers. Another significant conclusion is that nonZeroVar with its default values is quick and easy to use, whereas findFreqTerms requires a trial-and-error approach to set the parameter values that yield optimal performance. Although nonZeroVar is easier to implement, its models perform no worse than those built with findFreqTerms. The final inference is that, for dimensionality reduction in text classification, nonZeroVar is the better technique of the two.
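The two filtering ideas named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names `near_zero_variance_terms` and `frequent_terms`, the toy corpus, and the thresholds (`freq_cut=95/5`, `unique_cut=10`, `low_freq=5`) are assumptions chosen for illustration. The first function flags terms whose document-term column is constant or dominated by a single value; the second keeps terms whose total count meets a manually chosen minimum, which is the parameter the abstract says must be tuned by trial and error.

```python
from collections import Counter


def document_term_matrix(docs):
    """Return (vocab, rows), where rows[i][j] counts term j in document i."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    index = {term: j for j, term in enumerate(vocab)}
    rows = []
    for toks in tokenized:
        row = [0] * len(vocab)
        for tok in toks:
            row[index[tok]] += 1
        rows.append(row)
    return vocab, rows


def near_zero_variance_terms(vocab, rows, freq_cut=95 / 5, unique_cut=10.0):
    """Flag terms with zero or near-zero variance (assumed thresholds).

    A term is flagged if its column holds a single value, or if the most
    common value outnumbers the second most common by more than freq_cut
    AND the share of distinct values is below unique_cut percent.
    """
    flagged = []
    n_docs = len(rows)
    for j, term in enumerate(vocab):
        counts = Counter(row[j] for row in rows).most_common()
        if len(counts) == 1:
            flagged.append(term)  # zero variance: same count in every document
            continue
        freq_ratio = counts[0][1] / counts[1][1]
        pct_unique = 100.0 * len(counts) / n_docs
        if freq_ratio > freq_cut and pct_unique < unique_cut:
            flagged.append(term)
    return flagged


def frequent_terms(vocab, rows, low_freq):
    """Keep terms whose total count across the corpus is at least low_freq."""
    return [
        term
        for j, term in enumerate(vocab)
        if sum(row[j] for row in rows) >= low_freq
    ]


# Toy corpus (hypothetical): 40 short "answers".
docs = ["good answer"] * 20 + ["poor answer"] * 19 + ["poor answer photosynthesis"]
vocab, rows = document_term_matrix(docs)

dropped_by_variance = near_zero_variance_terms(vocab, rows)
kept_by_variance = [t for t in vocab if t not in dropped_by_variance]
kept_by_frequency = frequent_terms(vocab, rows, low_freq=5)
```

On this toy corpus the two filters behave differently: the variance-based filter also removes a term that occurs identically in every document, since such a term cannot separate classes, while the frequency-based filter retains it and removes only the rare term. The frequency filter's result depends directly on the chosen `low_freq`, whereas the variance filter runs with its default thresholds.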
Keywords :
"Frequency measurement","Accuracy","Area measurement"
Conference_Title :
Intelligent Systems and Control (ISCO), 2015 IEEE 9th International Conference on
DOI :
10.1109/ISCO.2015.7282351