DocumentCode :
2359288
Title :
An Improvement of Centroid-Based Classification Algorithm for Text Classification
Author :
Cataltepe, Zehra ; Aygun, Eser
Author_Institution :
Istanbul Tech. Univ., Istanbul
fYear :
2007
fDate :
17-20 April 2007
Firstpage :
952
Lastpage :
956
Abstract :
k-nearest neighbor and centroid-based classification algorithms are frequently used in text classification due to their simplicity and performance. While the k-nearest neighbor algorithm usually performs well in terms of accuracy, it is slow in the recognition phase, because the distances/similarities between the new data point to be recognized and all the training data need to be computed. On the other hand, centroid-based classification algorithms are very fast, because only as many distance/similarity computations as the number of centroids (i.e. classes) need to be done. In this paper, we evaluate the performance of the centroid-based classification algorithm and compare it to nearest mean and nearest neighbor algorithms on 9 data sets. We propose and evaluate an improvement on the centroid-based classification algorithm. The proposed algorithm starts from the centroids of each class and increases the weight of misclassified training data points in the centroid computation until the validation error starts increasing. The weight increase is based on the training confusion matrix entries for misclassified points. The proposed algorithm results in smaller test error than the centroid-based classification algorithm in 7 out of 9 data sets. It is also better than the 10-nearest neighbor algorithm in 8 out of 9 data sets. We also evaluate different similarity metrics together with centroid and nearest neighbor algorithms. We find that, when Euclidean distance is turned into a similarity measure using division as opposed to exponentiation, Euclidean-based similarity can perform almost as well as cosine similarity.
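The procedure described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulas, so several details here are assumptions rather than the authors' method: the division form `1 / (1 + d)`, the exponentiation form `exp(-d)`, the weight update (a misclassified point's weight grows by `step` times its confusion matrix entry divided by the size of its true class), and all function and parameter names (`fit`, `step`, `max_iter`, and so on) are hypothetical.

```python
import math
from collections import defaultdict

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def sim_division(a, b):
    # Assumed "division" form of a Euclidean-based similarity.
    return 1.0 / (1.0 + euclidean(a, b))

def sim_exponent(a, b):
    # Assumed "exponentiation" form of a Euclidean-based similarity.
    return math.exp(-euclidean(a, b))

def sim_cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def weighted_centroids(X, y, w):
    # Per-class weighted mean of the training vectors.
    sums, totals = {}, defaultdict(float)
    for xi, yi, wi in zip(X, y, w):
        acc = sums.setdefault(yi, [0.0] * len(xi))
        for k, v in enumerate(xi):
            acc[k] += wi * v
        totals[yi] += wi
    return {c: [v / totals[c] for v in acc] for c, acc in sums.items()}

def predict(centroids, x, sim=sim_cosine):
    # One similarity computation per class centroid.
    return max(centroids, key=lambda c: sim(centroids[c], x))

def error_rate(centroids, X, y, sim=sim_cosine):
    wrong = sum(predict(centroids, xi, sim) != yi for xi, yi in zip(X, y))
    return wrong / len(X)

def fit(X_tr, y_tr, X_val, y_val, sim=sim_cosine, step=1.0, max_iter=50):
    w = [1.0] * len(X_tr)                      # start from plain centroids
    best = weighted_centroids(X_tr, y_tr, w)
    best_err = error_rate(best, X_val, y_val, sim)
    class_size = defaultdict(int)
    for yi in y_tr:
        class_size[yi] += 1
    for _ in range(max_iter):
        preds = [predict(best, xi, sim) for xi in X_tr]
        if preds == list(y_tr):
            break                              # nothing left to reweight
        confusion = defaultdict(int)           # (true, predicted) -> count
        for yi, pi in zip(y_tr, preds):
            confusion[(yi, pi)] += 1
        for i, (yi, pi) in enumerate(zip(y_tr, preds)):
            if yi != pi:                       # assumed update rule
                w[i] += step * confusion[(yi, pi)] / class_size[yi]
        cand = weighted_centroids(X_tr, y_tr, w)
        err = error_rate(cand, X_val, y_val, sim)
        if err > best_err:
            break                              # validation error started increasing
        best, best_err = cand, err
    return best
```

A call such as `fit(X_tr, y_tr, X_val, y_val, sim=sim_division)` would swap in the division-based similarity. Recognition stays cheap in this sketch because `predict` compares a new point only against one centroid per class, not against every training point.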
Keywords :
pattern classification; text analysis; Euclidean-based similarity; centroid-based classification algorithm; k-nearest neighbor algorithm; text classification; Classification algorithms; Clustering algorithms; Euclidean distance; Frequency; Internet; Nearest neighbor searches; Performance evaluation; Testing; Text categorization; Training data;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Data Engineering Workshop, 2007 IEEE 23rd International Conference on
Conference_Location :
Istanbul
Print_ISBN :
978-1-4244-0832-0
Electronic_ISBN :
978-1-4244-0832-0
Type :
conf
DOI :
10.1109/ICDEW.2007.4401090
Filename :
4401090