DocumentCode :
2710784
Title :
Classifying High-Dimensional Text and Web Data Using Very Short Patterns
Author :
Malik, Hassan H. ; Kender, John R.
Author_Institution :
Dept. of Comput. Sci., Columbia Univ., New York, NY
fYear :
2008
fDate :
15-19 Dec. 2008
Firstpage :
923
Lastpage :
928
Abstract :
In this paper, we propose the "democratic classifier", a simple pattern-based classification algorithm that uses very short patterns for classification and does not rely on a minimum support threshold. Borrowing ideas from democracy, our training phase allows each training instance to vote for an equal number of candidate size-2 patterns. The training instances select patterns by effectively balancing the local, class, and global significance of patterns. The selected patterns are simultaneously added to the model for all applicable classes, and a novel power-law-based weighting scheme adjusts their weights with respect to each class. Results of experiments performed on 121 common text and Web datasets show that our algorithm almost always outperforms state-of-the-art classification algorithms, without any parameter tuning. On 100 real-life Web datasets, the average absolute classification accuracy improvement was as great as 9.4% over SVM, Harmony, C4.5, and KNN. Also, our algorithm ran about 3.5 times faster than the fastest existing pattern-based classification algorithm.
Keywords :
Internet; classification; learning (artificial intelligence); text analysis; Web data classification; democratic classifier; high-dimensional text classification; machine learning; minimum support threshold; power law based weighing scheme; very short pattern-based classification; Classification algorithms; Data mining; Frequency; Humans; Machine learning algorithms; Nominations and elections; Qualifications; Support vector machine classification; Support vector machines; Voting; Classification; feature selection; interestingness measures; pattern-based classification; text classification;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Data Mining, 2008. ICDM '08. Eighth IEEE International Conference on
Conference_Location :
Pisa
ISSN :
1550-4786
Print_ISBN :
978-0-7695-3502-9
Type :
conf
DOI :
10.1109/ICDM.2008.139
Filename :
4781202