DocumentCode :
3642096
Title :
An automated domain specific stop word generation method for natural language text classification
Author :
Hakan Ayral;Sirma Yavuz
Author_Institution :
Yildiz Technical University, Computer Engineering Department, 34349 Yildiz/Istanbul
fYear :
2011
fDate :
6/1/2011 12:00:00 AM
Firstpage :
500
Lastpage :
503
Abstract :
In this paper we propose an automated method for generating domain-specific stop words to improve classification of natural language content. We also implemented a Bayesian natural language classifier for web pages, based on maximum a posteriori probability estimation of keyword distributions under the bag-of-words model, to test the generated stop words. We investigated the distribution of the stop-word lists generated by our model and compared their contents against a generic stop-word list for the English language. We also show that the document coverage rank and topic coverage rank of words in natural language corpora follow Zipf's law, just as word frequency rank is known to do.
Keywords :
"Natural languages","Semantics","Sparse matrices","Bayesian methods","Arrays"
Publisher :
IEEE
Conference_Title :
Innovations in Intelligent Systems and Applications (INISTA), 2011 International Symposium on
Print_ISBN :
978-1-61284-919-5
Type :
conf
DOI :
10.1109/INISTA.2011.5946149
Filename :
5946149