Title :
An automated domain specific stop word generation method for natural language text classification
Author :
Hakan Ayral;Sirma Yavuz
Author_Institution :
Yildiz Technical University, Computer Engineering Department, 34349 Yildiz/Istanbul
Date :
6/1/2011
Abstract :
In this paper we propose an automated method for generating domain-specific stop words to improve the classification of natural language content. To test the generated stop words, we also implemented a Bayesian natural language classifier for web pages, based on maximum a posteriori probability estimation of keyword distributions under a bag-of-words model. We investigated the distribution of the stop-word lists generated by our model and compared their contents against a generic stop-word list for the English language. We also show that the document coverage rank and topic coverage rank of words in natural language corpora follow Zipf's law, just as word frequency rank is known to do.
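The kind of classifier the abstract describes can be illustrated with a minimal sketch: a bag-of-words classifier that picks the class with the maximum a posteriori probability, with an optional stop-word set removed before counting. This is not the authors' implementation. The function names (`tokenize`, `train`, `classify`), the whitespace tokenizer, the add-one (Laplace) smoothing, and the toy documents are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Assumption: lowercase whitespace tokenization stands in for real web-page parsing.
    return text.lower().split()


def train(docs, stop_words=frozenset()):
    """Build bag-of-words counts per class from (text, label) pairs."""
    class_docs = Counter()              # number of training documents per class
    word_counts = defaultdict(Counter)  # per-class word frequencies
    vocab = set()
    for text, label in docs:
        class_docs[label] += 1
        for word in tokenize(text):
            if word not in stop_words:
                word_counts[label][word] += 1
                vocab.add(word)
    return class_docs, word_counts, vocab


def classify(text, model, stop_words=frozenset()):
    """Return the class maximizing log P(class) + sum of log P(word | class)."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    # Words never seen in training are ignored.
    words = [w for w in tokenize(text) if w not in stop_words and w in vocab]
    best_label, best_score = None, -math.inf
    for label in class_docs:
        score = math.log(class_docs[label] / total_docs)
        # Assumption: add-one smoothing so unseen (word, class) pairs keep nonzero probability.
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in words:
            score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy data, invented for this sketch.
docs = [
    ("the match ended with a late goal", "sports"),
    ("the team won the cup", "sports"),
    ("the bank raised the interest rate", "finance"),
    ("the market fell as the rate rose", "finance"),
]
stop_words = frozenset({"the", "a", "with", "as"})
model = train(docs, stop_words)
```

With this toy model, `classify("the team scored a goal", model, stop_words)` selects `"sports"`. Passing a different `stop_words` set to both `train` and `classify` is how a generated domain-specific list could be compared against a generic one in such a setup.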
Keywords :
"Natural languages","Semantics","Sparse matrices","Bayesian methods","Arrays"
Conference_Title :
Innovations in Intelligent Systems and Applications (INISTA), 2011 International Symposium on
Print_ISBN :
978-1-61284-919-5
DOI :
10.1109/INISTA.2011.5946149