Regional Information Center for Science and Technology - An automated domain specific stop word generation method for natural language text classification

DocumentCode :

3642096

Title :

An automated domain specific stop word generation method for natural language text classification

Author :

Hakan Ayral;Sirma Yavuz

Author_Institution :

Yildiz Technical University, Computer Engineering Department, 34349 Yildiz/Istanbul

fYear :

2011

fDate :

6/1/2011

Firstpage :

500

Lastpage :

503

Abstract :

In this paper we propose an automated method for generating domain-specific stop words to improve classification of natural language content. To test the generated stop words, we also implemented a Bayesian natural language classifier that works on web pages and is based on maximum a posteriori probability estimation of keyword distributions using a bag-of-words model. We investigated the distribution of the stop-word lists generated by our model and compared their contents against a generic stop-word list for the English language. We also show that the document coverage rank and topic coverage rank of words belonging to natural language corpora follow Zipf's law, just as the word frequency rank is known to do.
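
The abstract does not spell out how the stop words are selected, so the following is only a minimal sketch of one plausible reading, not the authors' method. It assumes that "document coverage" means the fraction of documents containing a word and "topic coverage" means the fraction of topics in which the word occurs, and it treats words that score high on both as candidate domain-specific stop words. The function names, the thresholds, and the toy corpus are all hypothetical.

```python
from collections import defaultdict


def coverage_ranks(docs):
    """Compute per-word document coverage and topic coverage.

    docs is a list of (topic, text) pairs. Both coverage definitions
    are assumptions made for illustration, not taken from the paper.
    """
    doc_count = defaultdict(int)   # number of documents containing the word
    topic_sets = defaultdict(set)  # set of topics in which the word occurs
    topics = set()
    for topic, text in docs:
        topics.add(topic)
        # Use a set so each word counts once per document (bag-of-words presence).
        for word in set(text.lower().split()):
            doc_count[word] += 1
            topic_sets[word].add(topic)
    n_docs = len(docs)
    n_topics = len(topics)
    doc_cov = {w: c / n_docs for w, c in doc_count.items()}
    topic_cov = {w: len(t) / n_topics for w, t in topic_sets.items()}
    return doc_cov, topic_cov


def candidate_stop_words(docs, min_topic_cov=1.0, min_doc_cov=0.5):
    """Return words whose coverage meets both (hypothetical) thresholds."""
    doc_cov, topic_cov = coverage_ranks(docs)
    return sorted(
        w for w in doc_cov
        if topic_cov[w] >= min_topic_cov and doc_cov[w] >= min_doc_cov
    )


# Toy corpus, invented for this example only.
docs = [
    ("sports", "the team won the match"),
    ("sports", "the coach praised the team"),
    ("finance", "the bank raised the rate"),
    ("finance", "the market fell as the rate rose"),
]
print(candidate_stop_words(docs))
```

On this toy corpus only "the" occurs in every topic and in at least half of the documents, so it is the only candidate returned; words such as "team" or "rate" stay in the vocabulary because they are confined to a single topic and therefore carry information a classifier can use.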

Keywords :

"Natural languages","Semantics","Sparse matrices","Bayesian methods","Arrays"

Publisher :

IEEE

Conference_Title :

Innovations in Intelligent Systems and Applications (INISTA), 2011 International Symposium on

Print_ISBN :

978-1-61284-919-5

Type :

conf

DOI :

10.1109/INISTA.2011.5946149

Filename :

5946149

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=3642096