DocumentCode
3273803
Title
Document Classification through Building Specified N-Gram
Author
Ko, Byeongkyu ; Choi, Dongjin ; Choi, Chang ; Choi, Junho ; Kim, Pankoo
Author_Institution
Dept. of Comput. Eng., Chosun Univ., Gwangju, South Korea
fYear
2012
fDate
4-6 July 2012
Firstpage
171
Lastpage
176
Abstract
This paper proposes a method for classifying textual documents using a specified n-gram data set. We live in a world where web documents have great potential, and the amount of valuable information has grown consistently over the years. However, finding the web documents relevant to what users want has become more difficult because of the sheer size of the web. Many approaches have been suggested to overcome this obstacle. The most important task is classifying textual documents into predefined categories. Although many statistical approaches have been introduced over the years, none has yet provided a perfect solution. In this paper, we suggest a method for textual document classification using an n-gram model. N-gram frequency data has great potential for finding similarities between documents. For this reason, we construct our own n-gram data sets from research papers. When an unknown document is submitted, the system extracts n-grams from it. These n-grams are then compared with the n-grams in the previously built data sets using the proposed similarity measure. The precision of this method reaches 86%.
Keywords
Internet; pattern classification; statistical analysis; text analysis; Web documents; document similarity measurement; n-gram data set frequency extraction; precision rate; statistical approaches; textual document classification; Buildings; Computers; Databases; Google; HTML; Support vector machines; Training; Document Classification; N-gram; NLP; Statistical Language Modeling
fLanguage
English
Publisher
IEEE
Conference_Titel
Innovative Mobile and Internet Services in Ubiquitous Computing (IMIS), 2012 Sixth International Conference on
Conference_Location
Palermo
Print_ISBN
978-1-4673-1328-5
Type
conf
DOI
10.1109/IMIS.2012.142
Filename
6296850