DocumentCode :
552406
Title :
N-gram based text classification for Persian newspaper corpus
Author :
Farhoodi, Mojgan ; Yari, Alireza ; Sayah, Ali
Author_Institution :
Iran Telecommun. Res. Center, Tehran, Iran
fYear :
2011
fDate :
16-18 Aug. 2011
Firstpage :
55
Lastpage :
59
Abstract :
Statistical n-gram language modeling is applied in many domains, such as speech recognition, language identification, machine translation, character recognition, and topic classification. Most language modeling approaches work on n-grams of words. In this paper, we employ a language model classifier based on word-level n-grams for Persian text classification. The presented approach computes the occurrence probability of word sequences in the training data. Then, by extracting the word sequences from the test data, it predicts the class with the highest probability for a given news text. We show that statistical language modeling can achieve high classification performance. Experimental results on the Hamshahri corpus are satisfactory and show that n-grams of length 3 are the most useful for Persian text classification.
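The approach described in the abstract can be illustrated with a minimal sketch: one word-level n-gram model per class, with the test text assigned to the class whose model gives it the highest probability. This is not the authors' implementation. The class name NgramClassifier, the whitespace tokenizer, the "<s>" padding token, and the add-one (Laplace) smoothing are all assumptions made for illustration; the abstract does not say which smoothing method or preprocessing the paper uses.

```python
import math
from collections import Counter, defaultdict


def ngrams(tokens, n):
    """Return the word-level n-grams of a token list, padded at the start."""
    padded = ["<s>"] * (n - 1) + list(tokens)  # "<s>" is an assumed padding token
    return [tuple(padded[i:i + n]) for i in range(len(padded) - n + 1)]


class NgramClassifier:
    """Hypothetical per-class n-gram language model classifier (illustrative only)."""

    def __init__(self, n=3):
        self.n = n
        self.ngram_counts = defaultdict(Counter)    # class -> n-gram counts
        self.context_counts = defaultdict(Counter)  # class -> (n-1)-gram context counts
        self.vocab = set()

    def train(self, labeled_docs):
        """Count n-grams and their contexts for each (label, text) pair."""
        for label, text in labeled_docs:
            tokens = text.split()  # assumption: whitespace tokenization
            self.vocab.update(tokens)
            for gram in ngrams(tokens, self.n):
                self.ngram_counts[label][gram] += 1
                self.context_counts[label][gram[:-1]] += 1

    def log_prob(self, label, text):
        """Log-probability of the text under one class model, add-one smoothed."""
        v = len(self.vocab) + 1  # +1 reserves mass for unseen words
        total = 0.0
        for gram in ngrams(text.split(), self.n):
            num = self.ngram_counts[label][gram] + 1
            den = self.context_counts[label][gram[:-1]] + v
            total += math.log(num / den)
        return total

    def classify(self, text):
        """Return the trained class whose model scores the text highest."""
        return max(self.ngram_counts, key=lambda label: self.log_prob(label, text))
```

Summing log-probabilities instead of multiplying raw probabilities avoids numeric underflow on long news texts. The default n=3 mirrors the abstract's finding that n-grams of length 3 worked best; a real system for Persian would also need language-specific normalization and tokenization, which this sketch omits.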
Keywords :
computational linguistics; pattern classification; probability; publishing; text analysis; Hamshahri corpus; Persian newspaper corpus; Persian text classification; language models classifier; n-gram based text classification; occurrence probability; statistical n-gram language modeling; word sequence extraction; Accuracy; Computational modeling; Equations; Mathematical model; Smoothing methods; Text categorization; Training; N-gram; language modeling
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Digital Content, Multimedia Technology and its Applications (IDCTA), 2011 7th International Conference on
Conference_Location :
Busan
Print_ISBN :
978-1-4577-0473-4
Electronic_ISBN :
978-89-88678-47-3
Type :
conf
Filename :
6016631