DocumentCode :
1427045
Title :
Word-Map Systems for Content-Based Document Classification
Author :
Tsimboukakis, Nikos ; Tambouratzis, George
Author_Institution :
Inst. for Language & Speech Process., Athens, Greece
Volume :
41
Issue :
5
fYear :
2011
Firstpage :
662
Lastpage :
673
Abstract :
The main purpose of this paper is the classification of documents in terms of their content. Two systems are presented that share a two-level architecture comprising 1) a word map, created via unsupervised learning, that functions as a document-representation module and 2) a supervised multilayer-perceptron-based classifier. Two approaches to creating word maps are presented and compared; these are based on hidden Markov models (HMMs) and the self-organizing map. A series of experiments is performed on several datasets of text-only documents written in either Greek or English. A comparison with established methods, such as the support-vector machine (SVM), illustrates the effectiveness of the proposed systems.
Keywords :
content management; hidden Markov models; multilayer perceptrons; natural language processing; pattern classification; self-organising feature maps; text analysis; unsupervised learning; English; Greek; content based document classification; document representation module; hidden Markov models; self-organizing map; supervised multilayer perceptron based classifier; text only document; two-level architecture; unsupervised learning; word map; Hidden Markov models; Neural networks; Self organizing feature maps; Support vector machines; Text processing; Training; Hidden Markov models (HMMs); neural-network applications; self-organizing feature maps; text processing;
fLanguage :
English
Journal_Title :
Systems, Man, and Cybernetics, Part C: Applications and Reviews, IEEE Transactions on
Publisher :
IEEE
ISSN :
1094-6977
Type :
jour
DOI :
10.1109/TSMCC.2010.2096416
Filename :
5688251