Title :
Multilingual and Hierarchical Classification of Large Datasets of Scientific Publications
Author :
Jaroslaw Protasiewicz;Tomasz Stanislawek;Slawomir Dadas
Author_Institution :
Lab. of Intell. Inf. Syst., Nat. Inf. Process. Inst., Warsaw, Poland
Abstract :
The aim of this paper was to propose a classification system, composed of monolingual classifiers and a multilingual decision module, for handling large numbers of multilingual documents. The system was compared with two monolingual classifiers, one for English and one for Polish, and with the maximum probability model. The tests were carried out on multilingual documents that contained components in two languages, English and Polish. The conclusion was that the proposed system can efficiently categorize a large number of documents that relate to assorted topics and simultaneously contain components in several languages. Additional objectives were to examine two ways of representing data, and to compare the hierarchical and flat (horizontal) approaches to classification, assuming that the structure of classes is hierarchical. The results showed that representing a document as separate features is better than representing it as a bag of words, and that the flat approach is only slightly better than the hierarchical approach.
Keywords :
"Chlorine","Clustering algorithms","Indexes","Ontologies","Internet","Algorithm design and analysis","Support vector machines"
Conference_Title :
Systems, Man, and Cybernetics (SMC), 2015 IEEE International Conference on
DOI :
10.1109/SMC.2015.294