DocumentCode :
3008674
Title :
Class-dependent Canonical Correlation Analysis for scalable cross-lingual document categorization
Author :
Abdel Hady, Mohamed Farouk ; Asham, Mina
Author_Institution :
Adv. Technol. Lab., Microsoft Res., Cairo, Egypt
fYear :
2013
fDate :
16-19 April 2013
Firstpage :
308
Lastpage :
315
Abstract :
Canonical Correlation Analysis (CCA) is used to infer a semantic space into which text documents written in different languages can be mapped, yielding a language-independent representation called latent topics. This greatly reduces the complexity of dealing with different languages, since a document classifier can be trained on the labeled documents in one language and then applied to classify documents in another language. This topic modeling task is usually performed in a class-independent manner. The performance of CCA depends on the number of documents used for inferring the semantic space. However, CCA has a high computational complexity with respect to the number of training documents. In this paper, we propose CD-CCA, a scalable variant of CCA in which the projection is performed in a class-dependent manner. It generates a semantic space for each category separately. Then a binary document classifier is trained for each category on its own semantic space. CD-CCA was applied to English-Chinese document classification. The experimental results showed that CD-CCA can deal with large training sets without hurting the performance of the underlying classifiers compared to traditional CCA. CD-CCA opens the door to distributed training of the semantic spaces of the different categories.
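The class-dependent scheme described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the function names (`fit_cca`, `fit_cd_cca`, `predict`, `make_views`), the synthetic two-view data standing in for English and Chinese document vectors, the choice to fit each category's CCA on the paired documents of that category only, and the ridge least-squares scorer used as a stand-in for the binary classifier.

```python
import numpy as np

def fit_cca(X, Y, k, reg=1e-3):
    """Regularized linear CCA on paired rows of X and Y (hypothetical helper)."""
    mx, my = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - mx, Y - my
    n = X.shape[0]
    Cxx = Xc.T @ Xc / n + reg * np.eye(X.shape[1])
    Cyy = Yc.T @ Yc / n + reg * np.eye(Y.shape[1])
    Cxy = Xc.T @ Yc / n
    # Whiten both views with Cholesky factors, then take the SVD of the
    # whitened cross-covariance; singular values are canonical correlations.
    Lx, Ly = np.linalg.cholesky(Cxx), np.linalg.cholesky(Cyy)
    T = np.linalg.solve(Lx, Cxy) @ np.linalg.inv(Ly).T
    U, s, Vt = np.linalg.svd(T, full_matrices=False)
    Wx = np.linalg.solve(Lx.T, U[:, :k])
    Wy = np.linalg.solve(Ly.T, Vt.T[:, :k])
    return (mx, Wx), (my, Wy), s[:k]

def project(view, D):
    """Map documents D into the semantic space of one view."""
    mean, W = view
    return (D - mean) @ W

def add_bias(Z):
    return np.hstack([Z, np.ones((Z.shape[0], 1))])

def fit_cd_cca(X_par, Y_par, y_par, X_train, y_train, k, ridge=1e-2):
    """One semantic space and one binary scorer per category."""
    models = {}
    for c in np.unique(y_train):
        in_c = y_par == c
        # Assumption: the space for category c uses only its paired documents.
        src, tgt, _ = fit_cca(X_par[in_c], Y_par[in_c], k)
        Zb = add_bias(project(src, X_train))
        t = np.where(y_train == c, 1.0, -1.0)  # one-vs-rest targets
        # Ridge least squares stands in for the paper's binary classifier.
        w = np.linalg.solve(Zb.T @ Zb + ridge * np.eye(Zb.shape[1]), Zb.T @ t)
        models[c] = (src, tgt, w)
    return models

def predict(models, D, side):
    """Score D in every category's space and return the best category."""
    classes = sorted(models)
    scores = []
    for c in classes:
        src, tgt, w = models[c]
        view = src if side == "source" else tgt
        scores.append(add_bias(project(view, D)) @ w)
    return np.array(classes)[np.argmax(np.column_stack(scores), axis=1)]

def make_views(rng, A, B, means, n_per_class, noise=0.1):
    """Synthetic paired views sharing a latent topic vector per document."""
    y = np.repeat(np.arange(len(means)), n_per_class)
    z = means[y] + rng.normal(size=(len(y), means.shape[1]))
    X = z @ A + noise * rng.normal(size=(len(y), A.shape[1]))
    Y = z @ B + noise * rng.normal(size=(len(y), B.shape[1]))
    return X, Y, y

rng = np.random.default_rng(0)
A, B = rng.normal(size=(4, 20)), rng.normal(size=(4, 15))
means = 4.0 * np.eye(3, 4)  # three well-separated synthetic categories
X_par, Y_par, y_par = make_views(rng, A, B, means, 200)
_, Y_test, y_test = make_views(rng, A, B, means, 100)

# Train on the source view only, then classify unseen target-view documents.
models = fit_cd_cca(X_par, Y_par, y_par, X_par, y_par, k=4)
accuracy = float(np.mean(predict(models, Y_test, "target") == y_test))
print(f"target-view accuracy: {accuracy:.2f}")
```

Because each category's CCA is fitted on its own subset of documents, the per-category loop in `fit_cd_cca` has no shared state, which is what would allow the semantic spaces to be trained independently or in parallel.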
Keywords :
computational complexity; correlation theory; natural language processing; pattern classification; programming language semantics; text analysis; training; CD-CCA; English-Chinese document classification; binary document classification; canonical correlation analysis; class dependent CCA; distributed training document; document labeling; language independent representation; latent topics; scalable cross lingual document categorization; semantic space; text document; topic modeling task; Correlation; Feature extraction; Semantics; Text categorization; Training; Web pages; Cross Lingual Text Classification; Multilingual; Topic Modeling;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Computational Intelligence and Data Mining (CIDM), 2013 IEEE Symposium on
Conference_Location :
Singapore
Type :
conf
DOI :
10.1109/CIDM.2013.6597252
Filename :
6597252