DocumentCode :
3531069
Title :
Incorporating monolingual corpora into bilingual latent semantic analysis for crosslingual LM adaptation
Author :
Tam, Yik-Cheung ; Schultz, Tanja
Author_Institution :
InterACT, Carnegie Mellon Univ., Pittsburgh, PA
fYear :
2009
fDate :
19-24 April 2009
Firstpage :
4821
Lastpage :
4824
Abstract :
The major limitation of bilingual latent semantic analysis (bLSA) is its requirement for parallel training corpora. Motivated by semi-supervised learning, we propose a cluster-based bLSA training approach that incorporates monolingual corpora. Each parallel document pair is treated as the centroid of a parallel document cluster, and each monolingual document is associated with the closest centroid according to topic similarity. The resulting parallel document clusters are used as constraints to enforce a one-to-one topic correspondence in variational EM. A slight performance improvement in crosslingual language model adaptation is observed compared to the baseline without monolingual corpora.
Keywords :
learning (artificial intelligence); speech processing; bilingual latent semantic analysis; crosslingual language model adaptation; monolingual corpora; monolingual document; parallel document clusters; parallel training corpora; semisupervised learning; variational EM; Adaptation model; Concatenated codes; Constraint theory; Contracts; Electrical capacitance tomography; Lagrangian functions; Semisupervised learning; Singular value decomposition; Surface-mount technology; Vocabulary; bilingual LSA; crosslingual LM adaptation; crosslingual word trigger; monolingual corpora;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Acoustics, Speech and Signal Processing, 2009. ICASSP 2009. IEEE International Conference on
Conference_Location :
Taipei
ISSN :
1520-6149
Print_ISBN :
978-1-4244-2353-8
Electronic_ISBN :
1520-6149
Type :
conf
DOI :
10.1109/ICASSP.2009.4960710
Filename :
4960710