DocumentCode :
2373583
Title :
Bilingual sentence matching using Kernel CCA
Author :
Tripathi, Abhishek ; Klami, Arto ; Virpioja, Sami
Author_Institution :
Dept. of Comput. Sci., Univ. of Helsinki, Helsinki, Finland
fYear :
2010
fDate :
Aug. 29 2010-Sept. 1 2010
Firstpage :
130
Lastpage :
135
Abstract :
The problem of matching samples between two data sets is a fundamental task in unsupervised learning. In this paper we propose an algorithm based on statistical dependency between the data sets to solve the matching problem in the general case where the samples in the two data sets have different feature representations. As a concrete example, we consider the task of sentence-level alignment of a parallel corpus based on monolingual data. Multilingual text collections with sentence-level alignment are required by statistical machine translation methods. We show how statistical dependencies between feature representations of partially aligned (e.g., paragraph-aligned) corpora can be used to learn sentence-level alignment in a data-driven way. Our novel matching algorithm based on Kernel Canonical Correlation Analysis (KCCA) outperforms an earlier algorithm using linear CCA.
Keywords :
language translation; natural language processing; statistical analysis; unsupervised learning; Kernel canonical correlation analysis; bilingual sentence matching; multilingual text collections; sentence-level alignment; statistical dependency; statistical machine translation methods; Accuracy; Convergence; Correlation; Cost function; Iterative methods; Kernel; Semantics
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Machine Learning for Signal Processing (MLSP), 2010 IEEE International Workshop on
Conference_Location :
Kittila
ISSN :
1551-2541
Print_ISBN :
978-1-4244-7875-0
Electronic_ISBN :
1551-2541
Type :
conf
DOI :
10.1109/MLSP.2010.5589249
Filename :
5589249