Bilingual sentence matching using Kernel CCA

Author

Tripathi, Abhishek ; Klami, Arto ; Virpioja, Sami

Author_Institution

Dept. of Comput. Sci., Univ. of Helsinki, Helsinki, Finland

fYear

2010

fDate

Aug. 29 2010-Sept. 1 2010

Firstpage

130

Lastpage

135

Abstract

Matching samples between two data sets is a fundamental task in unsupervised learning. In this paper we propose an algorithm that uses the statistical dependency between the data sets to solve the matching problem in the general case where the samples in the two data sets have different feature representations. As a concrete example, we consider sentence-level alignment of a parallel corpus based on monolingual data. Multilingual text collections with sentence-level alignment are required by statistical machine translation methods. We show how statistical dependencies between the feature representations of partially aligned corpora (e.g., aligned at the paragraph level) can be used to learn a sentence-level alignment in a data-driven way. Our novel matching algorithm, based on Kernel Canonical Correlation Analysis (KCCA), outperforms an earlier algorithm that uses linear CCA.
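
The abstract gives no implementation details, so the following is only a minimal sketch of the general idea and not the authors' algorithm. It assumes a regularized KCCA fitted on already-paired samples, followed by a single one-to-one assignment of unpaired samples in the shared projection space. The RBF kernel, the function names, and the values of `gamma`, `reg`, and `n_components` are all illustrative assumptions.

```python
import numpy as np
from scipy.linalg import eigh
from scipy.optimize import linear_sum_assignment


def rbf_kernel(A, B, gamma):
    """RBF kernel between rows of A and rows of B (kernel choice is an assumption)."""
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))


def center_train(K):
    """Center an n x n training kernel matrix in feature space."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H


def center_test(K_test, K_train):
    """Center an m x n test-vs-train kernel using training statistics."""
    n = K_train.shape[0]
    one_n = np.ones((n, n)) / n
    one_mn = np.ones((K_test.shape[0], n)) / n
    return K_test - one_mn @ K_train - K_test @ one_n + one_mn @ K_train @ one_n


def fit_kcca(Kx, Ky, n_components, reg):
    """Regularized KCCA posed as a symmetric generalized eigenproblem."""
    n = Kx.shape[0]
    Z = np.zeros((n, n))
    Rx = Kx + reg * np.eye(n)
    Ry = Ky + reg * np.eye(n)
    A = np.block([[Z, Kx @ Ky], [Ky @ Kx, Z]])
    B = np.block([[Rx @ Rx, Z], [Z, Ry @ Ry]])
    # Symmetrize to remove floating-point asymmetry before calling eigh.
    vals, vecs = eigh((A + A.T) / 2.0, (B + B.T) / 2.0)
    top = np.argsort(vals)[::-1][:n_components]
    return vecs[:n, top], vecs[n:, top], vals[top]


def match_samples(X_train, Y_train, X_test, Y_test,
                  n_components=2, gamma=0.5, reg=1.0):
    """Fit KCCA on paired rows, then match X_test rows to Y_test rows.

    Returns (cols, corrs): X_test[i] is matched to Y_test[cols[i]], and
    corrs holds the leading regularized canonical correlations.
    """
    Kx = rbf_kernel(X_train, X_train, gamma)
    Ky = rbf_kernel(Y_train, Y_train, gamma)
    alpha, beta, corrs = fit_kcca(center_train(Kx), center_train(Ky),
                                  n_components, reg)
    # Project the unpaired samples of both views into the shared space.
    Px = center_test(rbf_kernel(X_test, X_train, gamma), Kx) @ alpha
    Py = center_test(rbf_kernel(Y_test, Y_train, gamma), Ky) @ beta
    # One-to-one matching that minimizes total squared distance.
    cost = ((Px[:, None, :] - Py[None, :, :]) ** 2).sum(axis=2)
    _, cols = linear_sum_assignment(cost)
    return cols, corrs
```

In a sentence-alignment setting, `X_train` and `Y_train` would hold feature vectors of sentence pairs already known to correspond, while `X_test` and `Y_test` would hold the sentences still to be aligned. How the features are built and how the partial alignment is exploited are not specified in this record, so those parts are left open here.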

Keywords

language translation; natural language processing; statistical analysis; unsupervised learning; Kernel canonical correlation analysis; bilingual sentence matching; multilingual text collections; sentence-level alignment; statistical dependency; statistical machine translation methods; Accuracy; Convergence; Correlation; Cost function; Iterative methods; Kernel; Semantics

fLanguage

English

Publisher

IEEE

Conference_Title

Machine Learning for Signal Processing (MLSP), 2010 IEEE International Workshop on

Conference_Location

Kittila

ISSN

1551-2541

Print_ISBN

978-1-4244-7875-0

Electronic_ISBN

1551-2541

Type

conf

DOI

10.1109/MLSP.2010.5589249

Filename

5589249