• DocumentCode
    2373583
  • Title
    Bilingual sentence matching using Kernel CCA
  • Author
    Tripathi, Abhishek ; Klami, Arto ; Virpioja, Sami
  • Author_Institution
    Dept. of Comput. Sci., Univ. of Helsinki, Helsinki, Finland
  • fYear
    2010
  • fDate
    Aug. 29 2010-Sept. 1 2010
  • Firstpage
    130
  • Lastpage
    135
  • Abstract
    The problem of matching samples between two data sets is a fundamental task in unsupervised learning. In this paper we propose an algorithm based on the statistical dependency between the data sets to solve the matching problem in the general case where the samples in the two data sets have different feature representations. As a concrete example, we consider the task of sentence-level alignment of a parallel corpus based on monolingual data. Multilingual text collections with sentence-level alignment are required by statistical machine translation methods. We show how statistical dependencies between feature representations of partially aligned (e.g., paragraph-aligned) corpora can be used to learn sentence-level alignment in a data-driven way. Our novel matching algorithm based on Kernel Canonical Correlation Analysis (KCCA) outperforms an earlier algorithm using linear CCA.
  • Keywords
    language translation; natural language processing; statistical analysis; unsupervised learning; Kernel canonical correlation analysis; bilingual sentence matching; multilingual text collections; sentence-level alignment; statistical dependency; statistical machine translation methods; Accuracy; Convergence; Correlation; Cost function; Iterative methods; Kernel; Semantics
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Machine Learning for Signal Processing (MLSP), 2010 IEEE International Workshop on
  • Conference_Location
    Kittila
  • ISSN
    1551-2541
  • Print_ISBN
    978-1-4244-7875-0
  • Electronic_ISBN
    1551-2541
  • Type
    conf
  • DOI
    10.1109/MLSP.2010.5589249
  • Filename
    5589249