• DocumentCode
    3251704
  • Title
    Adaptive parallel sentences mining from web bilingual news collection
  • Author
    Zhao, Bing ; Vogel, Stephan
  • Author_Institution
    Sch. of Comput. Sci., Carnegie Mellon Univ., Pittsburgh, PA, USA
  • fYear
    2002
  • fDate
    2002
  • Firstpage
    745
  • Lastpage
    748
  • Abstract
    In this paper, a robust, adaptive approach for mining parallel sentences from a bilingual comparable news collection is described. Sentence length models and lexicon-based models are combined under a maximum likelihood criterion. Specific models are proposed to handle insertions and deletions that are frequent in bilingual data collected from the web. The proposed approach is adaptive, updating the translation lexicon iteratively using the mined parallel data to get better vocabulary coverage and translation probability parameter estimation. Experiments are carried out on 10 years of the Xinhua bilingual news collection. Using the mined data, we get a significant improvement in word-to-word alignment accuracy in machine translation modeling.
  • Keywords
    data mining; dynamic programming; language translation; maximum likelihood estimation; Web bilingual news collection; Xinhua bilingual news collection; adaptive approach; adaptive parallel sentences mining; lexicon-based models; machine translation modeling; maximum likelihood criterion; mined parallel data; sentence length models; translation probability parameter estimation; vocabulary coverage; Computer science; Information retrieval; Maximum likelihood estimation; Natural language processing; Natural languages; Parameter estimation; Probability; Robustness; Vocabulary; Web pages;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Data Mining, 2002. ICDM 2002. Proceedings. 2002 IEEE International Conference on
  • Print_ISBN
    0-7695-1754-4
  • Type
    conf
  • DOI
    10.1109/ICDM.2002.1184044
  • Filename
    1184044