• DocumentCode
    3560606
  • Title

    A Maximum-Entropy Segmentation Model for Statistical Machine Translation

  • Author

    Deyi Xiong ; Min Zhang ; Haizhou Li

  • Author_Institution
    Dept. of Human Language Technol., Inst. for Infocomm Res., Singapore, Singapore
  • Volume
    19
  • Issue
    8
  • fYear
    2011
  • Firstpage
    2494
  • Lastpage
    2505
  • Abstract
    Segmentation is of great importance to statistical machine translation. It splits a source sentence into sequences of translatable segments. We propose a maximum-entropy segmentation model to capture desirable phrasal and hierarchical segmentations for statistical machine translation. We present an approach to automatically learning the beginning and ending boundaries of cohesive segments from word-aligned bilingual data without using any additional resources. The learned boundaries are then used to define cohesive segments in both phrasal and hierarchical segmentations. We integrate the segmentation model into phrasal statistical machine translation (SMT) and conduct experiments on the newswire and broadcast news domains to investigate the effectiveness of the proposed segmentation model on large-scale training data. Our experimental results show that the maximum-entropy segmentation model significantly improves translation quality in terms of BLEU. We further validate that 1) the proposed segmentation model significantly outperforms the syntactic constraints used in previous work to constrain segmentations; and 2) it is necessary to capture hierarchical segmentations in addition to phrasal segmentations.
  • Keywords
    computational linguistics; language translation; maximum entropy methods; statistical analysis; SMT; cohesive segments; desirable phrasal segmentations; hierarchical segmentations; large-scale training data; learned boundary; maximum-entropy segmentation model; phrasal statistical machine translation; source sentence; syntactic constraints; translatable segments; translation quality; word-aligned bilingual data; Decoding; Entropy; Feature extraction; Syntactics; Training; Training data; Bracketing transduction grammar (BTG)-based phrasal machine translation; hierarchical segmentation; maximum entropy; phrasal segmentation; statistical machine translation (SMT)
  • fLanguage
    English
  • Journal_Title
    Audio, Speech, and Language Processing, IEEE Transactions on
  • Publisher
    IEEE
  • Conference_Location
    4/21/2011 12:00:00 AM
  • ISSN
    1558-7916
  • Type

    jour

  • DOI
    10.1109/TASL.2011.2144971
  • Filename
    5753927