Bitext Dependency Parsing With Auto-Generated Bilingual Treebank

Author

Chen, Wenliang ; Kazama, Jun'ichi ; Zhang, Min ; Tsuruoka, Yoshimasa ; Zhang, Yujie ; Wang, Yiou ; Torisawa, Kentaro ; Li, Haizhou

Author_Institution

Dept. of Human Language Technol., Inst. for Infocomm Res., Singapore, Singapore

Volume

20

Issue

5

fYear

2012

fDate

7/1/2012

Firstpage

1461

Lastpage

1472

Abstract

This paper proposes a method to improve the accuracy of dependency parsing for bilingual texts (bitexts) by using an auto-generated bilingual treebank created with the help of statistical machine translation (SMT) systems. Previous bitext parsing methods use human-annotated bilingual treebanks, which are costly and difficult to obtain. In the proposed method, we use an auto-generated bilingual treebank to train the parsing models. First, an SMT system is used to translate a monolingual treebank into the target language; then, a monolingual parser for the target language is used to parse the translated sentences. Since the auto-translated sentences and auto-parsed trees in the auto-generated bilingual treebank are far from perfect, the bilingual constraints are not sufficiently reliable. To overcome this problem, we propose a method to verify the reliability of the constraints using a large amount of target monolingual and bilingual unannotated data. Finally, we design a set of effective bilingual features for parsing models on the basis of the verified constraints. We conduct experiments on standard test data. The experimental results show that our bitext parser significantly outperforms monolingual parsers. Moreover, our method still provides an improvement when we use a larger monolingual treebank containing over 50 000 sentences. We also test the proposed method with different SMT systems, and the results show that our method is very robust to noise. In particular, the proposed method can be used in a purely monolingual setting with the help of SMT; that is, it does not need a human translation of the test set, as previous methods do.
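
The pipeline described in the abstract (translate a monolingual treebank, parse the translations, then keep only constraints that can be verified against large unannotated data) can be illustrated with a minimal Python sketch. Everything named here is an assumption for illustration: `build_auto_bitreebank`, `dependency_pairs`, `is_verified`, the toy lexicon, and the toy parser are hypothetical stand-ins, and the count threshold is a simplified placeholder, since the abstract does not state how verification is actually performed.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple

Tokens = List[str]
Heads = List[int]       # heads[i] is the index of token i's head; -1 marks the root
Pair = Tuple[str, str]  # (head word, dependent word)


def build_auto_bitreebank(
    source_treebank: List[Tuple[Tokens, Heads]],
    translate: Callable[[Tokens], Tokens],
    parse_target: Callable[[Tokens], Heads],
) -> List[Dict[str, Tuple[Tokens, Heads]]]:
    """Pair each source tree with an auto-translated, auto-parsed target side."""
    bitreebank = []
    for src_tokens, src_heads in source_treebank:
        tgt_tokens = translate(src_tokens)    # stand-in for an SMT system
        tgt_heads = parse_target(tgt_tokens)  # stand-in for a target-language parser
        bitreebank.append(
            {"src": (src_tokens, src_heads), "tgt": (tgt_tokens, tgt_heads)}
        )
    return bitreebank


def dependency_pairs(tokens: Tokens, heads: Heads) -> List[Pair]:
    """List (head word, dependent word) pairs for every non-root token."""
    return [(tokens[h], tokens[d]) for d, h in enumerate(heads) if h >= 0]


def is_verified(pair: Pair, counts: Counter, min_count: int = 2) -> bool:
    """Hypothetical check: trust a pair only if it is frequent in large auto-parsed data."""
    return counts[pair] >= min_count


# Toy stand-ins so the sketch runs end to end (not real SMT or parsing).
TOY_LEXICON = {"he": "il", "eats": "mange"}


def toy_translate(tokens: Tokens) -> Tokens:
    """Word-by-word dictionary lookup."""
    return [TOY_LEXICON[t] for t in tokens]


def toy_parse(tokens: Tokens) -> Heads:
    """Attach every token to the last token, which becomes the root."""
    return [len(tokens) - 1] * (len(tokens) - 1) + [-1]


bitreebank = build_auto_bitreebank(
    [(["he", "eats"], [1, -1])], toy_translate, toy_parse
)
tgt_tokens, tgt_heads = bitreebank[0]["tgt"]

# Assumed counts harvested from a large auto-parsed target corpus.
large_corpus_counts = Counter({("mange", "il"): 5})
reliable = [
    pair
    for pair in dependency_pairs(tgt_tokens, tgt_heads)
    if is_verified(pair, large_corpus_counts)
]
```

In this sketch the noisy target-side dependencies are filtered before they would be turned into bilingual features; a real system would replace the toy translator and parser with trained models.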

Keywords

speech processing; statistical analysis; trees (mathematics); SMT systems; autogenerated bilingual treebank; autoparsed trees; autotranslated sentences; bilingual constraints; bilingual features; bilingual text dependency parsing; bilingual unannotated data; bitext dependency parsing; bitext parser; human-annotated bilingual treebanks; monolingual parsers; monolingual treebank; statistical machine translation systems; Data mining; Educational institutions; Humans; Materials; Noise; Reliability; Training; Dependency parsing; natural language processing; statistical machine translation; unannotated data;

fLanguage

English

Journal_Title

Audio, Speech, and Language Processing, IEEE Transactions on

Publisher

IEEE

ISSN

1558-7916

Type

jour

DOI

10.1109/TASL.2011.2180898

Filename

6111269