DocumentCode :
1404913
Title :
Bitext Dependency Parsing With Auto-Generated Bilingual Treebank
Author :
Chen, Wenliang ; Kazama, Jun'ichi ; Zhang, Min ; Tsuruoka, Yoshimasa ; Zhang, Yujie ; Wang, Yiou ; Torisawa, Kentaro ; Li, Haizhou
Author_Institution :
Dept. of Human Language Technol., Inst. for Infocomm Res., Singapore, Singapore
Volume :
20
Issue :
5
fYear :
2012
fDate :
7/1/2012
Firstpage :
1461
Lastpage :
1472
Abstract :
This paper proposes a method to improve the accuracy of dependency parsing for bilingual texts (bitexts) by using an auto-generated bilingual treebank created with the help of statistical machine translation (SMT) systems. Previous bitext parsing methods use human-annotated bilingual treebanks, which are costly and difficult to obtain. In the proposed method, we use an auto-generated bilingual treebank to train the parsing models. First, an SMT system is used to translate a monolingual treebank into the target language; then, a monolingual parser for the target language is used to parse the translated sentences. Since the auto-translated sentences and auto-parsed trees in the auto-generated bilingual treebank are far from perfect, the bilingual constraints are not sufficiently reliable. To overcome this problem, we propose a method to verify the reliability of the constraints using a large amount of target monolingual and bilingual unannotated data. Finally, we design a set of effective bilingual features for parsing models on the basis of the verified constraints. We conduct experiments on standard test data. The experimental results show that our bitext parser significantly outperforms monolingual parsers. Moreover, our method still provides an improvement when we use a larger monolingual treebank containing over 50,000 sentences. We also test the proposed method with different SMT systems, and the results show that our method is robust to noise. In particular, the proposed method can be used in a purely monolingual setting with the help of SMT; that is, it does not need a human translation of the test set, as previous methods do.
Keywords :
speech processing; statistical analysis; trees (mathematics); SMT systems; autogenerated bilingual treebank; autoparsed trees; autotranslated sentences; bilingual constraints; bilingual features; bilingual text dependency parsing; bilingual unannotated data; bitext dependency parsing; bitext parser; human-annotated bilingual treebanks; monolingual parsers; monolingual treebank; statistical machine translation systems; Data mining; Educational institutions; Humans; Materials; Noise; Reliability; Training; Dependency parsing; natural language processing; statistical machine translation; unannotated data;
fLanguage :
English
Journal_Title :
Audio, Speech, and Language Processing, IEEE Transactions on
Publisher :
IEEE
ISSN :
1558-7916
Type :
jour
DOI :
10.1109/TASL.2011.2180898
Filename :
6111269