DocumentCode :
3060424
Title :
Web-based parallel corpora for statistical machine translation
Author :
Li, Bo ; Liu, Juan ; Shi, Wenjuan
Author_Institution :
Wuhan Univ., Wuhan
fYear :
2007
fDate :
13-15 Dec. 2007
Firstpage :
444
Lastpage :
449
Abstract :
Statistical machine translation is the state-of-the-art technique based on sentence-level aligned parallel corpora. Progress in this kind of technique is constrained by the lack of publicly available parallel corpora. The rapid growth of the World Wide Web offers a fair chance of constructing large-scale parallel corpora more easily. In this paper, we summarize the current strategies for fetching parallel corpora from the Web and classify them into three classes: structure-based, content-based, and hybrid. We compare these approaches and present some ideas that may be useful for improving the performance of the algorithms. In the discussion section, we raise some problems that should be considered in future research.
Keywords :
Internet; language translation; statistical analysis; Web-based parallel corpora; World Wide Web; content-based strategy; statistical machine translation; structure-based strategy; Application software; Computer science; Feeds; Law; Machine learning; Surface-mount technology; Uniform resource locators; Web pages; Web sites
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Machine Learning and Applications, 2007. ICMLA 2007. Sixth International Conference on
Conference_Location :
Cincinnati, OH
Print_ISBN :
978-0-7695-3069-7
Type :
conf
DOI :
10.1109/ICMLA.2007.24
Filename :
4457270