Automatic Construction of Web-Based English/Chinese Parallel Corpora

Author

Tan Bin ; Zhou Xu-yan

Author_Institution

Dept. of Comput., Jinggangshan Univ., Ji'an, China

fYear

2010

fDate

2-4 April 2010

Firstpage

114

Lastpage

117

Abstract

As the demand for global information increases significantly, multilingual corpora have become a valuable linguistic resource for cross-lingual information retrieval and natural language processing. This paper presents a technique for automatically constructing a Web-based English-Chinese parallel corpus, addressing the shortage of English-Chinese parallel corpora. First, web pages that may be translations of each other are mined from a particular source. Candidate parallel page pairs are then extracted according to the similarity of the pages' external characteristics, and each candidate pair is assessed with content-based methods. For this assessment, the paper designs ECVS, a bilingual text similarity model based on the classic vector space model.
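
The abstract does not give the internals of the ECVS model, so the following is only a minimal sketch of the general idea of comparing an English page and a Chinese page in a shared vector space. It assumes both texts are already tokenized, and it uses a hypothetical toy lexicon (`TOY_LEXICON`) and hypothetical function names; none of these come from the paper.

```python
import math
from collections import Counter

# Hypothetical toy English-to-Chinese lexicon (an assumption for illustration);
# a real system would load a full bilingual dictionary.
TOY_LEXICON = {
    "parallel": "平行",
    "corpus": "语料库",
    "web": "网络",
    "translation": "翻译",
}


def english_vector(tokens, lexicon):
    """Map English tokens into the Chinese term space and count frequencies."""
    return Counter(lexicon[t] for t in tokens if t in lexicon)


def chinese_vector(tokens, lexicon):
    """Count only the Chinese tokens that appear as lexicon translations."""
    vocabulary = set(lexicon.values())
    return Counter(t for t in tokens if t in vocabulary)


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # no shared vocabulary to compare
    return dot / (norm_a * norm_b)


def bilingual_similarity(en_tokens, zh_tokens, lexicon=TOY_LEXICON):
    """Score a candidate page pair; higher means more likely parallel."""
    return cosine(
        english_vector(en_tokens, lexicon),
        chinese_vector(zh_tokens, lexicon),
    )
```

A candidate pair would be kept as parallel when its score exceeds some threshold; the threshold value is a tuning choice that this record does not specify.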

Keywords

Internet; content-based retrieval; natural language processing; English-Chinese parallel corpora; Web pages; Web-based parallel corpora; automatic construction technology; bilingual text similarity; content-based methods; cross-lingual information retrieval; multilingual corpora; vector space model; Informatics; Information security; Information technology; Jacobi correlation coefficient; Parallel corpora; vector space

fLanguage

English

Publisher

IEEE

Conference_Titel

Intelligent Information Technology and Security Informatics (IITSI), 2010 Third International Symposium on

Conference_Location

Jinggangshan

Print_ISBN

978-1-4244-6730-3

Electronic_ISBN

978-1-4244-6743-3

Type

conf

DOI

10.1109/IITSI.2010.124

Filename

5453637