Automatic Acquisition of Large-Scale Academic Bilingual Parallel Corpus from the Web

Author

Yong, Han; Yu, Li; Xiaoning, He; Muyun, Yang; Guohua, Lei

Author_Institution

Comput. Sci. & Technol. Dept., Heilongjiang Inst. of Technol., Harbin, China

fYear

2009

fDate

7-9 Dec. 2009

Firstpage

318

Lastpage

321

Abstract

In this paper, we describe a system that automatically acquires a large-scale Chinese-English bilingual parallel corpus from the China Journals Full-text Database (CJFD), a component of the China National Knowledge Infrastructure (CNKI). The system obtains a large amount of parallel text with domain information from the structured bilingual texts already present in CJFD, such as the Chinese and English abstracts and titles of academic articles. The acquired Chinese-English parallel corpus is several orders of magnitude larger than comparable corpora known to us. In addition, the system collects a large number of bilingual terms that can be applied directly to lexical acquisition.

Keywords

Internet; data acquisition; linguistics; text analysis; China National Knowledge Infrastructure; Web; domain information; large-scale Chinese-English bilingual parallel corpus; lexical acquisition; parallel texts; Abstracts; Computer science; Data analysis; Data mining; Databases; Helium; Information analysis; Large-scale systems; Natural language processing; Web pages; bilingual parallel corpora acquisition; bilingual term acquisition

fLanguage

English

Publisher

IEEE

Conference_Titel

Asian Language Processing, 2009. IALP '09. International Conference on

Conference_Location

Singapore

Print_ISBN

978-0-7695-3904-1

Type

conf

DOI

10.1109/IALP.2009.75

Filename

5380757