Title :
Automatic Acquisition of Large-Scale Academic Bilingual Parallel Corpus from the Web
Author :
Yong, Han ; Yu, Li ; Xiaoning, He ; Muyun, Yang ; Guohua, Lei
Author_Institution :
Comput. Sci. & Technol. Dept., Heilongjiang Inst. of Technol., Harbin, China
Abstract :
In this paper, we describe a system that automatically acquires a large-scale Chinese-English bilingual parallel corpus from the China Journals Full-text Database (CJFD), a component of the China National Knowledge Infrastructure (CNKI). The system obtains a large amount of parallel text with domain information from the structured bilingual texts already present in CJFD, such as the Chinese and English abstracts and titles of academic articles. The acquired Chinese-English parallel corpus is several orders of magnitude larger than similar corpora known to us. In addition, the system collects a large number of bilingual terms that can be applied directly to lexical acquisition.
Keywords :
Internet; data acquisition; linguistics; text analysis; China National Knowledge Infrastructure; Web; domain information; large-scale Chinese-English bilingual parallel corpus; lexical acquisition; parallel texts; Abstracts; Computer science; Data analysis; Data mining; Databases; Helium; Information analysis; Large-scale systems; Natural language processing; Web pages; bilingual parallel corpora acquisition; bilingual term acquisition; data mining;
Conference_Title :
Asian Language Processing, 2009. IALP '09. International Conference on
Conference_Location :
Singapore
Print_ISBN :
978-0-7695-3904-1
DOI :
10.1109/IALP.2009.75