A Feasible Process For Mining Corpus From Web

Author

Chao Wang ; Dequan Zheng ; Tiejun Zhao ; Ji Guo

Author_Institution

MOE-MS Key Lab. of Natural Language Process. & Speech, Harbin Inst. of Technol., Harbin, China

Volume

9

fYear

2011

fDate

12-14 Aug. 2011

Abstract

Mining bilingual parallel sentence pairs from Web data is the most effective way to build a large-scale bilingual corpus. In this paper, we put forward a set of methods and a series of processes for extracting parallel sentence pairs from nonspecific Web data sources. Taking 1.1 billion pages as the Web data input, a sequence of steps yields a set of sentence pairs with 81% recall and 85% precision. On this basis, we introduce a parameter for measuring the quality of a sentence pair. After filtering the sentence pairs by this parameter, we obtain 850 thousand unique sentence pairs; precision increases to 95%, while recall decreases by only 1%.

Keywords

Internet; data mining; information retrieval; text analysis; Web data; bilingual corpus; bilingual parallel sentence pair mining; nonspecific Web data source; parallel sentence pair extraction; Accuracy; Data mining; Dictionaries; HTML; Patents; Radio access networks; Web pages;

fLanguage

English

Publisher

IEEE

Conference_Titel

Electronic and Mechanical Engineering and Information Technology (EMEIT), 2011 International Conference on

Conference_Location

Harbin, Heilongjiang

Print_ISBN

978-1-61284-087-1

Type

conf

DOI

10.1109/EMEIT.2011.6080758

Filename

6080758