DocumentCode
2332316
Title
A Feasible Process For Mining Corpus From Web
Author
Chao Wang ; Dequan Zheng ; Tiejun Zhao ; Ji Guo
Author_Institution
MOE-MS Key Lab. of Natural Language Process. & Speech, Harbin Inst. of Technol., Harbin, China
Volume
9
fYear
2011
fDate
12-14 Aug. 2011
Abstract
Mining bilingual parallel sentence pairs from Web data is the most effective way to obtain a large-scale bilingual corpus. In this paper, we put forward both a set of methods and a series of processes for extracting parallel sentence pairs from nonspecific Web data sources. Taking 1.1 billion pages as the Web data input, a sequence of steps yields sentence pairs with 81% recall and 85% precision. On this basis, we introduce a parameter for measuring the quality of a sentence pair. After filtering the sentence pairs by this parameter, we obtain 850 thousand unique sentence pairs; precision increases to 95%, while recall decreases by only 1%.
Keywords
Internet; data mining; information retrieval; text analysis; Web data; bilingual corpus; bilingual parallel sentence pair mining; nonspecific Web data source; parallel sentence pair extraction; Accuracy; Data mining; Dictionaries; HTML; Patents; Radio access networks; Web pages
fLanguage
English
Publisher
ieee
Conference_Titel
Electronic and Mechanical Engineering and Information Technology (EMEIT), 2011 International Conference on
Conference_Location
Harbin, Heilongjiang
Print_ISBN
978-1-61284-087-1
Type
conf
DOI
10.1109/EMEIT.2011.6080758
Filename
6080758