Title :
A Feasible Process For Mining Corpus From Web
Author :
Chao Wang ; Dequan Zheng ; Tiejun Zhao ; Ji Guo
Author_Institution :
MOE-MS Key Lab. of Natural Language Process. & Speech, Harbin Inst. of Technol., Harbin, China
Abstract :
Mining bilingual parallel sentence pairs from Web data is the most effective way to obtain a large-scale bilingual corpus. In this paper, we put forward a set of methods and a series of processing steps for extracting parallel sentence pairs from nonspecific Web data sources. Taking 1.1 billion pages as the Web data input, a sequence of steps yields sentence pairs with 81% recall and 85% precision. On this basis, we introduce a parameter for measuring the quality of a sentence pair. After filtering the sentence pairs by this parameter, we obtain 850 thousand unique sentence pairs; precision increases to 95%, while recall decreases by only 1%.
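The filtering step summarized in the abstract can be illustrated with a minimal sketch. The abstract does not define the quality parameter, so the `score` field, the 0.5 threshold, and the toy candidates below are all hypothetical; only the precision and recall definitions are standard.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    source: str        # sentence in the source language
    target: str        # sentence in the target language
    score: float       # hypothetical quality score (not the paper's parameter)
    is_parallel: bool  # gold label, e.g. from manual checking


def filter_by_score(candidates, threshold):
    """Keep only candidates whose score reaches the threshold."""
    return [c for c in candidates if c.score >= threshold]


def precision_recall(kept, total_true):
    """Precision = correct kept / kept; recall = correct kept / all true pairs."""
    true_kept = sum(1 for c in kept if c.is_parallel)
    precision = true_kept / len(kept) if kept else 0.0
    recall = true_kept / total_true if total_true else 0.0
    return precision, recall


# Toy data: four truly parallel pairs and two false ones (all invented).
candidates = [
    Candidate("src 1", "tgt 1", 0.9, True),
    Candidate("src 2", "tgt 2", 0.8, True),
    Candidate("src 3", "tgt 3", 0.7, True),
    Candidate("src 4", "tgt 4", 0.4, True),
    Candidate("src 5", "tgt 5", 0.3, False),
    Candidate("src 6", "tgt 6", 0.2, False),
]
TOTAL_TRUE = 4  # number of truly parallel pairs in the toy reference set

before = precision_recall(candidates, TOTAL_TRUE)
kept = filter_by_score(candidates, threshold=0.5)
after = precision_recall(kept, TOTAL_TRUE)
print("before filtering:", before)
print("after filtering: ", after)
```

In this toy set, raising the threshold removes both false pairs but also one true pair, so precision rises while recall falls. The abstract reports the same kind of trade-off, with a much smaller loss in recall.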
Keywords :
Internet; data mining; information retrieval; text analysis; Web data; bilingual corpus; bilingual parallel sentence pair mining; nonspecific Web data source; parallel sentence pair extraction; Accuracy; Data mining; Dictionaries; HTML; Patents; Radio access networks; Web pages;
Conference_Title :
Electronic and Mechanical Engineering and Information Technology (EMEIT), 2011 International Conference on
Conference_Location :
Harbin, Heilongjiang
Print_ISBN :
978-1-61284-087-1
DOI :
10.1109/EMEIT.2011.6080758