• DocumentCode
    2332316
  • Title

    A Feasible Process For Mining Corpus From Web

  • Author

    Chao Wang ; Dequan Zheng ; Tiejun Zhao ; Ji Guo

  • Author_Institution
    MOE-MS Key Lab. of Natural Language Process. & Speech, Harbin Inst. of Technol., Harbin, China
  • Volume
    9
  • fYear
    2011
  • fDate
    12-14 Aug. 2011
  • Abstract
    Mining bilingual parallel sentence pairs from Web data is the most effective way to obtain a large-scale bilingual corpus. In this paper, we put forward both a set of methods and a series of processes for extracting parallel sentence pairs from nonspecific Web data sources. Taking 1.1 billion pages as the Web data input, a sequence of steps yields sentence pairs with 81% recall and 85% precision. On this basis, we introduce a parameter for measuring the quality of a sentence pair. After filtering the sentence pairs by this parameter, we obtain 850 thousand unique sentence pairs; precision increases to 95%, while recall decreases by only 1%.
  • Keywords
    Internet; data mining; information retrieval; text analysis; Web data; bilingual corpus; bilingual parallel sentence pair mining; nonspecific Web data source; parallel sentence pair extraction; Accuracy; Data mining; Dictionaries; HTML; Patents; Radio access networks; Web pages;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Electronic and Mechanical Engineering and Information Technology (EMEIT), 2011 International Conference on
  • Conference_Location
    Harbin, Heilongjiang
  • Print_ISBN
    978-1-61284-087-1
  • Type

    conf

  • DOI
    10.1109/EMEIT.2011.6080758
  • Filename
    6080758