• DocumentCode
    2348686
  • Title

    A method of mining bilingual resources from Web Based on Maximum Frequent Sequential Pattern

  • Author

    Zhang, Guiping ; Luo, Yang ; Ji, Duo

  • Author_Institution
    Knowledge Eng. Res. Center, Shenyang Aerosp. Univ., Shenyang, China
  • fYear
    2010
  • fDate
    21-23 Aug. 2010
  • Firstpage
    1
  • Lastpage
    8
  • Abstract
    Bilingual resources are indispensable in NLP fields such as machine translation. The Internet holds a large amount of electronic text that can serve as a source for large-scale multilingual corpora, so mining large volumes of genuine bilingual resources from the Web is a feasible approach. This paper proposes a method of mining bilingual resources from the Web based on the Maximum Frequent Sequential Pattern. The method uses a heuristic approach to search for and filter candidate bilingual web pages, then mines maximum frequent sequential patterns, and uses a machine learning method to extend the pattern base and to verify bilingual resources according to the Japanese-to-Chinese word proportion. The experimental results indicate that the method extracts bilingual resources efficiently, with a precision of over 90%.
  • Keywords
    Internet; data mining; language translation; natural language processing; Japanese to Chinese word proportion; NLP fields; bilingual Web pages; bilingual resources mining; machine translation; maximum frequent sequential pattern; multilanguage corpus; Aerospace engineering; Artificial neural networks; Information filters; Knowledge engineering; Bilingual corpus; Maximum Frequent Sequential Pattern; Pattern base; Web mining
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Natural Language Processing and Knowledge Engineering (NLP-KE), 2010 International Conference on
  • Conference_Location
    Beijing
  • Print_ISBN
    978-1-4244-6896-6
  • Type

    conf

  • DOI
    10.1109/NLPKE.2010.5587831
  • Filename
    5587831