• DocumentCode
    475320
  • Title

    Naïve Bayes based language-specific web crawling

  • Author

    Srisukha, Ekkasit ; Jinarat, Supakpong ; Haruechaiyasak, Choochart ; Rungsawang, Arnon

  • Author_Institution
    Dept. of Comput. Eng., Kasetsart Univ., Bangkok
  • Volume
    1
  • fYear
    2008
  • fDate
    14-17 May 2008
  • Firstpage
    113
  • Lastpage
    116
  • Abstract
    In this paper, we propose Thai language-specific Web crawling, a method for selectively seeking out Web pages written in Thai. The strategy is to follow the URL with the highest probability of leading to Thai Web pages. The probability score is calculated from an example set of Web pages using a simple Naive Bayes approach. In addition, we use a heuristic-based method to favor URLs whose hosts have previously provided Thai Web pages. An experiment showed that the proposed method produces a high harvest rate and achieves better coverage than the other methods.
  • Keywords
    Bayes methods; Internet; natural language processing; optimisation; probability; search engines; Naive Bayes; Thai language specific Web crawling; URL; Web pages; heuristic based method; probability score; Crawlers; Encoding; Grid computing; HTML; Internet; Knowledge engineering; Probability; Research and development; Uniform resource locators; Web pages; Thai language; language identification; priority queue ordering; web archive; web crawler; web crawling;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology, 2008. ECTI-CON 2008. 5th International Conference on
  • Conference_Location
    Krabi
  • Print_ISBN
    978-1-4244-2101-5
  • Electronic_ISBN
    978-1-4244-2102-2
  • Type

    conf

  • DOI
    10.1109/ECTICON.2008.4600385
  • Filename
    4600385
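
The crawling strategy summarized in the abstract (score each candidate URL with Naive Bayes, visit the highest-scoring URL first, and favor hosts that have already yielded Thai pages) can be illustrated with a minimal sketch. This is not the authors' implementation. The abstract does not state which features, smoothing, or bias value the paper uses, so the URL tokenizer, the add-one smoothing, the `host_bonus` value, and the names `NaiveBayesUrlScorer` and `Frontier` are all assumptions made for illustration.

```python
import heapq
import math
import re
from collections import Counter
from urllib.parse import urlparse


def tokens(url):
    """Split a URL into lowercase alphanumeric tokens (assumed feature choice)."""
    return re.findall(r"[a-z0-9]+", url.lower())


class NaiveBayesUrlScorer:
    """Two-class multinomial Naive Bayes over URL tokens, add-one smoothing."""

    def __init__(self, examples):
        # examples: iterable of (url, is_thai) pairs from a labeled example set
        self.counts = {True: Counter(), False: Counter()}
        self.docs = {True: 0, False: 0}
        for url, is_thai in examples:
            self.docs[is_thai] += 1
            self.counts[is_thai].update(tokens(url))
        self.vocab = set(self.counts[True]) | set(self.counts[False])

    def _log_joint(self, url, label):
        total_docs = self.docs[True] + self.docs[False]
        log_p = math.log((self.docs[label] + 1) / (total_docs + 2))
        # The extra +1 reserves probability mass for unseen tokens.
        denom = sum(self.counts[label].values()) + len(self.vocab) + 1
        for tok in tokens(url):
            log_p += math.log((self.counts[label][tok] + 1) / denom)
        return log_p

    def prob_thai(self, url):
        """Posterior probability that the URL leads to a Thai page."""
        a = self._log_joint(url, True)
        b = self._log_joint(url, False)
        m = max(a, b)  # subtract the max for numerical stability
        ea, eb = math.exp(a - m), math.exp(b - m)
        return ea / (ea + eb)


class Frontier:
    """Priority queue of URLs ordered by score, highest first."""

    def __init__(self, scorer, host_bonus=0.2):
        self.scorer = scorer
        self.host_bonus = host_bonus  # assumed value for the host heuristic
        self.thai_hosts = set()
        self.heap = []
        self.seen = set()
        self.counter = 0  # tie-breaker: earlier pushes come out first

    def record_thai_page(self, url):
        """Remember that this URL's host has served a Thai page."""
        self.thai_hosts.add(urlparse(url).netloc)

    def push(self, url):
        if url in self.seen:
            return
        self.seen.add(url)
        score = self.scorer.prob_thai(url)
        if urlparse(url).netloc in self.thai_hosts:
            score += self.host_bonus
        # heapq is a min-heap, so store the negated score.
        heapq.heappush(self.heap, (-score, self.counter, url))
        self.counter += 1

    def pop(self):
        return heapq.heappop(self.heap)[2]


if __name__ == "__main__":
    # Hypothetical training examples; a real example set would be far larger.
    examples = [
        ("http://www.example.co.th/news", True),
        ("http://thai.example.co.th/home", True),
        ("http://www.example.com/news", False),
        ("http://www.example.org/about", False),
    ]
    frontier = Frontier(NaiveBayesUrlScorer(examples))
    frontier.push("http://site.com/page")
    frontier.push("http://site.co.th/page")
    print(frontier.pop())
```

In this sketch the host bonus is applied only when a URL is pushed, so URLs already in the queue are not re-scored after their host is later found to serve Thai pages. A fuller crawler would also need page fetching, link extraction, and a language identification step to decide when to call `record_thai_page`.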