Title :
Naïve Bayes based language-specific web crawling
Author :
Srisukha, Ekkasit ; Jinarat, Supakpong ; Haruechaiyasak, Choochart ; Rungsawang, Arnon
Author_Institution :
Dept. of Comput. Eng., Kasetsart Univ., Bangkok
Abstract :
In this paper, we propose Thai language-specific Web crawling, a method for selectively seeking out Web pages written in Thai. The strategy is to follow the URL with the highest probability of leading to Thai Web pages. The probability score is calculated from an example set of Web pages using a simple Naive Bayes approach. In addition, we use a heuristic-based method to favor URLs whose hosts have previously provided Thai Web pages. An experiment showed that the proposed method produces a high harvest rate and achieves better coverage than the other methods.
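The strategy described in the abstract can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: the abstract does not say which features the Naive Bayes score uses, so the example assumes simple URL tokens, add-one smoothing, and a fixed additive host bonus. The names `UrlNaiveBayes`, `Frontier`, and `host_bonus`, and the example.* URLs, are hypothetical.

```python
import heapq
import math
import re
from collections import Counter
from urllib.parse import urlparse


def tokens(url):
    """Split a URL into lowercase alphanumeric tokens (an assumed feature set)."""
    return re.findall(r"[a-z0-9]+", url.lower())


class UrlNaiveBayes:
    """Two-class multinomial Naive Bayes over URL tokens, add-one smoothing."""

    def __init__(self, thai_urls, other_urls):
        self.docs = {"thai": len(thai_urls), "other": len(other_urls)}
        self.counts = {"thai": Counter(), "other": Counter()}
        for label, urls in (("thai", thai_urls), ("other", other_urls)):
            for url in urls:
                self.counts[label].update(tokens(url))
        self.vocab = set(self.counts["thai"]) | set(self.counts["other"])

    def _log_joint(self, label, toks):
        # log P(label) + sum of log P(token | label)
        total_tokens = sum(self.counts[label].values())
        score = math.log(self.docs[label] / sum(self.docs.values()))
        for t in toks:
            score += math.log(
                (self.counts[label][t] + 1) / (total_tokens + len(self.vocab))
            )
        return score

    def prob_thai(self, url):
        """Posterior probability that the URL leads to a Thai page."""
        toks = [t for t in tokens(url) if t in self.vocab]  # skip unseen tokens
        a = self._log_joint("thai", toks)
        b = self._log_joint("other", toks)
        m = max(a, b)  # subtract the max for numerical stability
        return math.exp(a - m) / (math.exp(a - m) + math.exp(b - m))


class Frontier:
    """Priority queue of URLs ordered by score, highest first."""

    def __init__(self, model, host_bonus=0.2):
        self.model = model
        self.host_bonus = host_bonus  # assumed size of the host heuristic
        self.thai_hosts = set()
        self._heap = []
        self._seen = set()

    def record_thai_page(self, url):
        """Remember that this URL's host has served a Thai page."""
        self.thai_hosts.add(urlparse(url).netloc)

    def score(self, url):
        s = self.model.prob_thai(url)
        if urlparse(url).netloc in self.thai_hosts:
            s += self.host_bonus
        return s

    def push(self, url):
        if url not in self._seen:
            self._seen.add(url)
            # heapq is a min-heap, so store the negated score
            heapq.heappush(self._heap, (-self.score(url), url))

    def pop(self):
        return heapq.heappop(self._heap)[1]
```

A crawler built on this sketch would pop the best URL, fetch it, call `record_thai_page` if the page is identified as Thai, and push the extracted links. Scores are fixed at push time here; re-scoring queued URLs after a host is marked as Thai is a further design choice the abstract does not settle.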
Keywords :
Bayes methods; Internet; natural language processing; optimisation; probability; search engines; Naive Bayes; Thai language specific Web crawling; URL; Web pages; heuristic based method; probability score; Crawlers; Encoding; Grid computing; HTML; Knowledge engineering; Research and development; Uniform resource locators; Thai language; language identification; priority queue ordering; web archive; web crawler; web crawling
Conference_Title :
Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology, 2008. ECTI-CON 2008. 5th International Conference on
Conference_Location :
Krabi
Print_ISBN :
978-1-4244-2101-5
Electronic_ISBN :
978-1-4244-2102-2
DOI :
10.1109/ECTICON.2008.4600385