DocumentCode
475320
Title
Naïve Bayes based language-specific web crawling
Author
Srisukha, Ekkasit ; Jinarat, Supakpong ; Haruechaiyasak, Choochart ; Rungsawang, Arnon
Author_Institution
Dept. of Comput. Eng., Kasetsart Univ., Bangkok
Volume
1
fYear
2008
fDate
14-17 May 2008
Firstpage
113
Lastpage
116
Abstract
In this paper, we propose Thai language-specific Web crawling as a method of selectively seeking out Web pages written in Thai. The strategy is to follow the URL with the highest probability of leading to Thai Web pages. The probability score is calculated from an example set of Web pages using a simple Naive Bayes approach. In addition, we use a heuristic-based method to bias the ordering toward URLs whose hosts have previously provided Thai Web pages. An experiment showed that the proposed method produces a high harvest rate and achieves better coverage than the other methods.
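The strategy summarized in the abstract can be sketched roughly as a priority-queue frontier ordered by a Naive Bayes score. The abstract does not say which features the classifier uses or how strong the host bias is, so everything specific here is an assumption made for illustration: the class names `NaiveBayesScorer` and `Frontier`, the use of generic string tokens as features, the add-one smoothing, and the `host_bonus` value of 0.2. It is a minimal sketch of the general idea, not the authors' implementation.

```python
import heapq
import math
from collections import Counter
from urllib.parse import urlparse


class NaiveBayesScorer:
    """Two-class Naive Bayes over string tokens (hypothetical feature choice)."""

    def __init__(self):
        # Token and document counts for target (True) and non-target (False) pages.
        self.token_counts = {True: Counter(), False: Counter()}
        self.doc_counts = {True: 0, False: 0}

    def train(self, tokens, is_target):
        self.token_counts[is_target].update(tokens)
        self.doc_counts[is_target] += 1

    def _log_joint(self, tokens, label):
        counts = self.token_counts[label]
        vocab = set(self.token_counts[True]) | set(self.token_counts[False])
        total = sum(counts.values())
        # Add-one smoothing; the extra +1 reserves mass for unseen tokens.
        denom = total + len(vocab) + 1
        prior = (self.doc_counts[label] + 1) / (sum(self.doc_counts.values()) + 2)
        return math.log(prior) + sum(
            math.log((counts[t] + 1) / denom) for t in tokens
        )

    def probability(self, tokens):
        """Posterior probability that the tokens belong to the target class."""
        pos = self._log_joint(tokens, True)
        neg = self._log_joint(tokens, False)
        m = max(pos, neg)  # subtract the max for numerical stability
        e_pos, e_neg = math.exp(pos - m), math.exp(neg - m)
        return e_pos / (e_pos + e_neg)


class Frontier:
    """URL frontier that pops the highest-scoring URL first."""

    def __init__(self, scorer, host_bonus=0.2):  # bonus value is an assumption
        self.scorer = scorer
        self.host_bonus = host_bonus
        self.good_hosts = set()  # hosts that already served target-language pages
        self._heap = []
        self._seen = set()
        self._counter = 0  # tie-breaker so equal scores pop in insertion order

    def add(self, url, tokens):
        if url in self._seen:
            return
        self._seen.add(url)
        score = self.scorer.probability(tokens)
        if urlparse(url).hostname in self.good_hosts:
            score += self.host_bonus  # heuristic bias toward productive hosts
        heapq.heappush(self._heap, (-score, self._counter, url))
        self._counter += 1

    def pop(self):
        neg_score, _, url = heapq.heappop(self._heap)
        return url, -neg_score

    def report_target_page(self, url):
        """Record that this URL's host delivered a target-language page."""
        self.good_hosts.add(urlparse(url).hostname)
```

In use, a crawler would train the scorer on labeled example pages, call `add` for each extracted link, call `pop` to choose the next URL to fetch, and call `report_target_page` whenever a fetched page turns out to be in the target language.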
Keywords
Bayes methods; Internet; natural language processing; optimisation; probability; search engines; Naive Bayes; Thai language specific Web crawling; URL; Web pages; heuristic based method; probability score; Crawlers; Encoding; Grid computing; HTML; Internet; Knowledge engineering; Probability; Research and development; Uniform resource locators; Web pages; Thai language; language identification; priority queue ordering; web archive; web crawler; web crawling;
fLanguage
English
Publisher
IEEE
Conference_Titel
Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology, 2008. ECTI-CON 2008. 5th International Conference on
Conference_Location
Krabi
Print_ISBN
978-1-4244-2101-5
Electronic_ISBN
978-1-4244-2102-2
Type
conf
DOI
10.1109/ECTICON.2008.4600385
Filename
4600385