DocumentCode
3088492
Title
A Machine Learning Based Language Specific Web Site Crawler
Author
Tadapak, Punnawat ; Suebchua, Thanaphon ; Rungsawang, Arnon
Author_Institution
Dept. of Comput. Eng., Kasetsart Univ., Bangkok, Thailand
fYear
2010
fDate
14-16 Sept. 2010
Firstpage
155
Lastpage
161
Abstract
We propose an approach for gathering web pages written in a specific language. The approach consists of a language predictor and a web site crawler. The language predictor is a machine-learning-based component that learns characteristics of relevant hosts from an example host graph and uses them to compute the language degree of a web server, that is, how likely the server is to serve web pages written in the target language. The site crawler, in turn, downloads web pages from a prioritized list of relevant servers. We evaluate crawling performance in terms of coverage and harvest rate. Preliminary experiments on a Thai web data set show promising results compared with traditional language-specific crawling methods recently proposed in the literature.
Keywords
Internet; Web sites; information retrieval; learning (artificial intelligence); natural language interfaces; Web page; language predictor; language specific Web site crawler; machine learning; Crawlers; Feature extraction; Testing; Web pages; Web server; Language-specific web crawler; Machine-Learning; Web site crawler;
fLanguage
English
Publisher
ieee
Conference_Titel
Network-Based Information Systems (NBiS), 2010 13th International Conference on
Conference_Location
Takayama
ISSN
2157-0418
Print_ISBN
978-1-4244-8053-1
Electronic_ISBN
2157-0418
Type
conf
DOI
10.1109/NBiS.2010.25
Filename
5635898