Title :
A high-precision forum crawler based on vertical crawling
Author :
Gao, Qing ; Xiao, Bo ; Lin, Zhiqing ; Chen, Xiyao ; Zhou, Bing
Author_Institution :
Pattern Recognition & Intell. Syst. Lab., Beijing Univ. of Posts & Telecommun., Beijing, China
Abstract :
In this paper, we present a specialized crawler for Internet forums. Unlike general and focused crawlers, it extracts structured information directly, obtains the most valuable Web resources while using minimal system resources, filters out useless information to the greatest extent possible, and finally supplies users with high-precision information. The crawler adopts a template-based processing method that uses regular expressions to extract structured information. The URL queue is initialized with the URLs listed in a seeds file, and valuable URLs extracted from Web pages are added to the queue during the crawling process. When a post's time falls outside the specified time span, or the Web information is unchanged, the crawler skips it promptly to avoid wasting system resources. Experimental results demonstrate that our crawler collects real-time forum information more efficiently and precisely than other crawlers.
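The crawl loop outlined in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the regular-expression template, the HTML layout it matches, the `fetch` callback, the in-memory `PAGES` data, and the use of a SHA-256 digest to detect unchanged pages are all assumptions made for illustration only.

```python
import hashlib
import re
from collections import deque
from datetime import datetime, timedelta

# Hypothetical template for one forum layout: a regex with named groups
# that pulls the post URL, timestamp, and title out of a listing page.
POST_TEMPLATE = re.compile(
    r'<a class="post" href="(?P<url>[^"]+)" '
    r'data-time="(?P<time>[^"]+)">(?P<title>[^<]+)</a>'
)

def crawl(seeds, fetch, now, time_span, seen_hashes):
    """Breadth-first crawl that skips unchanged pages and stale posts."""
    queue = deque(seeds)          # URL queue initialized from the seeds
    visited = set()
    records = []
    while queue:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        html = fetch(url)
        if html is None:
            continue
        # Assumed change check: compare a content digest with the last crawl.
        digest = hashlib.sha256(html.encode("utf-8")).hexdigest()
        if seen_hashes.get(url) == digest:
            continue              # page unchanged, skip it
        seen_hashes[url] = digest
        for m in POST_TEMPLATE.finditer(html):
            posted = datetime.fromisoformat(m["time"])
            if now - posted > time_span:
                continue          # post is outside the time span, skip it
            records.append(
                {"url": m["url"], "title": m["title"].strip(), "time": posted}
            )
            queue.append(m["url"])  # valuable URL goes back into the queue
    return records

# Hypothetical in-memory "forum" so the sketch runs without network access.
PAGES = {
    "http://forum.example/board": (
        '<a class="post" href="http://forum.example/t/1" '
        'data-time="2009-11-05T10:00:00">Fresh thread</a>\n'
        '<a class="post" href="http://forum.example/t/2" '
        'data-time="2009-01-01T09:00:00">Old thread</a>'
    ),
    "http://forum.example/t/1": "<p>thread body</p>",
}

def fetch_from_memory(url):
    return PAGES.get(url)
```

Calling `crawl(["http://forum.example/board"], fetch_from_memory, datetime(2009, 11, 6), timedelta(days=7), {})` would keep only the recent post. A second call that reuses the same `seen_hashes` dictionary would return nothing, because every page digest is unchanged.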
Keywords :
Internet; Web sites; search engines; Internet forums; URL queue; Web pages; Web resources; high-precision forum crawler; least system resources; structured information extraction; template-based processing method; vertical crawling; Blogs; Crawlers; Data mining; Discussion forums; Information filtering; Information filters; Uniform resource locators; forum; high-precision; structured information; template; vertical crawler
Conference_Title :
Network Infrastructure and Digital Content, 2009. IC-NIDC 2009. IEEE International Conference on
Conference_Location :
Beijing
Print_ISBN :
978-1-4244-4898-2
Electronic_ISBN :
978-1-4244-4900-6
DOI :
10.1109/ICNIDC.2009.5360990