DocumentCode
2783830
Title
A high-precision forum crawler based on vertical crawling
Author
Gao, Qing ; Xiao, Bo ; Lin, Zhiqing ; Chen, Xiyao ; Zhou, Bing
Author_Institution
Pattern Recognition & Intell. Syst. Lab., Beijing Univ. of Posts & Telecommun., Beijing, China
fYear
2009
fDate
6-8 Nov. 2009
Firstpage
362
Lastpage
367
Abstract
In this paper, we present a specialized crawler for Internet forums. Unlike general and focused crawlers, it extracts structured information directly, obtains the most valuable Web resources while using minimal system resources, filters out useless information to the maximum extent, and finally supplies users with high-precision information. The crawler adopts a template-based processing method that uses regular expressions to extract structured information. The URL queue is initialized with the URLs listed in a seeds file, and valuable URLs are extracted from Web pages and added to the queue during the crawling process. When the time of a post falls outside the specified time span, or the Web information is unchanged, the crawler skips it promptly to avoid wasting system resources. Experimental results demonstrate that our crawler can collect real-time forum information more efficiently and precisely than other crawlers.
Keywords
Internet; Web sites; search engines; Internet forums; URL queue; Web pages; Web resources; high-precision forum crawler; least system resources; structured information extraction; template-based processing method; vertical crawling; Blogs; Crawlers; Data mining; Discussion forums; Information filtering; Information filters; Internet; Search engines; Uniform resource locators; Web pages; forum; high-precision; structured information; template; vertical crawler;
fLanguage
English
Publisher
IEEE
Conference_Titel
Network Infrastructure and Digital Content, 2009. IC-NIDC 2009. IEEE International Conference on
Conference_Location
Beijing
Print_ISBN
978-1-4244-4898-2
Electronic_ISBN
978-1-4244-4900-6
Type
conf
DOI
10.1109/ICNIDC.2009.5360990
Filename
5360990