  • DocumentCode
    2783830
  • Title
    A high-precision forum crawler based on vertical crawling
  • Author
    Gao, Qing ; Xiao, Bo ; Lin, Zhiqing ; Chen, Xiyao ; Zhou, Bing
  • Author_Institution
    Pattern Recognition & Intell. Syst. Lab., Beijing Univ. of Posts & Telecommun., Beijing, China
  • fYear
    2009
  • fDate
    6-8 Nov. 2009
  • Firstpage
    362
  • Lastpage
    367
  • Abstract
    In this paper, we present a specialized crawler for Internet forums. Unlike general and focused crawlers, it extracts structured information directly, obtains the most valuable Web resources while using minimal system resources, filters out useless information to the greatest extent possible, and ultimately supplies users with high-precision information. The crawler adopts a template-based processing method that uses regular expressions to extract structured information. The URL queue is initialized with the URLs listed in a seeds file, and valuable URLs extracted from Web pages are added to the queue during crawling. When a post's timestamp falls outside the specified time span, or when the Web information is unchanged, the crawler skips it promptly to avoid wasting system resources. Experimental results demonstrate that our crawler collects real-time forum information more efficiently and precisely than other crawlers.
  • Keywords
    Internet; Web sites; search engines; Internet forums; URL queue; Web pages; Web resources; high-precision forum crawler; least system resources; structured information extraction; template-based processing method; vertical crawling; Blogs; Crawlers; Data mining; Discussion forums; Information filtering; Information filters; Uniform resource locators; forum; high-precision; structured information; template; vertical crawler
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Network Infrastructure and Digital Content, 2009. IC-NIDC 2009. IEEE International Conference on
  • Conference_Location
    Beijing
  • Print_ISBN
    978-1-4244-4898-2
  • Electronic_ISBN
    978-1-4244-4900-6
  • Type
    conf
  • DOI
    10.1109/ICNIDC.2009.5360990
  • Filename
    5360990
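
The crawling loop outlined in the abstract (a seeded URL queue, regular-expression templates, and skipping of stale or unchanged content) might be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the HTML layout, both regular expressions, the `crawl` function and its parameters, the hash-based change check, and the sample `PAGES` data are all assumptions introduced here, since the record does not describe them.

```python
import hashlib
import re
from collections import deque
from datetime import datetime, timedelta

# Hypothetical template: one regex with named groups per forum layout.
POST_TEMPLATE = re.compile(
    r'<div class="post">\s*<span class="author">(?P<author>.*?)</span>\s*'
    r'<span class="time">(?P<time>\d{4}-\d{2}-\d{2})</span>\s*'
    r'<p>(?P<body>.*?)</p>\s*</div>',
    re.S,
)
# Hypothetical rule for which links count as "valuable" (thread pages only).
LINK_TEMPLATE = re.compile(r'href="(?P<url>/thread/\d+)"')

# Made-up sample pages standing in for a real forum; no network access.
PAGES = {
    "/board": '<a href="/thread/1">t1</a> <a href="/thread/2">t2</a>',
    "/thread/1": '<div class="post"><span class="author">alice</span>'
                 '<span class="time">2009-11-05</span><p>fresh post</p></div>',
    "/thread/2": '<div class="post"><span class="author">bob</span>'
                 '<span class="time">2009-01-01</span><p>stale post</p></div>',
}


def crawl(seeds, fetch, now, span_days, seen_hashes):
    """Return structured posts no older than span_days before now."""
    queue = deque(seeds)          # URL queue initialized from the seeds
    visited = set()
    cutoff = now - timedelta(days=span_days)
    posts = []
    while queue:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        html = fetch(url)
        if html is None:
            continue
        # Assumed change check: skip pages whose content hash is unchanged.
        digest = hashlib.sha256(html.encode("utf-8")).hexdigest()
        if seen_hashes.get(url) == digest:
            continue
        seen_hashes[url] = digest
        # Template-based extraction of structured fields.
        for match in POST_TEMPLATE.finditer(html):
            posted = datetime.strptime(match.group("time"), "%Y-%m-%d")
            if posted < cutoff:
                continue  # post is outside the specified time span
            posts.append({
                "url": url,
                "author": match.group("author"),
                "time": posted,
                "body": match.group("body"),
            })
        # Add newly discovered thread URLs to the queue.
        for match in LINK_TEMPLATE.finditer(html):
            queue.append(match.group("url"))
    return posts


if __name__ == "__main__":
    seen = {}
    for post in crawl(["/board"], PAGES.get, datetime(2009, 11, 8), 7, seen):
        print(post["url"], post["author"], post["body"])
```

With the sample data, only the post dated within the seven-day window is kept, and a second run with the same `seen` dictionary does no further work because the seed page's hash is unchanged. A real crawler would also need HTTP fetching, per-site templates, and politeness controls, none of which are shown here.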