DocumentCode
160998
Title
An Effective Forum Crawler
Author
Sreeja, S.R. ; Chaudhari, Sneha
Author_Institution
Dept. of Comput. Sci. & Eng., A.C. Patil Coll. of Eng., Navi Mumbai, India
fYear
2014
fDate
4-5 April 2014
Firstpage
230
Lastpage
234
Abstract
Web forums, or Internet forums, provide a space for users to share, discuss, and request information. They are sources of large amounts of structured, rapidly changing information, so crawling them requires specialized software; a generic deep Web crawler or a focused crawler cannot be used for this purpose. In this paper, we propose an effective Web crawler designed specifically for Internet forums. This Forum Crawler overcomes the drawbacks of many existing forum crawlers. Given any page of a forum site, it can detect the site's Entry URL, and starting the crawl from the Entry URL increases coverage. The entire process is divided into a learning part and an online crawling part. The learning part classifies the URLs in the forum site into four categories: Index URL, Thread URL, Index-Page-Turning URL, and Thread-Page-Turning URL. The Forum Crawler can detect these URLs even when they are JavaScript-based, which most existing forum crawlers cannot do. For online crawling, it uses a Freshness First Strategy rather than BFS (Breadth First Strategy), which is advantageous when only limited system resources are available.
Keywords
Internet; Web sites; information retrieval; tree searching; BFS; Entry URL; Internet forum crawler; JavaScript-based URLs; Web Forum crawling; Web forum crawler; breadth first strategy; focused crawler; forum site; freshness first strategy; generic deep Web crawler; index URL; index-page-turning URL; information requesting; information sharing; online crawling; thread URL; thread-page-turning URL; Crawlers; Indexes; Information technology; Kernel; Web pages; URL type; crawling strategy; forum crawling; page classification
fLanguage
English
Publisher
IEEE
Conference_Titel
Circuits, Systems, Communication and Information Technology Applications (CSCITA), 2014 International Conference on
Conference_Location
Mumbai
Type
conf
DOI
10.1109/CSCITA.2014.6839264
Filename
6839264