Title :
Learning based web crawl forum
Author :
Hemakumar, K. ; Prakash, B.
Abstract :
The main objective of this project is to crawl relevant forum content from the web with minimal overhead. Forum threads usually contain the information content that forum crawlers target. The proposed system learns URL patterns across multiple sites and automatically finds a forum's entry page given any page from the forum. Forums differ in layout and style, and a generic crawler that blindly follows duplicate links and uninformative pages will crawl duplicate pages. The test results show that the proposed system achieves effectiveness and coverage on a large set of test forums.
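As a rough illustration of what learning URL patterns can mean in practice, the sketch below groups forum URLs by a coarse pattern. It is not the authors' method: the function names `url_pattern` and `group_by_pattern`, the `{num}` placeholder, and the example URLs are all hypothetical, and the generalization rule (digit runs collapsed, query values dropped) is an assumption chosen for brevity.

```python
import re
from collections import defaultdict
from urllib.parse import urlsplit, parse_qsl


def url_pattern(url: str) -> str:
    """Reduce a URL to a coarse pattern (hypothetical rule, not from the paper).

    Digit runs in the path become "{num}", and the query string is reduced
    to its sorted parameter names, so URLs that differ only in IDs or page
    numbers map to the same pattern.
    """
    parts = urlsplit(url)
    path = re.sub(r"\d+", "{num}", parts.path)
    keys = sorted(k for k, _ in parse_qsl(parts.query, keep_blank_values=True))
    return path + ("?" + "&".join(keys) if keys else "")


def group_by_pattern(urls):
    """Group URLs that share the same coarse pattern."""
    groups = defaultdict(list)
    for url in urls:
        groups[url_pattern(url)].append(url)
    return dict(groups)


if __name__ == "__main__":
    sample = [
        "http://forum.example.com/thread-12-1.html",
        "http://forum.example.com/thread-340-2.html",
        "http://forum.example.com/viewtopic.php?t=5&page=2",
    ]
    for pattern, members in group_by_pattern(sample).items():
        print(pattern, len(members))
```

A crawler could use such groups to decide which link types are worth following (for example, thread pages) and which are likely duplicates or uninformative, though the classification step itself is not shown here.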
Keywords :
data mining; social networking (online); URL patterns; duplicate links; forum content; forum crawlers; forum entry page; forum layouts; forum styles; forum threads; generic crawler; information content; learning based Web crawl forum; uninformative page; Crawlers; Data mining; Educational institutions; Feature extraction; Indexes; Internet; Uniform resource locators; EIT path; ITF regex; URL type; forum crawling; page classification; page type
Conference_Title :
Information Communication and Embedded Systems (ICICES), 2014 International Conference on
Conference_Location :
Chennai
Print_ISBN :
978-1-4799-3835-3
DOI :
10.1109/ICICES.2014.7033889