DocumentCode :
2948958
Title :
Intelligent crawler for web forums based on improved regular expressions
Author :
Pavkovic, Milos ; Protic, Jelica
Author_Institution :
Sch. of Electr. Eng., Univ. of Belgrade, Belgrade, Serbia
fYear :
2013
fDate :
26-28 Nov. 2013
Firstpage :
817
Lastpage :
820
Abstract :
In this paper, we present the development and characteristics of a specialized Web-scale forum crawler. The main idea is to crawl relevant forum content from the Web with minimal server resource consumption and to organize the crawled content into logical units that are easier to process and analyze. Forum posts contain the information that is of interest to a forum crawler. Although forums differ in design and are built on different technologies, they share the same logical navigation structure, which connects the homepage to individual posts through forum lists and threads by means of specific URLs. Exploiting this common implicit navigation, we have reduced the Web crawling problem to a URL-type recognition problem. URL-type recognition is achieved with a URL-type database and regular expressions. These regular expressions are extended with special custom characters and commands, which give this forum crawler an advantage over other Web-based crawlers. The results presented in this paper were obtained by crawling a set of Web forums that differ in technology, location, and design. Each test compared the results of a standard Web-based crawler with those of our specialized forum crawler. Our test results show that by crawling only specific data and URL paths on the forum, we reduced both crawling time and server resource consumption.
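The URL-type recognition idea in the abstract can be illustrated with a minimal sketch. It uses plain Python regular expressions, not the extended syntax with custom characters and commands that the paper describes, and the URL patterns, the type names, and the functions `classify_url` and `should_crawl` are all hypothetical, chosen for an imagined forum whose URLs look like `viewforum.php?f=3` and `viewtopic.php?t=42`.

```python
import re
from urllib.parse import urlparse

# Hypothetical URL-type table for an imagined forum layout.
# These patterns are assumptions for illustration only; they are
# not taken from the paper's URL-type database.
URL_TYPES = [
    ("thread", re.compile(r"^/viewtopic\.php\?(?:.*&)?t=\d+")),
    ("board", re.compile(r"^/viewforum\.php\?(?:.*&)?f=\d+")),
    ("homepage", re.compile(r"^/(?:index\.php)?$")),
]


def classify_url(url):
    """Return the name of the first matching URL type, or None."""
    parts = urlparse(url)
    # Match against path plus query string, ignoring scheme and host.
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    for name, pattern in URL_TYPES:
        if pattern.search(target):
            return name
    return None


def should_crawl(url):
    """Follow only URLs on the homepage -> board -> thread path."""
    return classify_url(url) is not None
```

Under these assumptions, a crawler would call `should_crawl` on each extracted link and skip anything that is not a recognized URL type, such as member lists or login pages, which is one way to avoid fetching pages that carry no post content.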
Keywords :
Web sites; information retrieval; knowledge based systems; search engines; URL paths; URL-type database; URL-type recognition problem; Web-scale forum crawler characteristics; Web-scale forum crawler development; custom characters; forum lists; forum posts; homepage; implicit navigation; information analysis; information processing; intelligent crawler; logical navigation; minimal server resource consumption; optimized Web crawling problem; regular expressions; Crawlers; Databases; Educational institutions; Internet; Knowledge based systems; Message systems; Software packages; URL type; crawler; forum; regular expressions; web search;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Telecommunications Forum (TELFOR), 2013 21st
Conference_Location :
Belgrade
Print_ISBN :
978-1-4799-1419-7
Type :
conf
DOI :
10.1109/TELFOR.2013.6716355
Filename :
6716355