DocumentCode :
524380
Title :
Improved focused crawling approach for retrieving relevant pages based on block partitioning
Author :
Hati, Debashis ; Kumar, Amritesh
Author_Institution :
Sch. of Comput. Eng., KIIT Univ., Bhubaneswar, India
Volume :
3
fYear :
2010
fDate :
22-24 June 2010
Abstract :
Crawlers are programs that traverse the Internet and retrieve web pages by following hyperlinks. Given the large number of websites, traditional web crawlers cannot retrieve relevant pages effectively. To address this, focused crawlers use semantic web technologies to analyze the semantics of hyperlinks and web documents. A focused crawler is a special-purpose search engine that selectively seeks out pages relevant to a predefined set of topics, rather than exploring all regions of the web. Its main characteristic is that it does not need to collect all web pages; it selects and retrieves only the relevant ones. The major problem is therefore how to retrieve the maximal set of relevant, high-quality pages. To address this problem, we have designed a focused crawler that calculates the relevancy of each block in a web page. Blocks are partitioned by the VIPS algorithm. Page relevancy is calculated as the sum of all block relevancy scores in a page. The crawler also calculates a URL score to identify whether a URL is relevant to a specific topic.
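The scoring step described in the abstract (page relevancy as the sum of block relevancy scores) can be illustrated with a minimal sketch. The abstract does not give the actual formulas, so everything below is an assumption: block relevancy is approximated here by term-frequency cosine similarity in a vector space model, the blocks are passed in as plain strings rather than produced by VIPS, and the names `term_vector`, `cosine`, `block_relevance`, and `page_relevance` are hypothetical.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Build a lowercased term-frequency vector for a piece of text."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def block_relevance(block_text, topic_text):
    """Assumed block score: similarity between block text and topic terms."""
    return cosine(term_vector(block_text), term_vector(topic_text))


def page_relevance(blocks, topic_text):
    """Page score as the sum of its block scores, as the abstract states."""
    return sum(block_relevance(b, topic_text) for b in blocks)


# Hypothetical blocks, standing in for the output of a VIPS partitioning step.
blocks = [
    "focused crawler retrieves relevant web pages",
    "contact us privacy policy",
]
topic = "focused crawler"
score = page_relevance(blocks, topic)
```

In this sketch a navigation or footer block that shares no terms with the topic contributes zero to the page score, which is the intuition behind scoring blocks rather than whole pages.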
Keywords :
Internet; information retrieval; search engines; semantic Web; URL score; VIPS algorithm; Web crawlers; Web documents; Web page retrieval; Web sites; block partitioning; block relevancy scores; improved focused crawling approach; relevant page retrieval; semantic Web technology; software crawlers; special-purpose search engine; Computer science education; Crawlers; Educational technology; Partitioning algorithms; Uniform resource locators; Web pages; Web server; focused crawler; vector space model
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Education Technology and Computer (ICETC), 2010 2nd International Conference on
Conference_Location :
Shanghai
Print_ISBN :
978-1-4244-6367-1
Type :
conf
DOI :
10.1109/ICETC.2010.5529547
Filename :
5529547