Title :
A Topic-Specific Web Crawler with Web Page Hierarchy Based on HTML Dom-Tree
Author :
Yang, Yuekui ; Du, Yajun ; Hai, Yufeng ; Gao, Zhaoqiong
Author_Institution :
Sch. of Math. & Comput. Sci., Xihua Univ., Chengdu, China
Abstract :
With the Internet growing exponentially, web data mining has become the main method of finding relevant information. As the number of web sites and documents grows even faster and site contents are updated more and more often, focused web crawlers are becoming increasingly popular. The ordering of unvisited URLs has been studied extensively in the literature. Existing approaches calculate the prediction score from the unvisited URLs' ancestors, but all URLs in one web page are assigned the same score; in other words, a web page is assumed to carry only one topic. We find, however, that the different parts of a web page carry their own topic information while together supporting one or several larger topics, so URLs in different paragraphs should be given different scores based on the hierarchical relationship among them. In this paper, we parse every web page as a Dom-Tree, propose rules on the tree for extracting the relationships among different paragraphs, and then present a new topic-specific web crawler that calculates an unvisited URL's prediction score based on the web page hierarchy and text semantic similarity. We consider three factors. First, we calculate text similarity using the vector space model (VSM), which treats the query or paragraph as a vector whose terms are independent. Second, because the terms in a text paragraph are in fact related through their order, we use an edit distance based on term sequences to compensate for this independence assumption. Third, different paragraphs in a web page are connected according to their hierarchy in the Dom-Tree. Finally, we combine the three factors in our crawler's strategy and present our model.
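The abstract names the three factors but does not give the formula that combines them, so the following is only a minimal sketch of how such a score might be assembled, not the authors' model. The function names, the mixing weight `alpha`, the per-level `decay`, and the weighted-average combination are all assumptions made for illustration.

```python
import math
from collections import Counter


def cosine_similarity(query_terms, paragraph_terms):
    """VSM factor: cosine between two term-frequency vectors."""
    q, p = Counter(query_terms), Counter(paragraph_terms)
    dot = sum(q[t] * p[t] for t in q)
    norm = (math.sqrt(sum(v * v for v in q.values()))
            * math.sqrt(sum(v * v for v in p.values())))
    return dot / norm if norm else 0.0


def term_edit_distance(a, b):
    """Levenshtein distance over term sequences (terms, not characters)."""
    prev = list(range(len(b) + 1))
    for i, term_a in enumerate(a, 1):
        curr = [i]
        for j, term_b in enumerate(b, 1):
            cost = 0 if term_a == term_b else 1
            curr.append(min(prev[j] + 1,          # delete term_a
                            curr[j - 1] + 1,      # insert term_b
                            prev[j - 1] + cost))  # substitute
        prev = curr
    return prev[-1]


def sequence_similarity(a, b):
    """Edit distance normalised to [0, 1]; 1.0 means identical sequences."""
    longest = max(len(a), len(b))
    return 1.0 - term_edit_distance(a, b) / longest if longest else 1.0


def url_score(query_terms, ancestor_paragraphs, alpha=0.5, decay=0.5):
    """Hypothetical prediction score for a link.

    ancestor_paragraphs: term lists ordered from the paragraph that contains
    the link up towards the root of the Dom-Tree. Both alpha and decay are
    assumed parameters, not values taken from the paper.
    """
    score, weight_sum = 0.0, 0.0
    for depth, terms in enumerate(ancestor_paragraphs):
        weight = decay ** depth  # nearer paragraphs count more
        similarity = (alpha * cosine_similarity(query_terms, terms)
                      + (1 - alpha) * sequence_similarity(query_terms, terms))
        score += weight * similarity
        weight_sum += weight
    return score / weight_sum if weight_sum else 0.0
```

Under these assumptions, a link whose enclosing paragraph matches the query scores higher than a link whose only matching text sits further up the tree, which is the behaviour the abstract argues for.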
Keywords :
Internet; Web sites; hypermedia markup languages; text analysis; trees (mathematics); HTML Dom-Tree; Web crawler; Web page hierarchy; World Wide Web; data mining; prediction score; text paragraph; text similarity; unvisited URL; vector space model; Computer science; Crawlers; HTML; Information processing; Mathematics; Search engines; Uniform resource locators; Web pages; Dom-Tree; Edit distance; Focused web crawler; Semantic similarity
Conference_Title :
Information Processing, 2009. APCIP 2009. Asia-Pacific Conference on
Conference_Location :
Shenzhen
Print_ISBN :
978-0-7695-3699-6
DOI :
10.1109/APCIP.2009.110