DocumentCode
2223604
Title
Incremental Web Page Template Detection by Text Segments
Author
Wang, Yu ; Fang, Bingxing ; Cheng, Xueqi ; Guo, Li ; Xu, Hongbo
Author_Institution
Inst. of Comput. Technol., Chinese Acad. of Sci., Beijing
fYear
2008
fDate
14-15 July 2008
Firstpage
174
Lastpage
180
Abstract
Template detection is important for many applications. Most template detection methods use content repetition as a hint to detect template blocks, which requires many Web pages as input. They therefore usually process Web pages in batches, so a newly crawled page cannot be processed until enough pages have been collected. This consumes a large amount of storage to cache Web pages and results in a long delay in data refreshing. In this paper, we present an incremental framework for template detection in which a page is processed as soon as it has been crawled. Under this framework, we don't need to cache any Web pages. Experiments show that our framework consumes less than 7% of the storage used by traditional methods. In addition, the delay in data refreshing induced by batch processing is completely eliminated.
Keywords
Internet; text analysis; Web pages; incremental Web page template detection; text segments; Bars; Cache storage; Computers; Conferences; Degradation; Delay; Feeds; Navigation; Search engines
fLanguage
English
Publisher
IEEE
Conference_Titel
Semantic Computing and Systems, 2008. WSCS '08. IEEE International Workshop on
Conference_Location
Huangshan
Print_ISBN
978-0-7695-3316-2
Electronic_ISBN
978-0-7695-3316-2
Type
conf
DOI
10.1109/WSCS.2008.17
Filename
4570835