DocumentCode :
3108745
Title :
A Direct Web Page Templates Detection Method
Author :
Xie Su-bin ; Liang Bin ; Shi Wen-chang ; Liang Zhao-hui ; Yu Xiu-mei ; Zhang Lei
Author_Institution :
Sch. of Inf., Renmin Univ. of China, Beijing, China
fYear :
2011
fDate :
16-18 Aug. 2011
Firstpage :
1
Lastpage :
4
Abstract :
A large number of web sites are now generated from web templates to improve the productivity of web site construction. However, the prevalence of web templates negatively affects search engine efficiency in several respects, including relevance judgment in web information retrieval (IR) and the resource usage of analysis tools. In this paper, we present a direct and fast method that detects pages sharing the same template from the characteristics of their DOM trees. After analyzing and compressing the DOM tree nodes of an HTML page, our method generates a hash digest, also called a fingerprint, that identifies the page's DOM structure. We also introduce additional page features to help judge the page template type. In experimental evaluations over thirty thousand sub-domains, our approach produces results rapidly while achieving an accuracy rate above 95 percent.
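The fingerprinting idea in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm: the abstract does not specify the compression rule or the hash function, so the sketch assumes that "compression" means collapsing runs of identical sibling subtrees and that SHA-1 is an acceptable digest. The names `dom_fingerprint`, `_TreeBuilder`, `_Node`, and `_serialize` are hypothetical.

```python
import hashlib
from html.parser import HTMLParser

# Elements that never have a closing tag in HTML.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class _Node:
    """One element of the simplified DOM tree (tag name and children only)."""

    def __init__(self, tag):
        self.tag = tag
        self.children = []


class _TreeBuilder(HTMLParser):
    """Builds a tag-only tree; text and attributes are ignored."""

    def __init__(self):
        super().__init__()
        self.root = _Node("#root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = _Node(tag)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        # Pop back to the nearest matching open tag; ignore stray closers.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break


def _serialize(node):
    """Serialize a subtree, collapsing runs of identical sibling subtrees
    (the assumed compression step)."""
    parts = []
    for child in node.children:
        text = _serialize(child)
        if not parts or parts[-1] != text:
            parts.append(text)
    return node.tag + "(" + ",".join(parts) + ")"


def dom_fingerprint(html):
    """Return a hex digest that depends only on the compressed tag structure."""
    builder = _TreeBuilder()
    builder.feed(html)
    builder.close()
    structure = _serialize(builder.root)
    return hashlib.sha1(structure.encode("utf-8")).hexdigest()


# Two pages from the same (made-up) template, and one structurally different page.
page_a = "<html><body><div><h1>News</h1><ul><li>a</li><li>b</li></ul></div></body></html>"
page_b = "<html><body><div><h1>Sports</h1><ul><li>x</li><li>y</li><li>z</li></ul></div></body></html>"
page_c = "<html><body><table><tr><td>x</td></tr></table></body></html>"

same_template = dom_fingerprint(page_a) == dom_fingerprint(page_b)
different_template = dom_fingerprint(page_a) != dom_fingerprint(page_c)
```

Under these assumptions, `page_a` and `page_b` differ in text and in the number of list items but share a fingerprint, while `page_c` does not. The additional page features mentioned in the abstract are not modeled here.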
Keywords :
Web sites; hypermedia markup languages; search engines; DOM tree characteristics; HTML page; direct web page templates detection method; hash value; search engine; web sites; Accuracy; Compression algorithms; Fingerprint recognition; HTML; Search engines; Web pages;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Internet Technology and Applications (iTAP), 2011 International Conference on
Conference_Location :
Wuhan
Print_ISBN :
978-1-4244-7253-6
Type :
conf
DOI :
10.1109/ITAP.2011.6006435
Filename :
6006435