DocumentCode
2866147
Title
Using XPath to Discover Informative Content Blocks of Web Pages
Author
Fu, Yan ; Yang, Dongqing ; Tang, Shiwei ; Wang, Tengjiao ; Gao, Jun
Author_Institution
Peking Univ., Beijing
fYear
2007
fDate
29-31 Oct. 2007
Firstpage
450
Lastpage
453
Abstract
Web pages usually contain various kinds of content, some relevant and some irrelevant to the main topic. We define relevant content as informative content blocks and irrelevant content as clutter. Clutter may be intended to mislead search engines or to trigger an artificially high link-based ranking for specific target pages, so cleaning Web pages before mining is critical for improving the performance of traditional information retrieval. Here, we propose a method to discover informative content blocks without supervision. First, using a set of sample pages, we apply a series of rules to distinguish informative content blocks from clutter. Then we generalize a common XPath for the informative content blocks or the clutter and apply it to similar pages. We have applied our method to five different Web sites, producing simpler and more centralized HTML files. Experimental results show that our method can accurately obtain the informative content blocks of a Web page. Another advantage of our approach is that it is completely automatic.
Keywords
Web sites; XML; HTML file; Web page; Web site; XPath; clutters; informative content blocks; Advertising; Cleaning; Data mining; HTML; Image segmentation; Information retrieval; Particle separators; Search engines; Text recognition; Web pages;
fLanguage
English
Publisher
ieee
Conference_Titel
Semantics, Knowledge and Grid, Third International Conference on
Conference_Location
Shan Xi
Print_ISBN
0-7695-3007-9
Electronic_ISBN
978-0-7695-3007-9
Type
conf
DOI
10.1109/SKG.2007.106
Filename
4438592