Regional Information Center for Science and Technology - Automatic identification of informative sections of Web pages

DocumentCode :

1126065

Title :

Automatic identification of informative sections of Web pages

Author :

Debnath, Sandip ; Mitra, Prasenjit ; Pal, Nirmal ; Giles, C. Lee

Author_Institution :

Dept. of Comput. Sci. & Eng., Pennsylvania State Univ., University Park, PA, USA

Volume :

Issue :

fYear :

2005

Firstpage :

1233

Lastpage :

1246

Abstract :

Web pages, especially dynamically generated ones, contain several items that cannot be classified as the "primary content," e.g., navigation sidebars, advertisements, and copyright notices. Most clients and end users search for the primary content and largely do not seek the noninformative content. A tool that helps an end user or application search and process information from Web pages automatically must separate the "primary content sections" from the other content sections. We call these sections "Web page blocks" or just "blocks." First, a tool must segment the Web pages into Web page blocks and, second, the tool must separate the primary content blocks from the noninformative content blocks. In this paper, we formally define Web page blocks and devise a new algorithm to partition an HTML page into constituent Web page blocks. We then propose four new algorithms: ContentExtractor, FeatureExtractor, K-FeatureExtractor, and L-Extractor. These algorithms identify primary content blocks by 1) looking for blocks that do not occur a large number of times across Web pages, by 2) looking for blocks with desired features, and by 3) using classifiers trained with block features, respectively. When run on several thousand Web pages obtained from various Web sites, our algorithms outperform several existing algorithms with respect to runtime and/or accuracy. Furthermore, we show that a Web cache system that applies our algorithms to remove noninformative content blocks and to identify similar blocks across Web pages can achieve significant storage savings.
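The first idea in the abstract, that blocks repeated across many pages of a site are unlikely to be primary content, can be illustrated with a minimal sketch. This is not the authors' ContentExtractor algorithm: the abstract does not specify how blocks are compared, so the sketch assumes pages are already segmented into text blocks and matches blocks by exact text after whitespace and case normalization. The names `normalize`, `primary_blocks`, and `max_page_fraction` are hypothetical and chosen only for illustration.

```python
from collections import Counter


def normalize(block: str) -> str:
    """Collapse whitespace and lowercase so trivially different copies match."""
    return " ".join(block.split()).lower()


def primary_blocks(pages, max_page_fraction=0.5):
    """Keep, per page, the blocks found in at most max_page_fraction of all pages.

    `pages` is a list of pages, each page a list of block strings.
    Exact-match comparison is an assumption of this sketch.
    """
    # Count the number of distinct pages in which each normalized block appears.
    page_counts = Counter()
    for blocks in pages:
        page_counts.update({normalize(b) for b in blocks})

    limit = max_page_fraction * len(pages)
    return [
        [b for b in blocks if page_counts[normalize(b)] <= limit]
        for blocks in pages
    ]


# Toy input: a navigation bar and a copyright notice repeat on every page.
pages = [
    ["Home | News | Sports", "Storm hits the coast overnight.", "Copyright 2005 Example Corp"],
    ["Home | News | Sports", "Local team wins the final.", "Copyright 2005 Example Corp"],
    ["Home | News | Sports", "New library opens downtown.", "Copyright 2005 Example Corp"],
]
result = primary_blocks(pages)
```

On this toy input only the article sentence of each page survives, because the navigation bar and the copyright notice appear on all three pages. A realistic system would need a similarity measure over block features rather than exact text equality, and the threshold would have to be tuned per site.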

Keywords :

Internet; cache storage; classification; content management; data mining; feature extraction; hypermedia markup languages; search engines; text analysis; ContentExtractor algorithm; HTML page partition; K-FeatureExtractor algorithm; L-Extractor algorithm; Web cache system; Web mining; Web page blocks; Web sites; informative sections automatic identification; inverse block document frequency; noninformative content blocks; primary content sections; text mining; HTML; Navigation; Partitioning algorithms; Runtime; Web pages; Web page block; feature extraction or construction; informative block

fLanguage :

English

Journal_Title :

Knowledge and Data Engineering, IEEE Transactions on

Publisher :

IEEE

ISSN :

1041-4347

Type :

jour

DOI :

10.1109/TKDE.2005.138

Filename :

1490530

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=1126065