DocumentCode :
2354021
Title :
Extraction of Informative Blocks from Web Pages
Author :
Cao, Yujuan ; Niu, Zhendong ; Dai, Liuling ; Zhao, YuMing
Author_Institution :
Lab. of Comput. Sci., Beijing Inst. of Technol., Beijing
fYear :
2008
fDate :
23-25 July 2008
Firstpage :
544
Lastpage :
549
Abstract :
Web pages typically contain a large number of banner ads, navigation bars, copyright notices, and similar elements. Such irrelevant information is not part of a page's main content, and it can seriously harm Web mining and searching. In this paper, we develop and evaluate a method that uses both visual features and semantic information to extract informative blocks. First, we partition a Web page into semantic blocks using vision-based page segmentation. The visual information and the semantic information obtained by LSI (Latent Semantic Indexing) are extracted to form the feature vector of each block. Second, we manually label the blocks as informative or uninformative. The labeled blocks are used as a training dataset to train a classification model, which is then used to extract the informative blocks. Our experiments show that the proposed EIBA (Extract Informative Block Arithmetic) dramatically improves results in near-duplicate detection and classification tasks.
Keywords :
Internet; Web sites; data mining; feature extraction; text analysis; Web mining; Web page content; Web searching; classification model; extract informative block arithmetic; feature vector; informative block extraction; latent semantic indexing; semantic information; vision-based page segmentation; visual features; Arithmetic; Data mining; HTML; Indexing; Information retrieval; Large scale integration; Navigation; Text categorization; Web mining; Web pages; LSI; SVM; VIPS; Web; Web Page segmentation; data mining; information extraction;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Advanced Language Processing and Web Information Technology, 2008. ALPIT '08. International Conference on
Conference_Location :
Dalian, Liaoning
Print_ISBN :
978-0-7695-3273-8
Type :
conf
DOI :
10.1109/ALPIT.2008.106
Filename :
4584425