DocumentCode :
3074537
Title :
Enhancing Entropy-based Informative Block Identification Using Block Preclustering Technology
Author :
Huang, Chia-Hsin ; Yen, Po-Yi ; Hung, Yi-Chan ; Chuang, Tyng-Ruey ; Lee, Hahn-Ming
Author_Institution :
Nat. Taiwan Univ. of Sci. & Technol., Taipei
Volume :
3
fYear :
2006
fDate :
8-11 Oct. 2006
Firstpage :
2640
Lastpage :
2645
Abstract :
Identifying informative blocks to extract valuable content from web pages is a typical but crucial task in the web mining field. Currently, entropy-based informative block extraction approaches achieve both high precision and high recall. However, they are unable to identify blocks containing a few terms that are used frequently in the main text. To overcome this drawback, we propose a novel approach, called block analyzer, which preclusters blocks based on their structure. An entropy value is then assigned to each cluster as its weight, which is used to determine whether the blocks in the cluster are informative or not. Our experimental results show that about 70% of the blocks collected from five types of web sites were classified as either noisy or informative by both our method and an entropy-based approach, while the other 30% of the blocks were judged informative by both human analysis and our method, but not by the entropy-based method.
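The two steps named in the abstract (precluster blocks by structure, then weight each cluster with an entropy value) can be illustrated with a minimal Python sketch. It is not the authors' block analyzer: the abstract does not give the clustering criterion, the entropy formula, or the decision threshold. The sketch assumes that blocks with an identical structure signature (for example a tag path) form one cluster, that a term's entropy is the normalized Shannon entropy of its occurrences across pages, that a cluster's weight is the mean entropy of its terms, and that a weight below a hypothetical threshold of 0.5 means "informative". The names `term_entropy`, `classify_clusters`, and `threshold` are all invented for illustration.

```python
import math
from collections import Counter, defaultdict


def term_entropy(page_counts, n_pages):
    """Normalized Shannon entropy (0..1) of one term's occurrences over pages.

    Assumption: a term spread evenly over all pages (entropy near 1) is
    template-like; a term concentrated on few pages (entropy near 0) is not.
    """
    total = sum(page_counts.values())
    if n_pages < 2 or total == 0:
        return 0.0
    h = -sum((c / total) * math.log(c / total) for c in page_counts.values())
    return h / math.log(n_pages)


def classify_clusters(blocks, threshold=0.5):
    """Precluster blocks by structure, then label each cluster by its entropy.

    blocks: iterable of (page_id, structure_signature, list_of_terms).
    Returns {structure_signature: (cluster_weight, label)}.
    The threshold is a hypothetical value, not one taken from the paper.
    """
    blocks = list(blocks)
    n_pages = len({page_id for page_id, _, _ in blocks})

    # Step 1: precluster. Blocks with the same signature share a cluster.
    clusters = defaultdict(list)
    for page_id, signature, terms in blocks:
        clusters[signature].append((page_id, terms))

    result = {}
    for signature, members in clusters.items():
        # Count, per term, how often it occurs on each page within the cluster.
        per_term = defaultdict(Counter)
        for page_id, terms in members:
            for term in terms:
                per_term[term][page_id] += 1

        # Step 2: the cluster weight is the mean entropy of its terms.
        entropies = [term_entropy(c, n_pages) for c in per_term.values()]
        weight = sum(entropies) / len(entropies) if entropies else 1.0
        label = "informative" if weight < threshold else "noisy"
        result[signature] = (weight, label)
    return result
```

Because the label is attached to the cluster rather than to a single block, a short block inherits the weight of the structurally similar blocks it was grouped with, which is the effect the abstract attributes to preclustering.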
Keywords :
Web sites; data mining; entropy; feature extraction; Web mining field; Web pages; block preclustering technology; entropy-based informative block identification; human analysis; preclusters blocks; Cybernetics; Data mining; Electronic publishing; Entropy; HTML; Humans; Internet; Portals; Web mining; Web pages;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Systems, Man and Cybernetics, 2006. SMC '06. IEEE International Conference on
Conference_Location :
Taipei
Print_ISBN :
1-4244-0099-6
Electronic_ISBN :
1-4244-0100-3
Type :
conf
DOI :
10.1109/ICSMC.2006.385262
Filename :
4274268