Title :
Enhancing Entropy-based Informative Block Identification Using Block Preclustering Technology
Author :
Huang, Chia-Hsin ; Yen, Po-Yi ; Hung, Yi-Chan ; Chuang, Tyng-Ruey ; Lee, Hahn-Ming
Author_Institution :
Nat. Taiwan Univ. of Sci. & Technol., Taipei
Abstract :
Identifying informative blocks to extract valuable content from web pages is a typical but crucial task in the web mining field. Currently, entropy-based informative block extraction approaches achieve both high precision and high recall. However, they are unable to identify blocks containing a few terms that are used frequently in the main text. To overcome this drawback, we propose a novel approach, called block analyzer, which preclusters blocks based on their structure. An entropy value is then assigned to each cluster as its weight, which is used to determine whether the blocks in the cluster are informative or not. Our experimental results show that about 70% of the blocks collected from five types of web sites were classified as either noisy or informative by both our method and an entropy-based approach. The other 30% of the blocks were judged as informative by both human analysis and our method, but not by the entropy-based method.
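The idea summarized in the abstract (group blocks by structure first, then weight each group by entropy) can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the entropy formula, the clustering criterion, or the decision rule. The sketch assumes that blocks sharing an identical structure signature (for example, a tag path) form one cluster, that a cluster's weight is the mean normalized Shannon entropy of its terms' distribution over pages, and that a cluster is informative when this value is at or below a threshold. The names `term_entropy`, `cluster_entropies`, `informative_signatures`, and the default `threshold=0.5` are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def term_entropy(page_counts, n_pages):
    """Normalized Shannon entropy of one term's spread over pages, in [0, 1]."""
    total = sum(page_counts.values())
    if n_pages < 2 or total == 0:
        return 0.0
    h = -sum((c / total) * math.log(c / total) for c in page_counts.values())
    return h / math.log(n_pages)


def cluster_entropies(blocks):
    """Map each structure signature to the mean entropy of its terms.

    blocks: iterable of (page_id, structure_signature, terms) tuples.
    Assumption: blocks with the same signature belong to the same cluster.
    """
    clusters = defaultdict(list)
    pages = set()
    for page_id, signature, terms in blocks:
        clusters[signature].append((page_id, terms))
        pages.add(page_id)
    n_pages = len(pages)

    result = {}
    for signature, members in clusters.items():
        # Count, per term, how often it occurs on each page within this cluster.
        per_term = defaultdict(Counter)
        for page_id, terms in members:
            for term in terms:
                per_term[term][page_id] += 1
        if not per_term:
            result[signature] = 1.0  # assumption: a cluster with no terms is noise
            continue
        entropies = [term_entropy(c, n_pages) for c in per_term.values()]
        result[signature] = sum(entropies) / len(entropies)
    return result


def informative_signatures(blocks, threshold=0.5):
    """Signatures whose cluster entropy is at or below the (hypothetical) threshold."""
    return {s for s, h in cluster_entropies(blocks).items() if h <= threshold}
```

Under these assumptions, a navigation bar that repeats the same terms on every page receives an entropy near 1 and is treated as noise, while a main-content cluster whose terms differ from page to page receives an entropy near 0. Because the weight is computed per cluster rather than per block, a short block inherits the label of the structurally similar blocks it is grouped with.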
Keywords :
Web sites; data mining; entropy; feature extraction; Web mining field; Web pages; block preclustering technology; entropy-based informative block identification; human analysis; preclusters blocks; Cybernetics; Data mining; Electronic publishing; Entropy; HTML; Humans; Internet; Portals; Web mining; Web pages;
Conference_Title :
Systems, Man and Cybernetics, 2006. SMC '06. IEEE International Conference on
Conference_Location :
Taipei
Print_ISBN :
1-4244-0099-6
Electronic_ISBN :
1-4244-0100-3
DOI :
10.1109/ICSMC.2006.385262