Title :
Web informative content block detecting based on entropy and parent-child relationship in DOM
Author :
Ding, Yanhui ; Li, Qingzhong ; Yan, Zhongmin ; Dong, Yongquan
Author_Institution :
Sch. of Comput. Sci. & Technol., Shandong Univ., Jinan
Abstract :
To increase the commercial value and accessibility of pages, most sites tend to publish their pages with redundant information, such as navigation panels, advertisements, and copyright announcements. Such redundant information exists in almost all pages of a website, which increases the index size of general search engines and causes page topics to drift. In this paper, we propose an informative content block detection system called WICBDPCR (Web Informative Content Block Detecting based on Parent-Child Relationship in the document object model), which applies information theory to the DOM tree in order to detect the informative structure. Experiments on several real commercial Web sites show high precision and recall rates, which validate WICBDPCR's practical applicability.
Keywords :
Web sites; document handling; entropy; information retrieval; search engines; tree data structures; Web informative content block detection system; document object model tree; entropy method; information theory; parent-child relationship; search engine; Automation; Computer science; Data mining; Entropy; IEEE news; Information theory; Navigation; Object detection; Search engines; Web pages
Conference_Title :
Information and Automation, 2008. ICIA 2008. International Conference on
Conference_Location :
Changsha
Print_ISBN :
978-1-4244-2183-1
Electronic_ISBN :
978-1-4244-2184-8
DOI :
10.1109/ICINFA.2008.4607991