DocumentCode :
2580726
Title :
Web information hierarchy and importance mining based on DOM information distillation
Author :
Feng, Tseng-Yi ; Kao, Hung-Yu
Author_Institution :
Dept. of Comput. Sci. & Inf. Eng., Nat. Cheng Kung Univ., Tainan, Taiwan
fYear :
2005
fDate :
15-16 Aug. 2005
Abstract :
The Web gives people a convenient way to disseminate and search for information. Due to the growth of dynamic page generation techniques, the number and complexity of Web pages have been increasing explosively, as has the information contained within them. Redundant information is distributed throughout a page, making it difficult to automatically identify the useful information in that page. In this paper, we propose and implement a simple Web importance extraction and labeling system based on the analysis of the content information and vision information of a Web page. We apply information theory to the document object model (DOM) trees of pages and extract the vision information for each block to evaluate its importance. Results show that our system effectively extracts and labels the importance of a page and provides a powerful surfing interface for browsing on small display devices. Experiments on several Web sites show high performance in meeting users' information focus.
Keywords :
Internet; Web sites; data mining; information analysis; information retrieval; information theory; DOM information distillation; Web importance extraction; Web information hierarchy; Web labeling system; Web page complexity; Web page vision information; World Wide Web; content information analysis; document object model trees; dynamic page generation; importance mining; information dissemination; information searching; vision information extraction; Buildings; Computer science; Data mining; Displays; Electronic mail; Information analysis; Information theory; Labeling; Power system modeling; Web pages;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Emerging Information Technology Conference, 2005.
Print_ISBN :
0-7803-9328-7
Type :
conf
DOI :
10.1109/EITC.2005.1544346
Filename :
1544346