DocumentCode :
1970421
Title :
Web page DOM node characterization and its application to page segmentation
Author :
Vineel, Gujjar
Author_Institution :
GE Res., Comput. & Decision Sci. Lab., India
fYear :
2009
fDate :
9-11 Dec. 2009
Firstpage :
1
Lastpage :
6
Abstract :
Web pages are generally organized in terms of visually distinct segments, such as Navigation bars, Advertisement banners, Headers, Portlets and Widgets. Despite the apparently structured layout, web pages are considered a source of unstructured data from an information extraction point of view. Hence, as a step towards interpreting the organization of web data, web page segmentation attempts to identify cohesive regions of a page. In this paper, we present a novel DOM tree mining approach for page segmentation. We first characterize the nodes of the DOM tree structure, based on their Content Size and Entropy. While the Content Size of a node indicates the amount of textual content contributed by its subtree, Entropy measures the strength of local "patterns" exhibited therein. In other words, a node manifesting highly repetitive patterns yields a high Entropy as per our formulation. Based on the characterization of DOM nodes, we then develop an unsupervised algorithm to automatically identify segments of a given web page.
Keywords :
Internet; distributed object management; entropy; DOM tree mining approach; Web page DOM node characterization; advertisement banners; content size; headers; information extraction; navigation bars; page segmentation application; portlets; unstructured data; visually distinct segments; widgets; Bars; Data mining; HTML; Navigation; Size measurement; Tree data structures; Tree graphs; Usability; Web pages; Document Object Model; Web Information Extraction; Web Page Segmentation
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Internet Multimedia Services Architecture and Applications (IMSAA), 2009 IEEE International Conference on
Conference_Location :
Bangalore
Print_ISBN :
978-1-4244-4792-3
Electronic_ISBN :
978-1-4244-4793-0
Type :
conf
DOI :
10.1109/IMSAA.2009.5439444
Filename :
5439444