Regional Information Center for Science and Technology (RICEST) - Web page DOM node characterization and its application to page segmentation

DocumentCode :

1970421

Title :

Web page DOM node characterization and its application to page segmentation

Author :

Vineel, Gujjar

Author_Institution :

GE Res., Comput. & Decision Sci. Lab., India

fYear :

2009

fDate :

9-11 Dec. 2009

Firstpage :

Lastpage :

Abstract :

Web pages are generally organized in terms of visually distinct segments, such as Navigation bars, Advertisement banners, Headers, Portlets and Widgets. Despite their apparently structured layout, web pages are considered a source of unstructured data from an information extraction point of view. Hence, as a step towards interpreting the organization of web data, web page segmentation attempts to identify cohesive regions of a page. In this paper, we present a novel DOM tree mining approach for page segmentation. We first characterize the nodes of the DOM tree structure based on their Content Size and Entropy. While the Content Size of a node indicates the amount of textual content contributed by its subtree, Entropy measures the strength of local "patterns" exhibited therein. In other words, a node manifesting highly repetitive patterns receives a high Entropy as per our formulation. Based on this characterization of DOM nodes, we then develop an unsupervised algorithm to automatically identify segments of a given web page.
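
The two node measures named in the abstract can be illustrated with a minimal Python sketch. It is not the paper's method: this record does not give the Entropy formulation, so the sketch substitutes a hypothetical stand-in, repetition_score, defined as one minus the normalized Shannon entropy of a node's child tag names (1.0 when all children share one tag, 0.0 when all differ). Content size is assumed here to mean the number of text characters in a node's subtree. The names Node, TreeBuilder, parse, content_size and repetition_score are all illustrative.

```python
from collections import Counter
from html.parser import HTMLParser
from math import log2

# Elements that never have a closing tag (partial list, for the sketch only).
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    def __init__(self, tag):
        self.tag = tag
        self.children = []
        self.text_len = 0  # characters of text directly inside this node


class TreeBuilder(HTMLParser):
    """Builds a simplified DOM-like tree from HTML markup."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        self.stack[-1].text_len += len(data.strip())


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def content_size(node):
    """Assumed definition: text characters contributed by the whole subtree."""
    return node.text_len + sum(content_size(c) for c in node.children)


def repetition_score(node):
    """Hypothetical stand-in for the paper's Entropy, NOT its formula:
    1 - H / log2(n), where H is the Shannon entropy of the child tag names."""
    n = len(node.children)
    if n < 2:
        return 0.0
    counts = Counter(c.tag for c in node.children)
    h = -sum((k / n) * log2(k / n) for k in counts.values())
    return 1.0 - h / log2(n)
```

Under these assumptions, a navigation list such as `<ul><li>Home</li><li>News</li><li>About</li></ul>` has a content size of 13 and a repetition score of 1.0, whereas a `div` holding one `h1` and one `p` scores 0.0. A segmentation pass could then threshold both values per node, though how the paper combines them is not stated in this record.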

Keywords :

Internet; distributed object management; entropy; DOM tree mining approach; Web page DOM node characterization; advertisement banners; content size; entropy; headers; information extraction; navigation bars; page segmentation application; portlets; unstructured data; visually distinct segments; widgets; Bars; Data mining; Entropy; HTML; Navigation; Size measurement; Tree data structures; Tree graphs; Usability; Web pages; Document Object Model; Entropy; Web Information Extraction; Web Page Segmentation;

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Internet Multimedia Services Architecture and Applications (IMSAA), 2009 IEEE International Conference on

Conference_Location :

Bangalore

Print_ISBN :

978-1-4244-4792-3

Electronic_ISBN :

978-1-4244-4793-0

Type :

conf

DOI :

10.1109/IMSAA.2009.5439444

Filename :

5439444

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=1970421