Abstract :
Web pages are generally organized in terms of visually distinct segments, such as navigation bars, advertisement banners, headers, portlets and widgets. Despite this apparently structured layout, web pages are considered a source of unstructured data from an information extraction point of view. Hence, as a step towards interpreting the organization of web data, web page segmentation attempts to identify cohesive regions of a page. In this paper, we present a novel DOM tree mining approach for page segmentation. We first characterize the nodes of the DOM tree based on their Content Size and Entropy. While the Content Size of a node indicates the amount of textual content contributed by its subtree, Entropy measures the strength of the local "patterns" exhibited therein. In other words, a node manifesting highly repetitive patterns receives a high Entropy under our formulation. Based on this characterization of DOM nodes, we then develop an unsupervised algorithm to automatically identify the segments of a given web page.
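To make the two node measures concrete, the following is a minimal sketch, not the formulation used in the paper, which the abstract does not spell out. It assumes that Content Size is the number of text characters in a node's subtree, and it substitutes a stand-in pattern score (one minus the normalized Shannon entropy of the children's structural signatures) so that repetitive children score high, matching the direction described above. The names `Node`, `TreeBuilder`, `parse`, `content_size`, `signature` and `pattern_score` are all hypothetical, and the parser assumes well-formed HTML with explicitly closed tags.

```python
import math
from collections import Counter
from html.parser import HTMLParser


class Node:
    """A minimal DOM node: tag name, its own text, and child nodes."""

    def __init__(self, tag):
        self.tag = tag
        self.text = ""
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a simplified DOM tree (assumes every tag is explicitly closed)."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self.stack[-1].children.append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        if len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        # Surrounding whitespace is dropped so that layout indentation is not counted.
        self.stack[-1].text += data.strip()


def parse(html):
    """Parse an HTML string and return the synthetic root node."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def content_size(node):
    """Assumed definition: number of text characters contributed by the subtree."""
    return len(node.text) + sum(content_size(child) for child in node.children)


def signature(node):
    """Structural shape of a subtree, ignoring its text."""
    return node.tag + "(" + ",".join(signature(c) for c in node.children) + ")"


def pattern_score(node):
    """Stand-in for the paper's Entropy: 1.0 when all children share one
    structure, 0.0 when every child is structurally distinct."""
    n = len(node.children)
    if n < 2:
        return 0.0
    counts = Counter(signature(child) for child in node.children)
    shannon = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return 1.0 - shannon / math.log2(n)


html = ("<div><ul><li><a>Home</a></li><li><a>News</a></li>"
        "<li><a>About</a></li></ul><p>Hello <b>world</b></p></div>")
root = parse(html)
div = root.children[0]
ul = div.children[0]
print(content_size(div), content_size(ul))
print(pattern_score(ul), pattern_score(div))
```

In this toy page the navigation list has three structurally identical items and therefore gets the maximal pattern score, while the enclosing `div`, whose two children differ in structure, gets the minimal one. A segmentation procedure could in principle use such scores together with content size to decide where to cut the tree, but the actual decision rule is the paper's and is not reproduced here.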
Keywords :
Internet; distributed object management; entropy; DOM tree mining approach; Web page DOM node characterization; advertisement banners; content size; headers; information extraction; navigation bars; page segmentation application; portlets; unstructured data; visually distinct segments; widgets; Bars; Data mining; HTML; Navigation; Size measurement; Tree data structures; Tree graphs; Usability; Web pages; Document Object Model; Web Information Extraction; Web Page Segmentation;