Title :
A General Approach for Partitioning Web Page Content Based on Geometric and Style Information
Author :
Guo, Hui ; Mahmud, Jalal ; Borodin, Yevgen ; Stent, Amanda ; Ramakrishnan, I.V.
Author_Institution :
Stony Brook Univ., Stony Brook
Abstract :
In this paper, we describe a general-purpose approach for partitioning Web page content. The novelty of our approach lies in the use of detailed layout information from a Web page renderer to determine spatial locality and identify visual separators, and the use of relaxed matching over presentation style information to determine presentation style similarity. We present several examples to illustrate the generality of our approach.
Keywords :
Internet; general-purpose approach; geometric-style information; partitioning Web page content; visual separators; Clustering algorithms; Computer science; HTML; Humans; Marketing and sales; Ontologies; Particle separators; Partitioning algorithms; Rendering (computer graphics); Web pages
Conference_Title :
Ninth International Conference on Document Analysis and Recognition (ICDAR 2007)
Conference_Location :
Parana
Print_ISBN :
978-0-7695-2822-9
DOI :
10.1109/ICDAR.2007.4377051