Title :
Recognition of common areas in a Web page using visual information: a possible application in a page classification
Author :
Kovacevic, Milo ; Diligenti, Michelangelo ; Gori, Marco ; Milutinovic, Veljko
Author_Institution :
Sch. of Civil Eng., Belgrade Univ., Serbia
Abstract :
Extracting and processing information from Web pages is an important task in many areas, such as constructing search engines, information retrieval, and data mining from the Web. A common approach in the extraction process is to represent a page as a "bag of words" and then to perform additional processing on this flat representation. We propose a new, hierarchical representation that includes browser screen coordinates for every HTML object in a page. Using this visual information, one can define heuristics for recognizing common page areas such as the header, left and right menus, footer, and center of a page. Initial experiments show that, using our heuristics, the defined objects are recognized correctly in 73% of cases. Finally, we show that a Naive Bayes classifier that takes the proposed representation into account clearly outperforms the same classifier using only information about the content of documents.
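The idea of recognizing page areas from screen coordinates can be illustrated with a minimal sketch. It is not the authors' heuristics, which the abstract does not specify: the `Box` type, the `classify_area` function, and the 20% edge threshold are all hypothetical choices made for illustration. The sketch assumes each HTML object already has a bounding box in page coordinates and assigns it to an area based on where the box's center falls.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Bounding box of an HTML object in page coordinates (hypothetical type)."""
    x: float
    y: float
    width: float
    height: float


def classify_area(box: Box, page_width: float, page_height: float,
                  edge_frac: float = 0.2) -> str:
    """Assign a box to a page area by the position of its center.

    edge_frac is an assumed threshold, not a value taken from the paper:
    the outer 20% bands of the page are treated as header, footer, and menus.
    """
    cx = box.x + box.width / 2   # horizontal center of the object
    cy = box.y + box.height / 2  # vertical center of the object

    # Vertical bands are checked first, so a full-width banner at the top
    # is labeled "header" rather than a menu.
    if cy < edge_frac * page_height:
        return "header"
    if cy > (1 - edge_frac) * page_height:
        return "footer"
    if cx < edge_frac * page_width:
        return "left menu"
    if cx > (1 - edge_frac) * page_width:
        return "right menu"
    return "center"


if __name__ == "__main__":
    # A narrow column on the left side of a 1000 x 2000 page.
    print(classify_area(Box(0, 500, 150, 600), 1000, 2000))
```

Under this kind of scheme, the resulting area label could be attached to each word as an extra feature before classification, which is one plausible way a classifier might use more than the flat "bag of words".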
Keywords :
Web sites; classification; data mining; hypermedia markup languages; information retrieval; search engines; HTML; Naive Bayes classifier; Web page common area recognition; browser screen coordinates; experiments; heuristics; hierarchical representation; page classification; visual information; Civil engineering; Crawlers; Frequency; Humans; Machine learning; Web pages
Conference_Title :
Data Mining, 2002. ICDM 2002. Proceedings. 2002 IEEE International Conference on
Print_ISBN :
0-7695-1754-4
DOI :
10.1109/ICDM.2002.1183910