• DocumentCode
    2869971
  • Title
    Using Visual Features for Fine-Grained Genre Classification of Web Pages
  • Author
    Levering, Ryan ; Cutler, Michal ; Yu, Lei
  • Author_Institution
    State Univ. of New York, Binghamton
  • fYear
    2008
  • fDate
    7-10 Jan. 2008
  • Firstpage
    131
  • Lastpage
    131
  • Abstract
    The field of automatic genre classification has primarily focused on extracting textual features from documents. The goal of this research is to investigate whether visual features of HTML web pages can improve the classification of fine-grained genres. Intuitively, visual features should help; the challenge is to extract those that capture the layout characteristics of the genres. A corpus of Web pages from different e-commerce sites was generated and manually classified into several genres. Three feature sets were compared: one with textual features only, one with HTML-level features added, and a third with visual features added. Our experiments confirm that using HTML features, and particularly URL address features, can improve classification beyond using textual features alone. We also show that adding visual features can further improve fine-grained genre classification.
  • Keywords
    Web sites; document image processing; feature extraction; hypermedia markup languages; image classification; query processing; HTML Web pages; e-commerce sites; fine-grained genre classification; textual feature extraction; visual features; Context; Costs; Design methodology; Feature extraction; HTML; Information science; Printers; Search engines; Uniform resource locators; Web pages;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Hawaii International Conference on System Sciences, Proceedings of the 41st Annual
  • Conference_Location
    Waikoloa, HI
  • ISSN
    1530-1605
  • Type
    conf
  • DOI
    10.1109/HICSS.2008.488
  • Filename
    4438834