• DocumentCode
    3635083
  • Title
    An n-Gram Based Approach to Multi-Labeled Web Page Genre Classification
  • Author
    Jane E. Mason; Michael Shepherd; Jack Duffy; Vlado Keselj; Carolyn Watters
  • Author_Institution
    Dalhousie Univ., Halifax, NS, Canada
  • fYear
    2010
  • Firstpage
    1
  • Lastpage
    10
  • Abstract
    The extraordinary growth in both the size and popularity of the World Wide Web has created a growing interest not only in identifying Web page genres, but also in using these genres to classify Web pages. The hypothesis of this research is that an n-gram representation of a Web page can be used effectively to automatically classify that Web page by genre, even when the Web page belongs to more than one genre. Experiments are run on a multi-labeled data set using both an SVM classifier and a distance function classification model. These n-gram based methods had very high precision results but somewhat lower recall results, indicating that the genre labels assigned by the classifiers are quite accurate, but that these machine learning classifiers are not assigning as many labels as did the human classifiers. The classification results compare favorably with those of other researchers on the same data set.
  • Keywords
    "Web pages","HTML","Uniform resource locators","Web sites","Information filtering","Information filters","Labeling","Statistics","Lifting equipment","Support vector machines"
  • Publisher
    IEEE
  • Conference_Title
    System Sciences (HICSS), 2010 43rd Hawaii International Conference on
  • ISSN
    1530-1605
  • Print_ISBN
    978-1-4244-5509-6
  • Type
    conf
  • DOI
    10.1109/HICSS.2010.58
  • Filename
    5428327
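
The abstract describes representing a Web page by its n-grams and assigning one or more genre labels with a distance function classification model. The record does not give the n-gram type, the profile size, the distance function, or the labeling rule, so the following is only a minimal sketch under stated assumptions: character n-grams, a relative squared-difference distance between frequency profiles, and a fixed distance threshold for assigning labels. The names `ngram_profile`, `profile_distance`, `assign_genres`, and the parameters `n`, `top_k`, and `threshold` are hypothetical and chosen for illustration; they are not taken from the paper.

```python
from collections import Counter


def ngram_profile(text, n=3, top_k=500):
    """Map the top_k most frequent character n-grams to normalized frequencies.

    Assumption: character n-grams and a truncated profile; the record does
    not state which representation the authors used.
    """
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    total = len(grams)
    if total == 0:
        return {}
    return {g: c / total for g, c in Counter(grams).most_common(top_k)}


def profile_distance(p1, p2):
    """Sum of squared relative frequency differences over the union of n-grams.

    Assumption: this particular distance is illustrative only. An n-gram
    missing from a profile counts as frequency 0 there.
    """
    dist = 0.0
    for g in set(p1) | set(p2):
        f1 = p1.get(g, 0.0)
        f2 = p2.get(g, 0.0)
        # f1 + f2 > 0 because g occurs in at least one profile.
        dist += ((f1 - f2) / ((f1 + f2) / 2.0)) ** 2
    return dist


def assign_genres(page_text, genre_profiles, threshold, n=3, top_k=500):
    """Return every genre whose profile lies within threshold of the page.

    Assumption: thresholding is a hypothetical multi-label rule; a page may
    receive zero, one, or several labels.
    """
    page = ngram_profile(page_text, n, top_k)
    return sorted(
        genre
        for genre, profile in genre_profiles.items()
        if profile_distance(page, profile) <= threshold
    )
```

In this sketch a stricter (smaller) threshold yields fewer labels per page, which is one way a classifier could end up with high precision but lower recall than human labelers, as the abstract reports; whether the authors' model behaves this way for the same reason is not stated in the record.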