• DocumentCode
    480698
  • Title
    Leveraging Web 2.0 Sources for Web Content Classification
  • Author
    Banerjee, Somnath; Scholz, Martin
  • Author_Institution
    Hewlett-Packard Labs., Bangalore
  • Volume
    1
  • fYear
    2008
  • fDate
    9-12 Dec. 2008
  • Firstpage
    300
  • Lastpage
    306
  • Abstract
    This paper addresses practical aspects of Web page classification not captured by the classical text mining framework. Classifiers are supposed to perform well on a broad variety of pages. We argue that constructing training corpora is a bottleneck for building such classifiers, and that care has to be taken if the goal is to generalize to previously unseen kinds of pages on the Web. We study techniques for building training corpora automatically from publicly available Web resources, quantify the discrepancy between them, and demonstrate that encouraging agreement between classifiers given such diverse sources drastically outperforms methods that ignore the different natures of data sources on the Web.
  • Keywords
    Internet; classification; data mining; text analysis; Web 2.0 source; Web content classification; text mining; Buildings; Information filtering; Information filters; Information services; Intelligent agent; Labeling; Web pages; Web sites; corpus construction; web 2.0; web classification
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Web Intelligence and Intelligent Agent Technology, 2008. WI-IAT '08. IEEE/WIC/ACM International Conference on
  • Conference_Location
    Sydney, NSW
  • Print_ISBN
    978-0-7695-3496-1
  • Type
    conf
  • DOI
    10.1109/WIIAT.2008.291
  • Filename
    4740464