DocumentCode
480698
Title
Leveraging Web 2.0 Sources for Web Content Classification
Author
Banerjee, Somnath ; Scholz, Martin
Author_Institution
Hewlett-Packard Labs., Bangalore
Volume
1
fYear
2008
fDate
9-12 Dec. 2008
Firstpage
300
Lastpage
306
Abstract
This paper addresses practical aspects of Web page classification not captured by the classical text mining framework. Classifiers are supposed to perform well on a broad variety of pages. We argue that constructing training corpora is a bottleneck for building such classifiers, and that care has to be taken if the goal is to generalize to previously unseen kinds of pages on the Web. We study techniques for building training corpora automatically from publicly available Web resources, quantify the discrepancy between them, and demonstrate that encouraging agreement between classifiers given such diverse sources drastically outperforms methods that ignore the different natures of data sources on the Web.
Keywords
Internet; classification; data mining; text analysis; Web 2.0 source; Web content classification; text mining; Buildings; Information filtering; Information filters; Information services; Intelligent agent; Labeling; Web pages; Web sites; corpus construction; web 2.0; web classification;
fLanguage
English
Publisher
IEEE
Conference_Titel
Web Intelligence and Intelligent Agent Technology, 2008. WI-IAT '08. IEEE/WIC/ACM International Conference on
Conference_Location
Sydney, NSW
Print_ISBN
978-0-7695-3496-1
Type
conf
DOI
10.1109/WIIAT.2008.291
Filename
4740464