DocumentCode :
3740494
Title :
Blog, Forum or Newspaper? Web Genre Detection Using SVMs
Author :
Philipp Berger;Patrick Hennig;Martin Schoenberg;Christoph Meinel
Author_Institution :
Hasso-Plattner-Inst., Univ. of Potsdam, Potsdam, Germany
Volume :
3
fYear :
2015
Firstpage :
64
Lastpage :
68
Abstract :
In recent years, blogs have become a very popular way to publish information, express opinions and hold discussions. Researchers and industry are therefore interested in analyzing the blogosphere. Because blog usage is increasingly diverse, an initial categorization into web genres is a necessary first step before any analysis. In this research, we focus on the distinction between traditional blogs, news portals, forums and miscellaneous websites. In particular, the new distinction between news portals and blogs allows analyses to adapt to the network-specific characteristics of traditional media, which involve high journalistic effort, and of more personal weblogs and their authors. We present a set of 80 features and experiment extensively with possible combinations and SVM parameters to identify the best configuration for categorization into the four web genres. Our experiments show a maximum overall accuracy of 83.5%. This accuracy was reached using a combination of trained n-grams, structural properties (e.g. Twitter links) and quantitative properties such as the text's length and the number of dates.
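The structural and quantitative properties named in the abstract (Twitter links, text length, number of dates) could be extracted roughly as sketched below. This is a minimal illustration using only the Python standard library, not the authors' implementation: the function name `extract_features`, the feature names, and the date pattern are all assumptions, and the record does not list the paper's actual 80 features.

```python
import re
from html.parser import HTMLParser


class _LinkAndTextCollector(HTMLParser):
    """Collects href values of <a> tags and all text nodes of a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)

    def handle_data(self, data):
        # Note: this also picks up text inside <script>/<style> elements.
        self.text_parts.append(data)


# Assumed date formats for illustration: 2015-03-01, 01.03.2015, 3/1/15
DATE_PATTERN = re.compile(
    r"\b\d{4}-\d{2}-\d{2}\b|\b\d{1,2}[./]\d{1,2}[./]\d{2,4}\b"
)


def extract_features(html):
    """Return a small, hypothetical feature dictionary for one web page."""
    parser = _LinkAndTextCollector()
    parser.feed(html)
    parser.close()
    # Normalize whitespace so the length reflects visible text only roughly.
    text = " ".join(" ".join(parser.text_parts).split())
    twitter_links = sum(1 for h in parser.hrefs if "twitter.com" in h.lower())
    return {
        "text_length": len(text),
        "date_count": len(DATE_PATTERN.findall(text)),
        "twitter_link_count": twitter_links,
    }
```

A dictionary like this would be one row of the feature matrix; in a full pipeline it would be combined with n-gram features and passed to an SVM classifier, whose parameters the abstract says were tuned experimentally.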
Keywords :
"Blogs","Portals","Feature extraction","Support vector machines","HTML","Media","Twitter"
Publisher :
IEEE
Conference_Title :
Web Intelligence and Intelligent Agent Technology (WI-IAT), 2015 IEEE / WIC / ACM International Conference on
Type :
conf
DOI :
10.1109/WI-IAT.2015.59
Filename :
7397424