DocumentCode :
3625602
Title :
Training the Genre Classifier for Automatic Classification of Web Pages
Author :
Vedrana Vidulin;Mitja Lustrek;Matjaz Gams
Author_Institution :
Jožef Stefan Institute, Jamova 39, SI-1000 Ljubljana. vedrana.vidulin@ijs.si
fYear :
2007
fDate :
6/1/2007 12:00:00 AM
Firstpage :
93
Lastpage :
98
Abstract :
This paper presents experiments on classifying web pages by genre. Firstly, a corpus of 1539 manually labeled web pages was prepared. Secondly, 502 genre features were selected based on the literature and the observation of the corpus. Thirdly, these features were extracted from the corpus to obtain a data set. Finally, two machine learning algorithms, one for induction of decision trees (J48) and one ensemble algorithm (bagging), were trained and tested on the data set. The ensemble algorithm achieved on average 17% better precision and 1.6% better accuracy, but slightly worse recall; F-measure did not vary significantly. The results indicate that classification by genre could be a useful addition to search engines.
Keywords :
"Web pages","Machine learning algorithms","Search engines","Internet","Feature extraction","Data mining","Decision trees","Bagging","Testing","Africa"
Publisher :
IEEE
Conference_Titel :
Information Technology Interfaces, 2007. ITI 2007. 29th International Conference on
ISSN :
1330-1012
Print_ISBN :
953-7138-09-7
Type :
conf
DOI :
10.1109/ITI.2007.4283750
Filename :
4283750