DocumentCode
589237
Title
A Multi-label and Adaptive Genre Classification of Web Pages
Author
Jebari, Chaker ; Wani, M. Arif
Author_Institution
Comput. Sci. Dept., Fac. of Sci. of Tunis, Tunis, Tunisia
Volume
1
fYear
2012
fDate
12-15 Dec. 2012
Firstpage
578
Lastpage
581
Abstract
This paper proposes a new centroid-based approach to classifying web pages by genre using character n-grams extracted from different information sources such as the URL, title, headings and anchors. To deal with the complexity of web pages and the rapid evolution of web genres, our approach implements a multi-label and adaptive classification scheme in which web pages are classified one by one and each web page can be assigned to more than one genre. Depending on the similarity between the new page and each genre centroid, our approach either adapts the genre centroid under consideration or treats the new page as a noise page and discards it. The experimental results show that our approach is very fast and achieves better results than existing multi-label classifiers.
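The scheme described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the n-gram length, the similarity measure, the decision threshold, or the centroid update rule, so the trigram length, cosine similarity, the 0.3 threshold, and the running-mean update below are all assumptions, and the names `char_ngrams`, `GenreCentroid`, and `classify_and_adapt` are hypothetical. The page text is assumed to be the concatenation of the URL, title, headings, and anchor text.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Count overlapping character n-grams (n=3 is an assumed value)."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse n-gram vectors (assumed measure)."""
    dot = sum(w * b.get(g, 0.0) for g, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class GenreCentroid:
    """Mean n-gram vector of the pages seen so far for one genre."""

    def __init__(self, seed_texts, n=3):
        self.count = 0
        self.weights = {}
        for text in seed_texts:
            self.adapt(char_ngrams(text, n))

    def adapt(self, vec):
        # Running-mean update (an assumed adaptation rule).
        self.count += 1
        for gram in set(self.weights) | set(vec):
            old = self.weights.get(gram, 0.0)
            self.weights[gram] = old + (vec.get(gram, 0) - old) / self.count


def classify_and_adapt(page_text, centroids, threshold=0.3, n=3):
    """Return every genre whose centroid is similar enough to the page.

    Matching centroids are adapted with the page; an empty result means
    the page was treated as noise and no centroid was changed.
    """
    vec = char_ngrams(page_text, n)
    labels = []
    for genre, centroid in centroids.items():
        if cosine(vec, centroid.weights) >= threshold:
            labels.append(genre)
            centroid.adapt(vec)
    return labels
```

Because each centroid is tested independently against the threshold, a page can receive several genre labels at once, and pages are processed one at a time so the centroids evolve with the stream.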
Keywords
Web sites; classification; information retrieval; URL; Web genre; Web page classification; adaptive genre classification scheme; anchors; character ngram; genre centroid; headings; information source extraction; multilabel genre classification scheme; noise page; title; Classification algorithms; Complexity theory; Data mining; Search engines; Training; Vectors; Web pages; Multi-label; adaptive; centroid; classification; genre
fLanguage
English
Publisher
IEEE
Conference_Titel
Machine Learning and Applications (ICMLA), 2012 11th International Conference on
Conference_Location
Boca Raton, FL
Print_ISBN
978-1-4673-4651-1
Type
conf
DOI
10.1109/ICMLA.2012.106
Filename
6406627