DocumentCode :
2484775
Title :
Webpage Genre Identification Using Variable-Length Character n-Grams
Author :
Kanaris, Ioannis ; Stamatatos, Efstathios
Author_Institution :
Univ. of the Aegean, Mytilene
Volume :
2
fYear :
2007
fDate :
29-31 Oct. 2007
Firstpage :
3
Lastpage :
10
Abstract :
An important factor for discriminating between Web pages is their genre (e.g., blogs, personal homepages, e-shops, online newspapers). Web page genre identification has great potential in information retrieval, since users of search engines can combine genre-based and traditional topic-based queries to improve the quality of the results. So far, various features have been proposed to quantify the style of Web pages, including word and HTML-tag frequencies. In this paper, we propose a low-level representation for this problem based on character n-grams. Using an existing approach, we produce feature sets of variable-length character n-grams and combine this representation with information about the most frequent HTML tags. Based on two benchmark corpora, we present Web page genre identification experiments and improve the best reported results in both cases.
Keywords :
Web sites; hypermedia markup languages; query processing; HTML; Web page; genre identification; information retrieval; topic-based queries; variable-length character n-grams; Artificial intelligence; Automatic control; Blogs; Data mining; Frequency; Navigation; Robustness; Search engines
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Tools with Artificial Intelligence, 2007. ICTAI 2007. 19th IEEE International Conference on
Conference_Location :
Patras
ISSN :
1082-3409
Print_ISBN :
978-0-7695-3015-4
Type :
conf
DOI :
10.1109/ICTAI.2007.107
Filename :
4410349