DocumentCode :
1624513
Title :
Web page classification using n-gram based URL features
Author :
Rajalakshmi, R. ; Aravindan, Chandrabose
Author_Institution :
SSN Coll. of Eng., Chennai, India
fYear :
2013
Firstpage :
15
Lastpage :
21
Abstract :
The exponential growth in the number of web pages on the World Wide Web poses a great challenge for information filtering and makes topic-focused crawling a time-consuming way to search for relevant information. We propose a URL-based web page classification method that needs neither the web page content nor its link structure. In the proposed approach, character n-gram based features are extracted from URLs alone, and classification is done by Support Vector Machines and Maximum Entropy Classifiers. The performance of the system was evaluated on two benchmark datasets, namely ODP with 2 million URLs and WebKB with 4K URLs. Using F1 as the performance metric, our experimental results showed an improvement of 20.5% on the WebKB dataset and 4.7% on the ODP dataset.
Keywords :
Internet; pattern classification; support vector machines; F1 metric; ODP dataset; URL features; Web page classification; WebKB dataset; character n-gram based features; information filtering; information search; maximum entropy classifiers; topic focused crawling; Art; Business; Classification algorithms; Computers; Information filters; Machine Learning; Maximum Entropy Classifier; Support Vector Machine; n-gram
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Advanced Computing (ICoAC), 2013 Fifth International Conference on
Conference_Location :
Chennai
Print_ISBN :
978-1-4799-3447-8
Type :
conf
DOI :
10.1109/ICoAC.2013.6921920
Filename :
6921920
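The abstract describes extracting character n-gram features from URLs alone. A minimal sketch of that idea follows. The function names, the lowercasing, the removal of the scheme prefix, and the choice of n values are all assumptions made for illustration; the abstract does not state the paper's actual preprocessing or n-gram lengths.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return all overlapping character n-grams of text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def url_ngram_counts(url, n_values=(3, 4, 5)):
    """Count character n-grams in a URL.

    Hypothetical preprocessing: the URL is lowercased and a leading
    http:// or https:// is dropped. The abstract does not specify this.
    """
    cleaned = url.lower()
    for prefix in ("http://", "https://"):
        if cleaned.startswith(prefix):
            cleaned = cleaned[len(prefix):]
            break
    counts = Counter()
    for n in n_values:
        counts.update(char_ngrams(cleaned, n))
    return counts


# Example with a made-up URL, using trigrams only.
features = url_ngram_counts("http://www.cs.example.edu/courses/", n_values=(3,))
```

Count dictionaries like `features` could then be turned into sparse vectors and passed to a classifier such as a linear SVM. That step is omitted here because the record gives no detail on how the classifiers were configured.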