Title :
The Role of URLs in Objectionable Web Content Categorization
Author :
Zhang, Jianping ; Qin, Jason ; Yan, Qiuming
Author_Institution :
AOL Inc., Dulles, VA
Abstract :
By analyzing a set of access attempts by teenagers to pornographic Web sites, we found that more than half of them were image searches and visits to Web sites with little text information. Textual content-based filters cannot correctly categorize such access attempts. This paper describes a novel URL-based objectionable content categorization approach and its application to Web filtering. In this approach, we break the URL into a sequence of n-grams over a range of values of n, and then apply a machine learning algorithm to the n-gram representation of URLs to learn a classifier of pornographic Web sites. We show empirically that the URL-based approach is able to correctly identify many of the objectionable Web pages. We also demonstrate that the best Web filtering results in a production environment were achieved when the URL-based approach was used together with a content-based approach.
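The n-gram step described in the abstract can be illustrated with a minimal sketch. The abstract does not say which learning algorithm, which range of n, or which URL preprocessing the authors used, so everything below is an assumption made for illustration: character n-grams with n from 3 to 5, stripping of the URL scheme, a multinomial Naive Bayes classifier with add-one smoothing, and the hypothetical names `url_ngrams` and `NaiveBayesUrlClassifier`. The training URLs are invented placeholders under the reserved `.example` domain, not data from the paper.

```python
import math
from collections import Counter


def url_ngrams(url, n_min=3, n_max=5):
    """Return all character n-grams of a URL for n in [n_min, n_max].

    The range 3..5 and the scheme stripping are assumptions of this sketch.
    """
    text = url.lower()
    for scheme in ("http://", "https://"):
        if text.startswith(scheme):
            text = text[len(scheme):]
            break
    return [text[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)]


class NaiveBayesUrlClassifier:
    """Multinomial Naive Bayes over URL n-gram counts (add-one smoothing).

    A stand-in for the unspecified "machine learning algorithm".
    """

    def fit(self, urls, labels):
        self.counts = {}             # label -> Counter of n-gram counts
        self.docs = Counter(labels)  # label -> number of training URLs
        for url, label in zip(urls, labels):
            self.counts.setdefault(label, Counter()).update(url_ngrams(url))
        self.vocab = set().union(*self.counts.values())
        return self

    def predict(self, url):
        total_docs = sum(self.docs.values())
        best_label, best_score = None, -math.inf
        for label, grams in self.counts.items():
            denom = sum(grams.values()) + len(self.vocab)
            score = math.log(self.docs[label] / total_docs)
            for gram in url_ngrams(url):
                if gram in self.vocab:  # skip n-grams never seen in training
                    score += math.log((grams[gram] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


if __name__ == "__main__":
    # Hypothetical toy training set; real training needs many labeled URLs.
    train_urls = [
        "http://adult-pics.example/gallery",
        "http://xxx-video.example/clips",
        "http://news.example/world",
        "http://school.example/homework",
    ]
    train_labels = ["objectionable", "objectionable", "benign", "benign"]
    model = NaiveBayesUrlClassifier().fit(train_urls, train_labels)
    print(model.predict("http://adult-video.example/pics"))
```

Because the features come only from the URL string, a classifier of this kind can score an image search or a page with little text, which is the case the abstract says textual content-based filters miss.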
Keywords :
Web sites; information filtering; learning (artificial intelligence); URL-based objectionable Web content categorization; Web filtering; image search; machine learning algorithm; n-gram sequence; pornographic Web sites; textual content-based filter; Humans; Image analysis; Information analysis; Information filtering; Information filters; Internet; Machine learning algorithms; Production; Uniform resource locators; Web pages;
Conference_Title :
Web Intelligence, 2006. WI 2006. IEEE/WIC/ACM International Conference on
Conference_Location :
Hong Kong
Print_ISBN :
0-7695-2747-7
DOI :
10.1109/WI.2006.170