Title :
UPCA: An efficient URL-Pattern based algorithm for accurate web page classification
Author :
Yiming Yang; Lei Zhang; Guiquan Liu; Enhong Chen
Author_Institution :
University of Science and Technology of China, Anhui, Hefei 230000, China
Abstract :
With the explosive growth of Web pages on the Internet and mobile Internet, it is quite challenging for Web search engines to provide users with desirable results from a large amount of data. One of the important problems in improving search engine service is Web page classification. Current approaches to this problem usually first extract features from Web pages and then train a model with traditional machine learning methods. However, these methods are usually time-consuming and do not take incremental learning into consideration at all, so they may not be suitable for online applications. Therefore, in this paper we rethink the problem of Web page classification by using only URLs and propose an efficient URL-Pattern based Classification Algorithm (named UPCA). Specifically, given a set of training samples with the same label, we first construct a pattern tree and extract the main patterns from it. We can then classify new Web pages by matching their URLs against the patterns. We also propose an efficient incremental pattern-tree algorithm. Experimental results show that the proposed approach achieves very promising performance in terms of both classification accuracy and computational efficiency.
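The pattern-tree idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' UPCA implementation: the tokenization (host plus path segments), the frequency threshold `min_share`, the `*` wildcard, the prefix-style matching, and every function name below are assumptions chosen only for illustration, and the example URLs are made up.

```python
from urllib.parse import urlsplit

WILDCARD = "*"  # assumed placeholder for "any token at this position"

def tokenize(url):
    """Split a URL into its host followed by its non-empty path segments."""
    parts = urlsplit(url)
    return [parts.netloc] + [seg for seg in parts.path.split("/") if seg]

def build_tree(urls):
    """Build a token trie; each node counts the URLs that pass through it."""
    root = {"count": 0, "children": {}}
    for url in urls:
        node = root
        node["count"] += 1
        for token in tokenize(url):
            node = node["children"].setdefault(token, {"count": 0, "children": {}})
            node["count"] += 1
    return root

def extract_patterns(node, min_share=0.5, prefix=()):
    """Keep frequent tokens as literals; collapse rare siblings into one wildcard."""
    if not node["children"]:
        return [prefix]
    patterns = []
    has_rare_child = False
    for token, child in node["children"].items():
        if child["count"] / node["count"] >= min_share:
            patterns += extract_patterns(child, min_share, prefix + (token,))
        else:
            has_rare_child = True
    if has_rare_child:
        patterns.append(prefix + (WILDCARD,))
    return patterns

def matches(pattern, url):
    """Prefix match: every pattern token must equal the URL token or be a wildcard."""
    tokens = tokenize(url)
    if len(tokens) < len(pattern):
        return False
    return all(p == WILDCARD or p == t for p, t in zip(pattern, tokens))

# Hypothetical training URLs that all share one label (for example "news").
training_urls = [
    "http://example.com/news/2015/a.html",
    "http://example.com/news/2015/b.html",
    "http://example.com/news/2015/c.html",
]
tree = build_tree(training_urls)
patterns = extract_patterns(tree, min_share=0.5)
print(patterns)  # [('example.com', 'news', '2015', '*')]
```

Under these assumptions, a new URL would receive the label of the first pattern set it matches. Because each node only stores counts, new training URLs could be added by updating counts along one path, which hints at why an incremental variant is plausible, although the paper's actual incremental algorithm is not specified in the abstract.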
Keywords :
"Uniform resource locators","Web pages","Feature extraction","Training","Training data","Entropy","Partitioning algorithms"
Conference_Title :
Fuzzy Systems and Knowledge Discovery (FSKD), 2015 12th International Conference on
DOI :
10.1109/FSKD.2015.7382162