Title :
Two-phase Web site classification based on hidden Markov tree models
Author :
Tian, YongHong ; Huang, TieJun ; Gao, Wen ; Cheng, Jun ; Kang, PingBo
Author_Institution :
Digital Media Lab., Chinese Acad. of Sci., Beijing, China
Abstract :
With the exponential growth of both the amount and diversity of the information that the Web encompasses, automatic classification of topic-specific Web sites is highly desirable. We propose a novel approach to Web site classification based on the content, structure and context information of Web sites. In our approach, the site structure is represented as a two-layered tree in which each page is modeled as a DOM (document object model) tree and a site tree is used to hierarchically link all pages within the site. Two context models are presented to capture the topic dependences in the site. The hidden Markov tree (HMT) model is then used as the statistical model of both the site tree and the DOM tree, and an HMT-based classifier is presented for their classification. Moreover, to reduce the download size of Web sites while maintaining high classification accuracy, an entropy-based approach is introduced to dynamically prune the site trees. On this basis, we employ a two-phase classification system that classifies Web sites through a fine-to-coarse recursion. Experiments show that our approach offers high accuracy and efficient processing.
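Illustrative sketch (not taken from the paper): the abstract mentions entropy-based dynamic pruning of the site tree but does not state the exact criterion. The Python fragment below shows one plausible reading, in which the pages below a node are fetched only while the class posterior at that node is still uncertain. The names SiteNode, entropy and pages_to_download, the threshold parameter, and the direction of the test are all assumptions made for illustration; the posteriors are supplied by hand here rather than computed by an HMT classifier.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SiteNode:
    """Hypothetical site-tree node: a page plus its class posterior."""
    url: str
    posterior: Dict[str, float]            # class label -> probability
    children: List["SiteNode"] = field(default_factory=list)


def entropy(posterior: Dict[str, float]) -> float:
    """Shannon entropy (in bits) of a class posterior."""
    return -sum(p * math.log2(p) for p in posterior.values() if p > 0)


def pages_to_download(node: SiteNode, threshold: float) -> List[str]:
    """Pre-order list of URLs kept after pruning.

    Assumed rule: descend into a node's children only if the entropy of
    the node's posterior is at least `threshold`; otherwise the subtree
    is pruned and its pages are never downloaded.
    """
    kept = [node.url]
    if entropy(node.posterior) >= threshold:
        for child in node.children:
            kept.extend(pages_to_download(child, threshold))
    return kept
```

Under this reading, a lower threshold downloads more of the site and a higher threshold prunes more aggressively; the paper's actual trade-off between download size and accuracy may be defined differently.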
Keywords :
Web sites; decision trees; hidden Markov models; pattern classification; DOM tree; Web site classification; context information; document object model; entropy-based approach; hidden Markov tree model; statistical model; topic dependencies; Bayesian methods; Classification algorithms; Classification tree analysis; Computer science; Context modeling; Electronic mail; Hidden Markov models; Support vector machines; Tree graphs; Web pages
Conference_Title :
Web Intelligence, 2003. WI 2003. Proceedings. IEEE/WIC International Conference on
Print_ISBN :
0-7695-1932-6
DOI :
10.1109/WI.2003.1241198