Title :
Block Classification of a Web Page by Using a Combination of Multiple Classifiers
Author :
Kang, Jinbeom ; Choi, Joongmin
Author_Institution :
Dept. of Comput. Sci. & Eng., Hanyang Univ., Ansan
Abstract :
Web mining over the varied data on the World Wide Web has recently become an active research area. Because Web pages are generally semi-structured, informative blocks are difficult to identify, so content detection techniques that remove unnecessary data (e.g., advertisements) from Web pages have become important. A Web page generally consists of many blocks containing various data and structural information. In this paper, we propose a method that classifies each block of a Web page into an appropriate category by building a Tree Alignment model that represents the HTML structure and a Vector model that represents the features of the blocks. Web sites normally have their own templates, and blocks may belong to different categories even when they occupy the same position in the Web browser or are structurally similar. Hence, it is difficult to classify blocks accurately with a single classifier. To solve this problem, our approach builds multiple classifiers, one for each training domain, and classifies blocks by combining them.
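The idea of training one classifier per domain and combining their outputs can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the block features, the per-domain learner, or the combination rule, so the three-element feature vectors, the nearest-centroid `DomainClassifier`, and the score averaging in `classify_block` are all assumptions chosen for illustration. Only the Vector model side is sketched; the Tree Alignment model is left out.

```python
from collections import defaultdict
from math import sqrt


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class DomainClassifier:
    """Hypothetical nearest-centroid classifier trained on one domain's blocks."""

    def __init__(self, samples):
        # samples: list of (feature_vector, category) pairs from one domain
        grouped = defaultdict(list)
        for vec, cat in samples:
            grouped[cat].append(vec)
        # One centroid per category: the element-wise mean of its vectors
        self.centroids = {
            cat: [sum(col) / len(vecs) for col in zip(*vecs)]
            for cat, vecs in grouped.items()
        }

    def scores(self, vec):
        """Return a similarity score per known category for one block."""
        return {cat: cosine(vec, c) for cat, c in self.centroids.items()}


def classify_block(vec, classifiers):
    """Average the per-domain scores (an assumed rule) and pick the best category.

    A category unknown to a domain contributes a score of zero for that domain.
    """
    totals = defaultdict(float)
    for clf in classifiers:
        for cat, s in clf.scores(vec).items():
            totals[cat] += s / len(classifiers)
    return max(totals, key=totals.get)


# Assumed features: [link density, scaled text length, ad-keyword ratio]
news = DomainClassifier([
    ([0.1, 0.9, 0.0], "content"),
    ([0.8, 0.1, 0.7], "advertisement"),
])
shop = DomainClassifier([
    ([0.2, 0.7, 0.1], "content"),
    ([0.9, 0.2, 0.1], "navigation"),
])

label = classify_block([0.15, 0.8, 0.05], [news, shop])
print(label)
```

Averaging is only one possible combination rule; weighted voting or a learned meta-classifier would fit the same interface, and the abstract does not say which the paper uses.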
Keywords :
Web sites; data mining; hypermedia markup languages; online front-ends; pattern classification; tree data structures; HTML structure; Web browser; Web mining; Web page; Web site; World Wide Web; block classification; tree alignment model; Buildings; Classification tree analysis; Computer networks; Computer science; Data engineering; HTML; Information management; Web pages; combining multiple classifiers; web block classification; web data mining
Conference_Title :
Networked Computing and Advanced Information Management, 2008. NCM '08. Fourth International Conference on
Conference_Location :
Gyeongju
Print_ISBN :
978-0-7695-3322-3
DOI :
10.1109/NCM.2008.170