• DocumentCode
    124263
  • Title

    A Clique Based Web Page Classification Corrective Approach

  • Author

    Abdelbadie, Belmouhcine ; Mohammed, Benattou

  • Author_Institution
    Comput. Sci. Dept., Mohammed V-Agdal Univ., Rabat, Morocco
  • Volume
    2
  • fYear
    2014
  • fDate
    11-14 Aug. 2014
  • Firstpage
    467
  • Lastpage
    473
  • Abstract
    Nowadays, the Web is the most relevant data source, and it keeps growing day by day. This overwhelming amount of data makes web page classification crucial. Web pages contain much noisy content that biases textual classifiers and makes them lose focus on the page's main subject. Web pages are related to each other either implicitly, by users' intuitive judgments, or explicitly, by hyperlinks. Using those links to correct the class that a textual classifier assigns to a web page can therefore be beneficial. In this paper, we propose a post-classification corrective approach called Clique Based Correction (CBC) that uses the query-log to build an implicit neighborhood and collectively corrects the classes assigned by a textual classifier to the web pages of that neighborhood. This correction improves the text classifier's results by fixing wrongly assigned categories. When two web pages are linked to each other, they may share the same topic, but when more web pages (three, for example) are all related to each other, the probability that they share the same subject becomes higher. The proposed method operates in four steps. In the first step, it builds a graph, called the implicit graph, whose vertices are web pages and whose edges are implicit links. In the second step, it uses a text classifier to determine the classes of all web pages represented by vertices in the implicit graph. In the third step, it extracts cliques of web pages from the implicit graph. In the fourth step, it assigns a class to every clique using a voting process. Each web page is then labeled with the class of its clique. This adjustment improves the results provided by the text classifier. We conduct our experiments using three classifiers, SVM (Support Vector Machine), NB (Naïve Bayes) and KNN (K Nearest Neighbors), on two subsets of ODP (Open Directory Project). Results show that (1) when applied after SVM, NB or KNN, CBC improves the results, and (2) the number of unrelated web pages must be low in order to obtain a significant improvement.
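The four steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the `min_size` threshold, the tie handling, and the rule that a page follows the largest clique containing it are all assumptions made for illustration, since the abstract does not specify them. The implicit links and the text classifier's predictions (steps one and two) are taken as given inputs, and every page that appears in an edge is assumed to have a predicted class.

```python
from collections import Counter


def maximal_cliques(adj):
    """Yield every maximal clique of an undirected graph (basic Bron-Kerbosch).

    adj maps each vertex to the set of its neighbours.
    """
    def expand(r, p, x):
        if not p and not x:
            yield frozenset(r)
            return
        for v in list(p):
            yield from expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    yield from expand(set(), set(adj), set())


def clique_based_correction(edges, predicted, min_size=3):
    """Relabel pages by majority vote inside cliques of the implicit graph.

    edges     -- iterable of (page, page) implicit links (step 1, given)
    predicted -- dict page -> class from a text classifier (step 2, given)
    min_size  -- smallest clique allowed to vote (assumed value)
    """
    # Build the implicit graph as an adjacency map.
    adj = {page: set() for page in predicted}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)

    corrected = dict(predicted)
    best_size = {page: 0 for page in predicted}

    # Step 3: extract cliques. Step 4: vote and relabel.
    for clique in maximal_cliques(adj):
        if len(clique) < min_size:
            continue
        votes = Counter(predicted[p] for p in clique).most_common()
        if len(votes) > 1 and votes[0][1] == votes[1][1]:
            continue  # assumed policy: a tied vote changes nothing
        winner = votes[0][0]
        for page in clique:
            # Assumed policy: a page follows the largest clique it belongs to.
            if len(clique) > best_size[page]:
                best_size[page] = len(clique)
                corrected[page] = winner
    return corrected
```

For example, with implicit links a-b, a-c, b-c, c-d and d-e, and predictions a: sports, b: sports, c: arts, d: arts, e: science, only the triangle {a, b, c} is large enough to vote, so page c is relabeled as sports while d and e keep their original classes.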
  • Keywords
    Bayes methods; Internet; classification; graph theory; information retrieval; support vector machines; K nearest neighbor; KNN; Naïve Bayes; ODP; SVM; Web page classification corrective approach; World Wide Web; clique based correction; implicit graph; open directory project; query-log; support vector machine; textual classifier; Art; Classification algorithms; Computers; Niobium; Support vector machines; Text categorization; Web pages; correction; maximum clique; query-log; semantic web; web page classification;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Web Intelligence (WI) and Intelligent Agent Technologies (IAT), 2014 IEEE/WIC/ACM International Joint Conferences on
  • Conference_Location
    Warsaw
  • Type

    conf

  • DOI
    10.1109/WI-IAT.2014.135
  • Filename
    6927662