• DocumentCode
    3229409
  • Title
    Identifying Document Topics Using the Wikipedia Category Network
  • Author
    Schonhofen, Peter
  • Author_Institution
    Comput. & Autom. Res. Inst., Hungarian Acad. of Sci., Budapest
  • fYear
    2006
  • fDate
    18-22 Dec. 2006
  • Firstpage
    456
  • Lastpage
    462
  • Abstract
    The size and coverage of Wikipedia, a freely available online encyclopedia, have reached the point where it can be used much like an ontology or taxonomy to identify the topics discussed in a document. In this paper we show that even a simple algorithm that exploits only the titles and categories of Wikipedia articles can characterize documents by Wikipedia categories surprisingly well. We test the reliability of our method by predicting the categories of Wikipedia articles themselves from their bodies, and by performing classification and clustering on 20 Newsgroups and RCV1, representing documents by their Wikipedia categories instead of their texts.
  • Keywords
    Web sites; document handling; encyclopaedias; ontologies (artificial intelligence); pattern classification; pattern clustering; Wikipedia category network; document topics; newsgroups; online encyclopedia; ontology; Automation; Clustering algorithms; Computer networks; Content based retrieval; Encyclopedias; Information retrieval; Ontologies; Taxonomy; Testing; Wikipedia;
  • fLanguage
    English
  • Publisher
    ieee
  • Conference_Titel
    Web Intelligence, 2006. WI 2006. IEEE/WIC/ACM International Conference on
  • Conference_Location
    Hong Kong
  • Print_ISBN
    0-7695-2747-7
  • Type
    conf
  • DOI
    10.1109/WI.2006.92
  • Filename
    4061411