Identifying Document Topics Using the Wikipedia Category Network

Author

Schonhofen, Peter

Author_Institution

Comput. & Autom. Res. Inst., Hungarian Acad. of Sci., Budapest

fYear

2006

fDate

18-22 Dec. 2006

Firstpage

456

Lastpage

462

Abstract

The size and coverage of Wikipedia, a freely available online encyclopedia, have reached the point where it can be used much like an ontology or taxonomy to identify the topics discussed in a document. In this paper we show that even a simple algorithm that exploits only the titles and categories of Wikipedia articles can characterize documents by Wikipedia categories surprisingly well. We test the reliability of our method by predicting the categories of Wikipedia articles themselves based on their bodies, and by performing classification and clustering on 20 Newsgroups and RCV1, representing documents by their Wikipedia categories instead of their texts.

Keywords

Web sites; document handling; encyclopaedias; ontologies (artificial intelligence); pattern classification; pattern clustering; Wikipedia category network; document topics; newsgroups; online encyclopedia; ontology; Automation; Clustering algorithms; Computer networks; Content based retrieval; Encyclopedias; Information retrieval; Ontologies; Taxonomy; Testing; Wikipedia

fLanguage

English

Publisher

IEEE

Conference_Titel

Web Intelligence, 2006. WI 2006. IEEE/WIC/ACM International Conference on

Conference_Location

Hong Kong

Print_ISBN

0-7695-2747-7

Type

conf

DOI

10.1109/WI.2006.92

Filename

4061411

Link To Document

https://search.isc.ac/dl/search/defaultta.aspx?DTC=49&DC=3229409