DocumentCode
3229409
Title
Identifying Document Topics Using the Wikipedia Category Network
Author
Schonhofen, Peter
Author_Institution
Comput. & Autom. Res. Inst., Hungarian Acad. of Sci., Budapest
fYear
2006
fDate
18-22 Dec. 2006
Firstpage
456
Lastpage
462
Abstract
The size and coverage of Wikipedia, a freely available online encyclopedia, have reached the point where it can be used much like an ontology or taxonomy to identify the topics discussed in a document. In this paper we show that even a simple algorithm that exploits only the titles and categories of Wikipedia articles can characterize documents by Wikipedia categories surprisingly well. We test the reliability of our method by predicting the categories of Wikipedia articles themselves based on their bodies, and by performing classification and clustering on 20 Newsgroups and RCV1, representing documents by their Wikipedia categories instead of their texts.
Keywords
Web sites; document handling; encyclopaedias; ontologies (artificial intelligence); pattern classification; pattern clustering; Wikipedia category network; document topics; newsgroups; online encyclopedia; ontology; Automation; Clustering algorithms; Computer networks; Content based retrieval; Encyclopedias; Information retrieval; Ontologies; Taxonomy; Testing; Wikipedia
fLanguage
English
Publisher
ieee
Conference_Titel
Web Intelligence, 2006. WI 2006. IEEE/WIC/ACM International Conference on
Conference_Location
Hong Kong
Print_ISBN
0-7695-2747-7
Type
conf
DOI
10.1109/WI.2006.92
Filename
4061411