DocumentCode :
3578606
Title :
A method for automated document classification using Wikipedia-derived weighted keywords
Author :
Biuk-Aghai, Robert P. ; Ka Kit Ng
Author_Institution :
Dept. of Comput. & Inf. Sci., Univ. of Macau, Macau, China
fYear :
2014
Firstpage :
1
Lastpage :
6
Abstract :
The pace of knowledge creation, such as in academic research, has accelerated rapidly in recent years, resulting in ever more new research publications. This has made it difficult to keep abreast of new developments, or to know which new publications are relevant to a given research area. We have developed a method for analysing and automatically classifying publications. Our method makes use of the Wikipedia category hierarchy and the content of Wikipedia articles associated with Wikipedia categories. Initially we perform pre-processing and simplification of the Wikipedia category hierarchy, resulting in a rooted directed graph. Wikipedia articles are then analysed, and a set of keywords per Wikipedia category is extracted using a modified tf-idf (term frequency-inverse document frequency) model proposed in this paper. To classify a given input document, tf-idf weights are used to extract relevant keywords from the document, which are then matched to the keywords previously extracted from Wikipedia. The closest matching top-level categories are identified from all categories containing the document's keywords. A cosine similarity metric is then applied to select the closest matching sub-category, recursing down the category hierarchy until the best matching categories are identified. The final result is a set of categories matching the input document, each with a matching percentage. This result can be used to identify new documents that are relevant to a specific research area, or to classify a whole set of documents into different topic areas, with sub-topics, main keywords, and associated weights. We present an experimental study using data from English Wikipedia.
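The pipeline described in the abstract (tf-idf keyword weights, cosine similarity, and a recursive descent through the category hierarchy) can be illustrated with a minimal Python sketch. It is not the authors' implementation: it uses the standard tf-idf formula rather than the modified model the paper proposes, it applies cosine similarity at every level including the top one, and the function names (`tfidf`, `cosine`, `classify`), the toy category tree, and all keyword weights are hypothetical values chosen only for illustration.

```python
import math
from collections import Counter


def tfidf(docs):
    """Standard tf-idf (NOT the paper's modified model).

    docs: list of token lists. Returns one {term: weight} dict per document.
    """
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append(
            {t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()}
        )
    return weights


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(doc_vec, root, tree, keywords):
    """Walk down the hierarchy, choosing the most similar child at each level.

    tree: {category: [child categories]}; keywords: {category: {term: weight}}.
    Returns the chosen path as a list of (category, similarity) pairs.
    """
    path = []
    node = root
    while tree.get(node):
        node = max(tree[node], key=lambda c: cosine(doc_vec, keywords[c]))
        path.append((node, cosine(doc_vec, keywords[node])))
    return path


# Hypothetical toy hierarchy and keyword weights (assumptions, not paper data).
tree = {"Root": ["Science", "Arts"], "Science": ["Physics", "Biology"]}
keywords = {
    "Science": {"quantum": 0.5, "cell": 0.5, "energy": 0.4},
    "Arts": {"painting": 0.9, "music": 0.6},
    "Physics": {"quantum": 0.9, "energy": 0.7},
    "Biology": {"cell": 0.9, "gene": 0.7},
}
doc_vec = {"quantum": 0.8, "energy": 0.5}

for category, score in classify(doc_vec, "Root", tree, keywords):
    print(f"{category}: {score:.0%}")
```

Sparse dictionaries are used for the vectors because each category or document carries only a handful of weighted keywords; the similarity printed per level plays the role of the matching percentage mentioned in the abstract.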
Keywords :
Web sites; classification; data mining; directed graphs; document handling; Wikipedia category hierarchy; Wikipedia-derived weighted keyword; academic research; automated document classification; cosine similarity metric; knowledge creation; publication classification; rooted directed graph; term frequency-inverse document frequency; Business; Electronic publishing; Encyclopedias; History; Internet; Vectors;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Data and Software Engineering (ICODSE), 2014 International Conference on
Print_ISBN :
978-1-4799-8175-5
Type :
conf
DOI :
10.1109/ICODSE.2014.7062484
Filename :
7062484