Title :
Mining Wikipedia Knowledge to improve document indexing and classification
Author :
Ayyasamy, Ramesh Kumar ; Tahayna, Bashar ; Alhashmi, Saadat ; Eu-Gene, Siew ; Egerton, Simon
Author_Institution :
Sch. of IT, Monash Univ., Clayton, VIC, Australia
Abstract :
Weblogs are an important source of information, and automatic techniques are needed to categorize them into “topic-based” content to facilitate future browsing and retrieval. In this paper we propose new tf.idf measures and illustrate their effectiveness. The proposed Conf.idf and Catf.idf measures are based solely on a terms-to-concepts-to-categories (TCONCAT) mapping method that utilizes Wikipedia. Wikipedia serves as the knowledge base: a large-scale Web encyclopaedia with a huge number of high-quality articles and categorical indexes. Using this knowledge base, our proposed framework solves the weblog classification problem in two stages. The first stage finds the terms that belong to a unique concept (article) and disambiguates the terms that belong to more than one concept. The second stage determines the categories to which the found concepts belong. Experimental results confirm that the proposed system can efficiently distinguish weblogs that belong to more than one category and that it performs better than traditional statistical Natural Language Processing (NLP) approaches.
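The two-stage mapping described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the dictionaries `TERM_TO_CONCEPTS` and `CONCEPT_TO_CATEGORIES` are hypothetical toy stand-ins for Wikipedia's article and category indexes, the context-overlap disambiguation is an assumed heuristic, and the "frequency times idf" weighting is only an assumed reading of a Conf.idf or Catf.idf style measure, since the abstract does not give the formulas.

```python
import math
from collections import Counter

# Hypothetical toy stand-ins for Wikipedia's article and category indexes.
TERM_TO_CONCEPTS = {
    "python": ["Python (programming language)", "Pythonidae"],
    "compiler": ["Compiler"],
    "snake": ["Snake"],
    "venom": ["Venom"],
}
CONCEPT_TO_CATEGORIES = {
    "Python (programming language)": ["Computing"],
    "Compiler": ["Computing"],
    "Pythonidae": ["Animals"],
    "Snake": ["Animals"],
    "Venom": ["Animals"],
}


def map_terms_to_concepts(terms):
    """Stage 1: map terms to concepts and disambiguate ambiguous terms."""
    known = [t for t in terms if t in TERM_TO_CONCEPTS]
    # Categories supported by terms that map to exactly one concept.
    context = Counter()
    for term in known:
        candidates = TERM_TO_CONCEPTS[term]
        if len(candidates) == 1:
            context.update(CONCEPT_TO_CATEGORIES[candidates[0]])
    concepts = []
    for term in known:
        # Assumed heuristic: keep the candidate whose categories
        # overlap most with the unambiguous context.
        best = max(
            TERM_TO_CONCEPTS[term],
            key=lambda c: sum(context[cat] for cat in CONCEPT_TO_CATEGORIES[c]),
        )
        concepts.append(best)
    return concepts


def map_concepts_to_categories(concepts):
    """Stage 2: count the categories to which the found concepts belong."""
    counts = Counter()
    for concept in concepts:
        counts.update(CONCEPT_TO_CATEGORIES[concept])
    return counts


def frequency_idf(doc_counts, corpus_counts):
    """Assumed weighting: item frequency in the document times log(N / df)."""
    n_docs = len(corpus_counts)
    weights = {}
    for item, freq in doc_counts.items():
        df = sum(1 for counts in corpus_counts if item in counts)
        # Guard for items absent from the corpus (treated as df = 1).
        weights[item] = freq * math.log(n_docs / max(df, 1))
    return weights


# Usage: two toy weblog posts, already tokenized into terms.
posts = [["python", "compiler"], ["python", "snake", "venom"]]
concept_counts = [Counter(map_terms_to_concepts(p)) for p in posts]
category_counts = [map_concepts_to_categories(list(c.elements())) for c in concept_counts]
concept_weights = [frequency_idf(c, concept_counts) for c in concept_counts]
category_weights = [frequency_idf(c, category_counts) for c in category_counts]
```

Applying the same weighting function to concept counts and to category counts mirrors the pairing of the two measures in the abstract; a post whose concepts fall under several categories would simply receive non-zero weights for each of them.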
Keywords :
Internet; Web sites; data mining; document handling; indexing; information retrieval; pattern classification; TCONCAT; Web encyclopaedia; document classification improvement; document indexing improvement; mining Wikipedia knowledge; statistical natural language processing; terms-to-concepts-to-categories; weblog classification problem; Biological system modeling; Blogs; Educational institutions; Indexing; Kernel; Knowledge based systems; Organizations; Concepts; N-grams; Text Classification; Wikipedia;
Conference_Title :
Information Sciences Signal Processing and their Applications (ISSPA), 2010 10th International Conference on
Conference_Location :
Kuala Lumpur
Print_ISBN :
978-1-4244-7165-2
DOI :
10.1109/ISSPA.2010.5605508