Title :
Learning text classifier using the domain concept hierarchy
Author :
Wang, Bill B. ; McKay, Ri Bob ; Abbass, Hussein A. ; Barlow, Michael
Author_Institution :
Sch. of Comput. Sci., Univ. of New South Wales, Canberra, ACT, Australia
fDate :
29 June-1 July 2002
Abstract :
Automatic text categorization is an important component of many information organization and management tasks. Research has shown that similarity-based categorization algorithms such as K-nearest neighbour (KNN) are effective for document categorization. These algorithms use index terms to represent documents. However, they suffer from some drawbacks. One major drawback is that they tend to use all features when computing similarities, which means that they must search a high-dimensional space. Another is that they tend to require a very large training document set so that all terms important for identifying the content of documents are covered. To overcome these drawbacks, we present a novel method that searches the hierarchical structure of a domain ontology for the optimal representation of the concepts behind the taxonomic standard for the pre-defined categories. Experiments show that this is a feasible way to reduce the dimensionality of the document vector space effectively and reasonably, and consequently to improve the generalisation power of the derived classifier. The result is a classification method that is both far less costly in computational terms and considerably more accurate than comparable methods.
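The idea in the abstract above can be illustrated with a minimal sketch: index terms are replaced by their ancestors in a concept hierarchy, which shrinks the vector space, and a KNN classifier then compares documents by cosine similarity. Everything in it is an assumption made for illustration: the toy hierarchy `PARENT`, the helper names, and above all the fixed `depth` parameter, which stands in for the search for an optimal representation that the paper describes but this record does not detail.

```python
from collections import Counter
import math

# Hypothetical toy hierarchy: child -> parent (None marks the root).
PARENT = {
    "entity": None,
    "animal": "entity",
    "vehicle": "entity",
    "dog": "animal",
    "cat": "animal",
    "car": "vehicle",
    "truck": "vehicle",
}

def path_to_root(term):
    """Return [term, parent, ..., root] for a term present in PARENT."""
    path = [term]
    while PARENT[path[-1]] is not None:
        path.append(PARENT[path[-1]])
    return path

def generalise(term, depth):
    """Map a term to its ancestor at the given depth (root = 0).

    Terms outside the hierarchy, or already at or above that depth,
    are returned unchanged.
    """
    if term not in PARENT:
        return term
    path = path_to_root(term)[::-1]  # root first
    return path[min(depth, len(path) - 1)]

def concept_vector(tokens, depth):
    """Bag-of-concepts vector: counts of generalised terms."""
    return Counter(generalise(t, depth) for t in tokens)

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def knn_classify(tokens, training, k=3, depth=1):
    """Majority vote among the k training documents most similar to tokens."""
    query = concept_vector(tokens, depth)
    scored = sorted(
        ((cosine(query, concept_vector(doc, depth)), label)
         for doc, label in training),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]

# Illustrative data only, not taken from the paper.
TRAINING = [
    (["dog", "dog"], "pets"),
    (["car"], "transport"),
    (["truck", "car"], "transport"),
]
```

With this toy data, a query such as `["cat"]` shares no raw term with any training document, so term-level similarity (`depth=2`) is zero everywhere; at `depth=1` both `cat` and `dog` map to `animal`, and `knn_classify(["cat"], TRAINING, k=1, depth=1)` can match the `pets` document. This is the sense in which a more general representation may need fewer training terms and fewer dimensions.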
Keywords :
classification; indexing; learning (artificial intelligence); search problems; text analysis; vocabulary; K-nearest neighbour algorithms; KNN document categorization; automatic text categorization; classifier generalisation power; document vector space dimensionality reduction; domain concept hierarchy learning text classifiers; domain ontology hierarchical structures; heuristic search algorithms; high-dimensional space search; index terms; information organization; optimal concept representations; pre-defined category taxonomic standards; semantics; similarity based categorization algorithms; training document sets; Computer science; Decision trees; Educational institutions; Information management; Internet; Natural languages; Neural networks; Ontologies; Software libraries; Text categorization;
Conference_Title :
Communications, Circuits and Systems and West Sino Expositions, IEEE 2002 International Conference on
Print_ISBN :
0-7803-7547-5
DOI :
10.1109/ICCCAS.2002.1179005