DocumentCode
3169451
Title
Learning text classifier using the domain concept hierarchy
Author
Wang, Bill B. ; McKay, Ri Bob ; Abbass, Hussein A. ; Barlow, Michael
Author_Institution
Sch. of Comput. Sci., Univ. of New South Wales, Canberra, ACT, Australia
Volume
2
fYear
2002
fDate
29 June-1 July 2002
Firstpage
1230
Abstract
Automatic text categorization is an important component of many information organization and management tasks. Research has shown that similarity-based categorization algorithms such as K-nearest neighbour (KNN) are effective for document categorization. These algorithms use index terms to represent documents. However, they have two major drawbacks. First, they tend to use all features when computing similarities, which means they must search a high-dimensional space. Second, they tend to require a very large training document set so that all terms important for identifying the content of documents are covered. To overcome these drawbacks, we present a novel method that searches a domain ontology's hierarchical structure for the optimal representation, one that reflects the concepts of the taxonomic standard for the pre-defined categories. Experiments show that this is a feasible way to reduce the dimensionality of the document vector space effectively, which in turn improves the generalisation power of the derived classifier. The result is a classification method that is significantly less costly in computational terms, yet considerably more accurate than comparable methods.
Keywords
classification; indexing; learning (artificial intelligence); search problems; text analysis; vocabulary; K-nearest neighbour algorithms; KNN document categorization; automatic text categorization; classifier generalisation power; document vector space dimensionality reduction; domain concept hierarchy learning text classifiers; domain ontology hierarchical structures; heuristic search algorithms; high-dimensional space search; index terms; information organization; optimal concept representations; pre-defined category taxonomic standards; semantics; similarity based categorization algorithms; training document sets; Computer science; Decision trees; Educational institutions; Information management; Internet; Natural languages; Neural networks; Ontologies; Software libraries; Text categorization;
fLanguage
English
Publisher
IEEE
Conference_Titel
Communications, Circuits and Systems and West Sino Expositions, IEEE 2002 International Conference on
Print_ISBN
0-7803-7547-5
Type
conf
DOI
10.1109/ICCCAS.2002.1179005
Filename
1179005