• DocumentCode
    3169451
  • Title
    Learning text classifier using the domain concept hierarchy
  • Author
    Wang, Bill B. ; McKay, Ri Bob ; Abbass, Hussein A. ; Barlow, Michael
  • Author_Institution
    Sch. of Comput. Sci., Univ. of New South Wales, Canberra, ACT, Australia
  • Volume
    2
  • fYear
    2002
  • fDate
    29 June-1 July 2002
  • Firstpage
    1230
  • Abstract
    Automatic text categorization is an important component in many information organization and management tasks. Research has shown that similarity-based categorization algorithms such as K-nearest neighbour (KNN) are effective in document categorization. These algorithms use index terms to represent documents. However, they suffer from some drawbacks. One major drawback is that they tend to use all features when computing similarities, which implies that they must search in a high-dimensional space. Another major drawback is that they tend to require a very large training document set so that all terms that are important for identifying the content of documents are covered. To overcome these drawbacks, we present a novel method that searches a domain ontology's hierarchical structure for the optimal representation of the concepts that reflect the taxonomic standard for the pre-defined categories. Experiments show that this is a feasible way to reduce the dimensionality of the document vector space effectively, and that it consequently improves the generalisation power of the derived classifier. The result is a classification method that is both significantly less costly in computational terms and considerably more accurate than comparable methods.
  • Keywords
    classification; indexing; learning (artificial intelligence); search problems; text analysis; vocabulary; K-nearest neighbour algorithms; KNN document categorization; automatic text categorization; classifier generalisation power; document vector space dimensionality reduction; domain concept hierarchy learning text classifiers; domain ontology hierarchical structures; heuristic search algorithms; high-dimensional space search; index terms; information organization; optimal concept representations; pre-defined category taxonomic standards; semantics; similarity based categorization algorithms; training document sets; Computer science; Decision trees; Educational institutions; Information management; Internet; Natural languages; Neural networks; Ontologies; Software libraries; Text categorization;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Communications, Circuits and Systems and West Sino Expositions, IEEE 2002 International Conference on
  • Print_ISBN
    0-7803-7547-5
  • Type
    conf
  • DOI
    10.1109/ICCCAS.2002.1179005
  • Filename
    1179005