• DocumentCode
    2915627
  • Title
    A new text classification technique using small training sets
  • Author
    Clarizia, Fabio ; Colace, Francesco ; De Santo, Massimo ; Greco, Luca ; Napoletano, Paolo
  • Author_Institution
    Dept. of Electron. Eng. & Comput. Eng., Univ. of Salerno, Fisciano, Italy
  • fYear
    2011
  • fDate
    22-24 Nov. 2011
  • Firstpage
    1038
  • Lastpage
    1043
  • Abstract
    Text classification methods have achieved high accuracy on supervised classification tasks over large datasets. However, because these classifiers must learn from many examples to perform well on a test set, they can be difficult to employ in real contexts. Most users of a practical system do not want to spend a long time labeling data only to obtain a better level of accuracy; they prefer algorithms that achieve high accuracy without requiring a large amount of manual labeling. In this paper we propose a new supervised method for single-label text classification, based on a mixed Graph of Terms, that achieves good accuracy when the size of the training set is 1% of the original. The mixed Graph of Terms can be automatically extracted from a set of documents using a form of term clustering weighted by a probabilistic topic model. The method has been tested on the top 10 classes of the ModApte split of the Reuters-21578 dataset, with learning performed on 1% of the original training set. The results confirm the discriminative property of the graph and show that the proposed method is comparable with existing methods trained on the whole training set.
  • Keywords
    graph theory; pattern classification; pattern clustering; text analysis; ModApte; Reuters-21578; documents; graph of terms; manual labeling tasks; single label text classification; small training sets; supervised classification tasks; term clustering technique; text classification technique; Accuracy; Feature extraction; Intelligent systems; Probabilistic logic; Semantics; Training; Vectors; Text classification; probabilistic topic model; term extraction
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Intelligent Systems Design and Applications (ISDA), 2011 11th International Conference on
  • Conference_Location
    Cordoba
  • ISSN
    2164-7143
  • Print_ISBN
    978-1-4577-1676-8
  • Type
    conf
  • DOI
    10.1109/ISDA.2011.6121795
  • Filename
    6121795