• DocumentCode
    1938276
  • Title
    An Improved Document Classification Approach with Maximum Entropy and Entropy Feature Selection
  • Author
    Pang, Xiu-Li; Feng, Yu-qiang; Jiang, Wei
  • Author_Institution
    Harbin Inst. of Technol., Harbin
  • Volume
    7
  • fYear
    2007
  • fDate
    19-22 Aug. 2007
  • Firstpage
    3911
  • Lastpage
    3915
  • Abstract
    Document classification is an important task in the field of document management. The Bayesian model requires a feature-independence assumption; artificial neural networks suffer from overfitting; and the support vector machine (SVM) does not perform well with real-valued features. This paper proposes combining entropy and machine learning techniques for document classification. First, cross entropy and average mutual information are used to extract features effectively. Second, the support vector machine and the maximum entropy (ME) model are each adopted to perform the classification task in the feature space. Furthermore, an improved feature description that replaces the binary feature with a real-valued one is presented, since prior knowledge of each word is helpful for document classification. Finally, we compare our method with traditional methods; the experiments show that our method improves the F-measure by 2.78% over the basic ME model and by 0.95% over a naive Bayes model smoothed with the Good-Turing algorithm.
  • Keywords
    Bayes methods; document handling; feature extraction; learning (artificial intelligence); maximum entropy methods; neural nets; smoothing methods; support vector machines; Bayesian model; Good-Turing algorithm; SVM; artificial neural network; document classification; document management; entropy feature selection; feature extraction; machine learning; maximum entropy; mutual information; smoothing; support vector machine; Bayesian methods; Conference management; Cybernetics; Entropy; Feature extraction; Machine learning; Support vector machine classification; Support vector machines; Technology management; Testing; Document classification; Entropy; Feature extraction; Maximum entropy model; Support vector machine;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Machine Learning and Cybernetics, 2007 International Conference on
  • Conference_Location
    Hong Kong
  • Print_ISBN
    978-1-4244-0973-0
  • Electronic_ISBN
    978-1-4244-0973-0
  • Type
    conf
  • DOI
    10.1109/ICMLC.2007.4370829
  • Filename
    4370829