• DocumentCode
    2542665
  • Title

    An optimized features extraction algorithm on VSM

  • Author

    Kui Fang ; Juan Wang

  • Author_Institution
    Coll. of Inf. Sci. & Technol., Hunan Agric. Univ., Changsha, China
  • fYear
    2012
  • fDate
    29-31 May 2012
  • Firstpage
    1471
  • Lastpage
    1473
  • Abstract
    The VSM (Vector Space Model) is one of the important methods for describing documents. However, the feature vectors produced when representing information are typically high-dimensional, so feature extraction techniques have to be used to reduce the number of dimensions. Many feature extraction algorithms exist, of which TF-IDF and TF-IDF-IG are widely used in practice. However, neither sufficiently considers the influence of text categories or the structure of HTML, which greatly affects the accuracy and applicability of the algorithms. To address this issue, we propose an optimized feature extraction algorithm. We also introduce a modifying factor into the new algorithm to avoid the data imbalance problem that results from differences in the magnitude of categories. In experiments, the proposed algorithm was compared with TF-IDF and TF-IDF-IG. The precision and recall of the new algorithm exceed those of TF-IDF by more than 10.4% and 13.8%, respectively, and those of TF-IDF-IG by 4.6% and 2.9%, which shows that the new algorithm achieves better precision and recall.
  • Keywords
    document handling; feature extraction; information retrieval; vectors; TF-IDF algorithm; TF-IDF-IG algorithm; VSM; data imbalance problem avoidance; dimension reduction; documents representation; information representation process; optimized feature extraction algorithm; vector space model; Algorithm design and analysis; Classification algorithms; Educational institutions; Feature extraction; HTML; Information processing; Text categorization; TF-IDF; TF-IDF-IG; features extraction
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Fuzzy Systems and Knowledge Discovery (FSKD), 2012 9th International Conference on
  • Conference_Location
    Sichuan
  • Print_ISBN
    978-1-4673-0025-4
  • Type

    conf

  • DOI
    10.1109/FSKD.2012.6233810
  • Filename
    6233810
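
The abstract treats TF-IDF as a baseline for reducing the dimensionality of VSM document vectors. As a rough illustration of that baseline only (not the optimized algorithm, whose category and HTML-structure weighting and modifying factor are not specified in this record), a minimal sketch might look like the following. The function name `tfidf_top_k`, the sum-over-documents scoring rule, and the toy corpus are assumptions made for the example.

```python
import math
from collections import Counter


def tfidf_top_k(docs, k):
    """Rank terms by summed TF-IDF over the corpus and keep the top k.

    Assumption: a term's corpus-level score is the sum of its per-document
    TF-IDF weights; other aggregation rules (max, mean) are equally plausible.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    scores = Counter()
    for doc in docs:
        counts = Counter(doc)
        for term, count in counts.items():
            tf = count / len(doc)              # relative term frequency
            idf = math.log(n_docs / df[term])  # 0 for terms in every document
            scores[term] += tf * idf
    # Sort by descending score, breaking ties alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


# Hypothetical tokenized corpus, for illustration only.
docs = [
    ["rice", "yield", "soil"],
    ["rice", "price", "market"],
    ["rice", "soil", "water"],
]
selected = tfidf_top_k(docs, 2)
```

Terms that occur in every document (here "rice") receive an IDF of zero and are dropped first, which is the dimension-reduction effect the abstract refers to. Nothing in this sketch uses category labels or HTML structure, the two factors the abstract says plain TF-IDF does not sufficiently consider.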