DocumentCode
2542665
Title
An optimized features extraction algorithm on VSM
Author
Kui Fang ; Juan Wang
Author_Institution
Coll. of Inf. Sci. & Technol., Hunan Agric. Univ., Changsha, China
fYear
2012
fDate
29-31 May 2012
Firstpage
1471
Lastpage
1473
Abstract
The vector space model (VSM) is one of the most important methods for representing documents. However, in information representation the features are always high-dimensional, so feature extraction techniques must be used to reduce the number of dimensions. Many feature extraction algorithms exist, of which TF-IDF and TF-IDF-IG are widely used in practice. Neither, however, sufficiently considers the influence of text categories or the structure of HTML, which greatly affects the accuracy and applicability of these algorithms. To address this issue, we propose an optimized feature extraction algorithm. We also introduce a modifying factor into the new algorithm to avoid the data imbalance problem that results from the differing magnitudes of the categories. In our experiment, the proposed algorithm was compared with TF-IDF and TF-IDF-IG. Its precision and recall are more than 10.4% and 13.8% higher, respectively, than those of TF-IDF, and 4.6% and 2.9% higher than those of TF-IDF-IG, which shows that the new algorithm achieves better precision and recall.
Keywords
document handling; feature extraction; information retrieval; vectors; TF-IDF algorithm; TF-IDF-IG algorithm; VSM; data imbalance problem avoidance; dimension reduction; documents representation; information representation process; optimized feature extraction algorithm; vector space model; Algorithm design and analysis; Classification algorithms; Educational institutions; Feature extraction; HTML; Information processing; Text categorization; TF-IDF; TF-IDF-IG; features extraction
fLanguage
English
Publisher
IEEE
Conference_Titel
Fuzzy Systems and Knowledge Discovery (FSKD), 2012 9th International Conference on
Conference_Location
Sichuan
Print_ISBN
978-1-4673-0025-4
Type
conf
DOI
10.1109/FSKD.2012.6233810
Filename
6233810