Title :
A Refined TF-IDF Algorithm Based on Channel Distribution Information for Web News Feature Extraction
Author :
Xu, Mingmin ; He, Liang ; Lin, Xin
Author_Institution :
Comput. Sci. & Technol. Dept., East China Normal Univ., Shanghai, China
Abstract :
The TF-IDF algorithm is widely used in text feature extraction, where the IDF value indicates the importance of a term. When applied to the processing of web news, however, the traditional IDF does not work well, especially on a collection divided into channels. To solve this problem, a refined IDF scheme named Channel Distribution Information (CDI) IDF is proposed, which is based on the information among the IDF values of the individual channel collections. Using these statistical features, the top terms and the meaningless terms can be identified. Experiments on a manually labeled test set indicate that, relative to the traditional TF-IDF, CDI TF-IDF increases Recall, Precision, and the F0.5 measure by 2.71%, 3.07%, and 3.00%, respectively.
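The abstract does not give the CDI IDF formula, so the following Python sketch only illustrates the underlying idea under stated assumptions: it computes a smoothed IDF separately for each channel and then measures how much those per-channel values differ. The function names, the add-one smoothing, and the use of the standard deviation as the "distribution" statistic are all hypothetical choices made for illustration and are not taken from the paper. The F-beta helper uses the standard definition, of which F0.5 is the beta = 0.5 case.

```python
import math
import statistics

def smoothed_idf(term, docs):
    """Add-one smoothed IDF, log((N + 1) / (df + 1)), over a list of token sets.

    The smoothing is an assumption of this sketch, not the paper's definition.
    """
    df = sum(1 for doc in docs if term in doc)
    return math.log((len(docs) + 1) / (df + 1))

def channel_idfs(term, channels):
    """Return {channel name: IDF of `term` computed inside that channel only}."""
    return {name: smoothed_idf(term, docs) for name, docs in channels.items()}

def idf_spread(term, channels):
    """Population standard deviation of the per-channel IDF values.

    Hypothetical stand-in for a channel distribution statistic: a term that is
    common in every channel has a spread near zero, while a term concentrated
    in one channel has a larger spread.
    """
    return statistics.pstdev(channel_idfs(term, channels).values())

def f_beta(precision, recall, beta=0.5):
    """Standard F-beta score; beta = 0.5 weights precision more than recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

if __name__ == "__main__":
    # Toy corpus: each document is a set of tokens, grouped by channel.
    channels = {
        "sports": [{"match", "goal", "said"}, {"goal", "team", "said"}],
        "finance": [{"stock", "market", "said"}, {"bank", "rate", "said"}],
    }
    for term in ("said", "goal"):
        print(term, channel_idfs(term, channels), idf_spread(term, channels))
```

In this toy corpus "said" occurs in every document of every channel, so its per-channel IDF values are identical and its spread is zero, whereas "goal" occurs only in the sports channel and shows a nonzero spread. How the paper actually turns such per-channel values into a refined IDF is not stated in the abstract.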
Keywords :
Internet; feature extraction; information filtering; statistical analysis; text analysis; Web news feature extraction; channel distribution information; refined TF-IDF algorithm; statistical features; text feature extraction; Computer science; Computer science education; Distributed computing; Educational technology; Frequency; Helium; TV; Testing; TF-IDF
Conference_Title :
Education Technology and Computer Science (ETCS), 2010 Second International Workshop on
Conference_Location :
Wuhan
Print_ISBN :
978-1-4244-6388-6
Electronic_ISBN :
978-1-4244-6389-3
DOI :
10.1109/ETCS.2010.130