Title :
A Redundancy Based Term Weighting Approach for Text Categorization
Author :
Lu, Zhen-Yu ; Lin, Yong-Min ; Zhao, Shuang ; Chen, Jing-Nian ; Zhu, Wei-Dong
Author_Institution :
Coll. of Econ. & Manage., Hebei Polytech. Univ., Tangshan, China
Abstract :
With the rapid development of the World Wide Web, text categorization has come to play an important role in organizing and processing large amounts of text data. TF·IDF is a simple and fast term weighting method that is widely used in text categorization. Its drawback is that a large weight may be assigned to rarely occurring terms regardless of their posterior distribution. This paper presents a redundancy-based term weighting method that addresses this problem by taking the posterior probability distribution into consideration. Experiments on Reuters-21578 and on a Chinese corpus provided by the Computer and Information Technology Data Center of Fudan University show that this weighting method performs better than TF·IDF.
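The drawback of TF·IDF named in the abstract can be illustrated with a minimal sketch. It assumes the common tf * log(N / df) formulation, which the record does not specify, and it does not reproduce the paper's redundancy-based method. The toy corpus, the token "zyx", and the function name tf_idf are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df).

    Assumed formulation; class labels are never consulted, which is the
    limitation the abstract points out.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


# Hypothetical toy corpus of (class label, tokens) pairs.
corpus = [
    ("grain", ["wheat", "export", "price"]),
    ("grain", ["wheat", "harvest", "price"]),
    ("trade", ["tariff", "export", "price", "zyx"]),
    ("trade", ["tariff", "import", "price"]),
]

weights = tf_idf([tokens for _, tokens in corpus])
rare = weights[2]["zyx"]        # occurs in a single document
topical = weights[2]["tariff"]  # occurs in both "trade" documents
print(f"zyx={rare:.3f} tariff={topical:.3f}")
```

In this toy setting the one-off token "zyx" receives a larger weight than "tariff", even though "tariff" occurs in every "trade" document and in no "grain" document. The weighting depends only on document counts, not on how a term is distributed across classes.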
Keywords :
Internet; information retrieval; statistical distributions; text analysis; TF-IDF; World Wide Web; inverse document frequency; posterior probability distribution; redundancy-based term weighting; term frequency; text categorization; Educational institutions; Engineering management; Frequency measurement; Information technology; Organizing; Probability distribution; Software development management; Software engineering; Text categorization; Web sites;
Conference_Title :
Software Engineering, 2009. WCSE '09. WRI World Congress on
Conference_Location :
Xiamen
Print_ISBN :
978-0-7695-3570-8
DOI :
10.1109/WCSE.2009.191