Title :
An Improvement to TF: Term Distribution Based Term Weight Algorithm
Author :
Xia Tian ; Tong, Wang
Author_Institution :
Dept. of Comput. & Inf. Sci., Shanghai Second Polytech. Univ., Shanghai, China
Abstract :
In document formalization, the term weight algorithm plays an important role: it strongly affects the precision and recall of natural language processing (NLP) systems. Currently, the TF-IDF term weight algorithm is widely applied in language models to build NLP systems. Since term frequency is not the only discriminator that must be considered when calculating a term weight that indicates term importance, we were motivated to investigate other statistical characteristics of terms and found an important discriminator: term distribution. Furthermore, we found that a term with higher frequency and a distribution close to hypo-dispersion should be given a higher weight than one with lower frequency and a distribution close to intensive. Based on this hypothesis, and by leveraging the Pearson chi-square test statistic, this paper puts forward a Term Distribution based Term Weight Algorithm. The experimental results at the end of the paper confirm the reliability and efficiency of the algorithm.
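The abstract does not give the weighting formula, so the following is only a minimal sketch of the general idea, not the authors' algorithm. It assumes that "hypo-dispersion" means a term spread evenly through a document and "intensive" means a term clustered in one part of it. The segmentation into n_segments parts, the function names, and the final form tf / (1 + chi2) are all hypothetical choices made for illustration.

```python
def split_segments(tokens, n_segments):
    """Split a token list into up to n_segments contiguous chunks (assumed scheme)."""
    seg_len = max(1, -(-len(tokens) // n_segments))  # ceiling division
    return [tokens[i:i + seg_len] for i in range(0, len(tokens), seg_len)]


def chi_square_dispersion(tokens, term, n_segments=4):
    """Pearson chi-square statistic of the term's per-segment counts.

    The expected count in each segment is proportional to the segment length,
    i.e. the null hypothesis is that the term is spread evenly over the text.
    A value near 0 means an even spread; a large value means clustering.
    """
    segments = split_segments(tokens, n_segments)
    observed = [seg.count(term) for seg in segments]
    total = sum(observed)
    if total == 0:
        return 0.0
    chi2 = 0.0
    for seg, obs in zip(segments, observed):
        expected = total * len(seg) / len(tokens)
        chi2 += (obs - expected) ** 2 / expected
    return chi2


def distribution_weight(tokens, term, n_segments=4):
    """Hypothetical weight: raw term frequency damped by the dispersion statistic."""
    tf = tokens.count(term)
    return tf / (1.0 + chi_square_dispersion(tokens, term, n_segments))


# Same frequency (4 occurrences), different distributions.
evenly_spread = ["a", "x", "a", "x", "a", "x", "a", "x"]
clustered = ["a", "a", "a", "a", "x", "x", "x", "x"]
print(distribution_weight(evenly_spread, "a"))
print(distribution_weight(clustered, "a"))
```

With equal term frequency, the evenly spread term receives the higher weight under this sketch, which matches the direction of the hypothesis stated in the abstract. The actual combination of frequency and chi-square statistic used in the paper may differ.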
Keywords :
natural language processing; Pearson chi-square test statistic; term weight algorithm; Computer networks; Computer security; Distributed computing; Frequency; Information retrieval; Information science; Information security; Space technology; Statistical distributions; Wireless communication; IDF; TF; Term Weight
Conference_Title :
Networks Security Wireless Communications and Trusted Computing (NSWCTC), 2010 Second International Conference on
Conference_Location :
Wuhan, Hubei
Print_ISBN :
978-0-7695-4011-5
Electronic_ISBN :
978-1-4244-6598-9
DOI :
10.1109/NSWCTC.2010.66