DocumentCode :
2915653
Title :
TF-SIDF: Term frequency, sketched inverse document frequency
Author :
Baena-García, Manuel ; Carmona-Cejudo, José M. ; Castillo, Gladys ; Morales-Bueno, Rafael
Author_Institution :
Dept. Lenguajes y Cienc. de la Comput., Univ. de Malaga, Malaga, Spain
fYear :
2011
fDate :
22-24 Nov. 2011
Firstpage :
1044
Lastpage :
1049
Abstract :
Exact calculation of the TF-IDF weighting function in massive streams of documents involves challenging memory space requirements. In this work, we propose TF-SIDF, a novel solution for extracting relevant words from streams of documents with a large number of terms. TF-SIDF relies on the Count-Min Sketch data structure, which allows the counts of all the terms in the stream to be estimated. Results of the experiments conducted with two datasets show that this sketch-based algorithm achieves good approximations of the TF-IDF weighting values (as a rule, the top terms with the highest TF-IDF values remain the same), while substantial savings in memory usage are observed. It is also observed that the performance is highly correlated with the sketch size, and that wider sketch configurations are preferable given the same sketch size.
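The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the class and function names, the hashing scheme, the textbook weighting tf * log(N / df), and the toy documents are all assumptions made for illustration. A Count-Min Sketch keeps a small depth-by-width table of counters and answers each count query with the minimum over its rows, so estimates can exceed the true count but never fall below it.

```python
import hashlib
import math


class CountMinSketch:
    """Illustrative Count-Min Sketch: a depth x width table of counters."""

    def __init__(self, width, depth):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # Assumption: one hash per row, obtained by prefixing the row number.
        data = f"{row}:{item}".encode("utf-8")
        digest = hashlib.blake2b(data, digest_size=8).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, item, count=1):
        # Increment one counter in every row.
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # The minimum over rows is the least-contaminated counter.
        return min(
            self.table[row][self._index(item, row)]
            for row in range(self.depth)
        )


def tf_sidf(doc_tokens, sketch, n_docs):
    """Score terms of one document using sketched document frequencies.

    Assumption: the standard tf * log(N / df) weighting, with df read
    from the sketch instead of an exact per-term dictionary.
    """
    tf = {}
    for term in doc_tokens:
        tf[term] = tf.get(term, 0) + 1
    scores = {}
    for term, freq in tf.items():
        df = max(1, sketch.estimate(term))
        scores[term] = freq * math.log(n_docs / df)
    return scores


# Hypothetical toy stream of tokenized documents.
docs = [
    ["sketch", "stream", "stream"],
    ["stream", "memory"],
    ["stream", "term"],
]

sketch = CountMinSketch(width=64, depth=4)
for doc in docs:
    # Document frequency: count each term once per document.
    for term in set(doc):
        sketch.add(term)

scores = tf_sidf(docs[0], sketch, n_docs=len(docs))
```

In this layout the memory cost is fixed at width times depth counters regardless of vocabulary size. For a fixed budget, a wider table with fewer rows lowers the chance of collisions within each row, which is in line with the abstract's remark that wider configurations are preferable.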
Keywords :
data mining; data structures; storage management; text analysis; TF-IDF weighting function; TF-IDF weighting values; TF-SIDF; count-min sketch data structure; exact calculation; massive streams; memory space requirements; memory usage; sketch configurations; sketch size; sketch-based algorithm; sketched inverse document frequency; term frequency; Approximation methods; Correlation; Data structures; Graphics; Intelligent systems; Measurement; Radiation detectors; count-min sketch; text mining; tfidf;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Intelligent Systems Design and Applications (ISDA), 2011 11th International Conference on
Conference_Location :
Cordoba
ISSN :
2164-7143
Print_ISBN :
978-1-4577-1676-8
Type :
conf
DOI :
10.1109/ISDA.2011.6121796
Filename :
6121796