Title :
Optimal Hash List for Word Frequency Analysis
Author_Institution :
Dept. of Inf. Eng., JDZ Ceramic Inst., Jingdezhen, China
Abstract :
Word frequency analysis plays an essential role in many data mining tasks on large-scale data sets built from text corpora, and the hash list is a simple but efficient structure for frequent pattern discovery. In this paper, a Poisson approximation approach is used to analyze, in probabilistic terms, the space efficiency of the hash list under different parameter settings. Based on this theoretical model, an optimal parameter setting for the hash list is given. Experimental results on real data show that a hash list with the optimal parameter achieves minimum or nearly minimum memory cost.
Keywords :
approximation theory; stochastic processes; text analysis; word processing; Poisson approximation approach; data mining tasks; frequent pattern discovery; hash list; text corpus; word frequency analysis; Poisson approximation; space efficiency; word frequency
Conference_Title :
Web Information Systems and Mining (WISM), 2010 International Conference on
Conference_Location :
Sanya
Print_ISBN :
978-1-4244-8438-6
DOI :
10.1109/WISM.2010.59