Title :
Feature selection for text clustering in limited memory using Monte Carlo wrapper
Author :
Deolalikar, Vinay
Abstract :
Feature selection is a natural technique for scaling up clustering. However, feature selection for clustering has received scant research attention compared with the vast literature on feature selection for classification, because clustering lacks class labels. The approaches to feature selection for text clustering proposed in the literature require (repeated) clustering of the entire corpus. We demonstrate that we can instead apply Monte Carlo (MC) techniques to “collect” features by inspecting only tiny portions of the dataset at any time, without computing any global information about features over the entire dataset. We evaluate the trade-offs between the cost parameters of the MC procedure, the efficacy of the resulting clustering, and the time taken to produce it. With MC-based feature selection, we are able to perform clustering three to four times faster on standard text benchmarks. Such a speed-up can enable near-real-time clustering in enterprise workflows with limited memory.
Keywords :
Monte Carlo methods; feature selection; pattern clustering; text analysis; MC; Monte Carlo wrapper; text clustering; Approximation algorithms; Approximation methods; Benchmark testing; Clustering algorithms; Entropy; Feature extraction
Conference_Title :
Big Data (Big Data), 2014 IEEE International Conference on
Conference_Location :
Washington, DC
DOI :
10.1109/BigData.2014.7004355