Regional Information Center for Science and Technology - Feature selection for text clustering in limited memory using Monte Carlo wrapper

DocumentCode :

1791676

Title :

Feature selection for text clustering in limited memory using Monte Carlo wrapper

Author :

Deolalikar, Vinay

fYear :

2014

fDate :

27-30 Oct. 2014

Firstpage :

Lastpage :

Abstract :

Feature selection is a natural choice of technique to scale up clustering. However, feature selection for clustering has received scant research attention compared to the vast literature on feature selection for classification. This is due to the absence of class labels in clustering. The approaches to feature selection for text clustering proposed in the literature require (repeated) clustering of the entire corpus. We demonstrate that we can, instead, apply Monte Carlo (MC) techniques to “collect” features by inspecting only tiny portions of the entire dataset at any time. We do not require any global information about features to be computed over the entire dataset. We evaluate the trade-offs between the various cost parameters of the MC procedure, the efficacy of the clustering produced, and the time taken to produce it. Using MC for feature selection, we are able to perform clustering three to four times faster on standard text benchmarks. Such a speed-up can enable near-real-time clustering in enterprise workflows using limited memory.
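
The idea of "collecting" features from tiny random portions of the corpus can be illustrated with a minimal sketch. This is not the paper's algorithm: the abstract does not specify the clustering method, the feature score, or the cost parameters, so the small cosine k-means, the inside-minus-outside count score, and every name and parameter here (mc_collect_features, n_trials, sample_size, k, top_per_cluster) are assumptions made only for illustration.

```python
import math
import random
from collections import Counter


def _cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(v * b.get(t, 0) for t, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def _kmeans(vectors, k, rng, iters=10):
    """Tiny k-means with cosine similarity (assumed; the abstract names no algorithm)."""
    centroids = [Counter(v) for v in rng.sample(vectors, k)]
    labels = [0] * len(vectors)
    for _ in range(iters):
        labels = [max(range(k), key=lambda c: _cosine(v, centroids[c]))
                  for v in vectors]
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties out
                total = Counter()
                for m in members:
                    total.update(m)
                centroids[c] = total  # a sum suffices: cosine ignores scale
    return labels


def mc_collect_features(docs, n_trials=20, sample_size=8, k=2,
                        top_per_cluster=3, seed=0):
    """Collect features from many small random samples of tokenized documents.

    Only `sample_size` documents are vectorized and clustered per trial, and
    no corpus-wide feature statistic is computed.
    """
    rng = random.Random(seed)
    collected = Counter()  # term -> number of (trial, cluster) picks
    for _ in range(n_trials):
        picked = rng.sample(docs, min(sample_size, len(docs)))
        sample = [Counter(d) for d in picked]
        labels = _kmeans(sample, min(k, len(sample)), rng)
        for c in set(labels):
            inside, outside = Counter(), Counter()
            for v, lab in zip(sample, labels):
                (inside if lab == c else outside).update(v)
            # Hypothetical score: count inside the cluster minus count outside.
            ranked = sorted(inside, key=lambda t: (inside[t] - outside[t], t),
                            reverse=True)
            collected.update(ranked[:top_per_cluster])
    return collected
```

Calling mc_collect_features(docs) on a list of token lists returns a Counter of terms; terms picked in many trials would be the natural candidates to keep before clustering the full corpus on the reduced vocabulary. The number of trials and the sample size stand in for the kind of cost parameters the abstract says are traded off against clustering quality and running time.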

Keywords :

Monte Carlo methods; feature selection; pattern clustering; text analysis; MC; Monte Carlo wrapper; text clustering; Approximation algorithms; Approximation methods; Benchmark testing; Clustering algorithms; Entropy; Feature extraction

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Big Data (Big Data), 2014 IEEE International Conference on

Conference_Location :

Washington, DC

Type :

conf

DOI :

10.1109/BigData.2014.7004355

Filename :

7004355

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=1791676