DocumentCode :
679547
Title :
Most Clusters Can Be Retrieved with Short Disjunctive Queries
Author :
Deolalikar, Vinay
fYear :
2013
fDate :
7-10 Dec. 2013
Firstpage :
1019
Lastpage :
1024
Abstract :
Simple keyword based searches are ubiquitous in today's internet age. It is hard to imagine an information system today that does not permit a simple keyword based search. This method of information retrieval has the obvious benefits of being highly interpretable, and having wide usage. However, a general perception is that keyword search may not be as powerful an information retrieval paradigm as those that utilize data mining technologies. At the same time, the tremendous growth in textual information in various domains has also given impetus to data mining technologies such as document clustering. Document clustering is a powerful technique, having wide applications in enterprise information management (EIM). However, there is a general perception that the clusters it produces are not always easily interpretable. This hampers its usage in certain settings. This leads us to the following question: can we retrieve a cluster (from a corpus) using a keyword search with precision and recall that are reasonable from the point of view of a retrieval system? What is the form of such a keyword search? How many keywords do we require? How do we arrive at these keywords? Not only are these questions natural, they have immediate use in several highly regulated applications in EIM such as eDiscovery and compliance, where document sets must be specified using keywords. In order to answer our question, we construct a framework that uses maximal frequent discriminative item sets. The novelty of our usage of these item sets is that although their definition as frequent item sets is conjunctive, we use them to form a disjunctive query upon the corpus. We then study the results of this query as an information retrieval problem whose target is the cluster. Our study yields a surprising result: most clusters can be retrieved, up to reasonable precision and recall, using a disjunctive query of only three terms. Among other ramifications, this gives us a readily interpretable description of a cluster in terms of the disjunctive query that returns it.
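The evaluation described in the abstract (run a short disjunctive query over the corpus and score the result against a target cluster) can be illustrated with a minimal sketch. This is not the paper's framework: it does not mine maximal frequent discriminative item sets, and the function names, the toy corpus, and the three query terms are all hypothetical, chosen only to show how precision and recall of an OR query against a cluster would be computed.

```python
def disjunctive_retrieve(docs, query_terms):
    """Return indices of documents that contain at least one query term (OR semantics)."""
    terms = set(query_terms)
    return {i for i, doc in enumerate(docs) if terms & doc}


def precision_recall(retrieved, target):
    """Score a retrieved set of document indices against the target cluster."""
    hits = len(retrieved & target)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(target) if target else 0.0
    return precision, recall


# Hypothetical toy corpus: each document is a set of terms.
docs = [
    {"merger", "contract", "audit"},  # 0: in the target cluster
    {"contract", "liability"},        # 1: in the target cluster
    {"audit", "compliance"},          # 2: in the target cluster
    {"football", "score"},            # 3: outside the cluster
    {"recipe", "audit"},              # 4: outside the cluster
    {"travel", "hotel"},              # 5: outside the cluster
]
cluster = {0, 1, 2}  # assumed output of some clustering step

# A three-term disjunctive query, hand-picked here for illustration.
query = ["contract", "audit", "compliance"]

retrieved = disjunctive_retrieve(docs, query)
precision, recall = precision_recall(retrieved, cluster)
print(sorted(retrieved), precision, recall)  # [0, 1, 2, 4] 0.75 1.0
```

In this toy case the query recovers every document in the cluster but also pulls in one outside document through the shared term "audit", which is the kind of precision and recall trade-off the abstract refers to.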
Keywords :
information systems; pattern clustering; query processing; text analysis; EIM; Internet; compliance; data mining technology; document clustering; document sets; e-discovery; enterprise information management; information retrieval method; information system; keyword based searches; maximal frequent discriminative item sets; short disjunctive query; textual information; Clustering algorithms; Data mining; Itemsets; Keyword search; Organizations; Standards; Disjunctive Queries; Document Clustering; Frequent Itemsets; Keyword Search; Retrieval;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Data Mining (ICDM), 2013 IEEE 13th International Conference on
Conference_Location :
Dallas, TX
ISSN :
1550-4786
Type :
conf
DOI :
10.1109/ICDM.2013.94
Filename :
6729591