Title :
Query-relevant document representation for text clustering
Author :
Makrehchi, Masoud
Author_Institution :
Thomson Reuters, Eagan, MN, USA
Abstract :
In text categorization, one well-known document representation is bag-of-words. Although it is simple and popular, it ignores semantics, underlying linguistic information, and word correlations. In this paper, a new representation for text data, called Bag-Of-Queries (BOQ), is proposed. First, a taxonomy of the terms in the local vocabulary is extracted by learning term dependencies using an information-theoretic inclusion index. Next, the taxonomy is partitioned into sets of correlated terms, which form the bag of queries. Since any two partitions belong to different concepts, they are treated as semantically orthogonal queries. This provides a new space of orthogonal features, which is necessary for efficient categorization. Finally, instead of using terms directly as features, we use them to build a set of queries. Documents are ranked in response to the queries using a similarity measure, and the resulting similarity indices serve as the new features in a vector space model representation. The proposed approach outperforms bag-of-words-based clustering. It also extracts new non-redundant features while reducing dimensionality.
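The pipeline summarized in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the abstract does not define the inclusion index, the partitioning method, or the similarity measure, so the sketch assumes a simple document co-occurrence ratio as a stand-in inclusion score, connected components of the thresholded term graph as the partitions, and cosine similarity between a document's term counts and a binary query vector. The function names (`inclusion`, `build_queries`, `boq_features`), the `threshold` value, and the toy corpus are all hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def inclusion(doc_sets, a, b):
    """Stand-in inclusion score (assumption): share of documents with `a` that also contain `b`."""
    with_a = [d for d in doc_sets if a in d]
    if not with_a:
        return 0.0
    return sum(1 for d in with_a if b in d) / len(with_a)


def build_queries(docs, threshold=0.6):
    """Group terms into queries: connected components of the thresholded inclusion graph."""
    doc_sets = [set(d) for d in docs]
    vocab = sorted(set().union(*doc_sets))
    parent = {t: t for t in vocab}  # union-find forest over terms

    def find(t):
        while parent[t] != t:
            parent[t] = parent[parent[t]]  # path halving
            t = parent[t]
        return t

    for a, b in combinations(vocab, 2):
        score = max(inclusion(doc_sets, a, b), inclusion(doc_sets, b, a))
        if score >= threshold:
            parent[find(a)] = find(b)  # merge the two term groups

    groups = {}
    for t in vocab:
        groups.setdefault(find(t), set()).add(t)
    # Sort groups by their sorted term lists for a deterministic order.
    return sorted(groups.values(), key=lambda g: sorted(g))


def cosine(doc, query):
    """Cosine similarity between a document's term counts and a binary query vector."""
    counts = Counter(doc)
    dot = sum(counts[t] for t in query)
    norm = sqrt(sum(c * c for c in counts.values())) * sqrt(len(query))
    return dot / norm if norm else 0.0


def boq_features(docs, queries):
    """One feature per query: the document's similarity to that query."""
    return [[cosine(d, q) for q in queries] for d in docs]


# Toy corpus (hypothetical): two finance documents, two sports documents.
docs = [
    ["stock", "market", "stock", "price"],
    ["market", "price", "stock"],
    ["goal", "match", "team"],
    ["team", "match", "goal", "goal"],
]
queries = build_queries(docs)
features = boq_features(docs, queries)
```

On this toy corpus the six-term vocabulary collapses into two queries, so each document is described by two similarity features instead of six term counts, which mirrors the dimensionality reduction the abstract mentions. The resulting `features` rows could then be passed to any standard clustering algorithm.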
Keywords :
indexing; learning (artificial intelligence); pattern clustering; query processing; text analysis; vocabulary; bag-of-queries; bag-of-words; information theoretic inclusion index; learning; linguistic information; local vocabulary; query-relevant document representation; semantically orthogonal queries; text categorization; text clustering; vector space model representation; word correlations; Indexes; Mutual information; Semantics; Taxonomy; Text categorization; Vocabulary;
Conference_Title :
Digital Information Management (ICDIM), 2010 Fifth International Conference on
Conference_Location :
Thunder Bay, ON
Print_ISBN :
978-1-4244-7572-8
DOI :
10.1109/ICDIM.2010.5664205