TopCat: data mining for topic identification in a text corpus

Author

Clifton, Chris ; Cooley, Robert ; Rennie, Jason

Author_Institution

Dept. of Comput. Sci., Purdue Univ., West Lafayette, IN, USA

Volume

16

Issue

8

fYear

2004

Firstpage

949

Lastpage

964

Abstract

TopCat (topic categories) is a technique for identifying topics that recur in articles in a text corpus. Natural language processing techniques are used to identify key entities in individual articles, allowing us to represent an article as a set of items. This allows us to view the problem in a database/data mining context: identifying related groups of items. We present a novel method for identifying related items based on traditional data mining techniques. Frequent itemsets are generated from the groups of items and are then clustered with a hypergraph partitioning scheme. We present an evaluation against a manually categorized ground-truth news corpus; it shows that this technique is effective in identifying topics in collections of news articles.
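
The pipeline described in the abstract (articles as sets of items, then frequent itemsets, then clustering) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The toy articles, the function names `frequent_itemsets` and `group_itemsets`, and the `min_support` and `max_size` parameters are all assumptions made for illustration. In particular, the grouping step below simply merges frequent itemsets that share an item (connected components via union-find); it is a much simpler stand-in for the hypergraph partitioning scheme the abstract names.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(articles, min_support, max_size=3):
    """Brute-force count of item combinations (size 1..max_size).

    Keeps the combinations that occur in at least `min_support` articles.
    """
    counts = Counter()
    for items in articles:
        for size in range(1, max_size + 1):
            for combo in combinations(sorted(items), size):
                counts[frozenset(combo)] += 1
    return {s: c for s, c in counts.items() if c >= min_support}


def group_itemsets(itemsets):
    """Merge itemsets (size >= 2) that share any item, using union-find.

    ASSUMPTION: a simple stand-in for hypergraph partitioning.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for itemset in itemsets:
        if len(itemset) < 2:
            continue  # singletons carry no co-occurrence information
        items = list(itemset)
        root = find(items[0])
        for other in items[1:]:
            parent[find(other)] = root

    groups = {}
    for item in parent:
        groups.setdefault(find(item), set()).add(item)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical toy corpus: each article is the set of entities found in it.
articles = [
    {"NATO", "Kosovo", "Serbia"},
    {"NATO", "Kosovo"},
    {"Kosovo", "Serbia", "NATO"},
    {"Microsoft", "Justice Department"},
    {"Microsoft", "Justice Department", "Bill Gates"},
    {"Kosovo", "Bill Gates"},
]

support = frequent_itemsets(articles, min_support=2)
topics = group_itemsets(support)
print(topics)
# → [['Justice Department', 'Microsoft'], ['Kosovo', 'NATO', 'Serbia']]
```

In this toy run, "Bill Gates" appears in two articles but never co-occurs with the same entity twice, so it joins neither topic. Merging on any shared item tends to chain unrelated topics together on a real corpus, which is one reason a partitioning step would be preferred over this stand-in.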

Keywords

data mining; information retrieval; natural languages; pattern clustering; text analysis; very large databases; TopCat; database; frequent item sets; hypergraph partitioning scheme; natural language processing techniques; text corpus; topic categories; Association rules; Databases; Filtering; Humans; Itemsets; Natural language processing; Performance analysis; Topic detection; clustering

fLanguage

English

Journal_Title

Knowledge and Data Engineering, IEEE Transactions on

Publisher

IEEE

ISSN

1041-4347

Type

jour

DOI

10.1109/TKDE.2004.32

Filename

1318580