DocumentCode
1048168
Title
TopCat: data mining for topic identification in a text corpus
Author
Clifton, Chris ; Cooley, Robert ; Rennie, Jason
Author_Institution
Dept. of Comput. Sci., Purdue Univ., West Lafayette, IN, USA
Volume
16
Issue
8
fYear
2004
Firstpage
949
Lastpage
964
Abstract
TopCat (topic categories) is a technique for identifying topics that recur in articles in a text corpus. Natural language processing techniques are used to identify key entities in individual articles, allowing us to represent an article as a set of items. This allows us to view the problem in a database/data mining context: Identifying related groups of items. We present a novel method for identifying related items based on traditional data mining techniques. Frequent itemsets are generated from the groups of items, followed by clusters formed with a hypergraph partitioning scheme. We present an evaluation against a manually categorized ground truth news corpus; it shows this technique is effective in identifying topics in collections of news articles.
Keywords
data mining; information retrieval; natural languages; pattern clustering; text analysis; very large databases; TopCat; database; frequent item sets; hypergraph partitioning scheme; natural language processing techniques; text corpus; topic categories; Association rules; Databases; Filtering; Humans; Itemsets; Natural language processing; Performance analysis; Topic detection; clustering
fLanguage
English
Journal_Title
Knowledge and Data Engineering, IEEE Transactions on
Publisher
IEEE
ISSN
1041-4347
Type
jour
DOI
10.1109/TKDE.2004.32
Filename
1318580