DocumentCode :
1048168
Title :
TopCat: data mining for topic identification in a text corpus
Author :
Clifton, Chris ; Cooley, Robert ; Rennie, Jason
Author_Institution :
Dept. of Comput. Sci., Purdue Univ., West Lafayette, IN, USA
Volume :
16
Issue :
8
Year :
2004
Firstpage :
949
Lastpage :
964
Abstract :
TopCat (topic categories) is a technique for identifying topics that recur in articles in a text corpus. Natural language processing techniques are used to identify key entities in individual articles, allowing us to represent an article as a set of items. This allows us to view the problem in a database/data mining context: Identifying related groups of items. We present a novel method for identifying related items based on traditional data mining techniques. Frequent itemsets are generated from the groups of items, followed by clusters formed with a hypergraph partitioning scheme. We present an evaluation against a manually categorized ground truth news corpus; it shows this technique is effective in identifying topics in collections of news articles.
Keywords :
data mining; information retrieval; natural languages; pattern clustering; text analysis; very large databases; TopCat; database; frequent item sets; hypergraph partitioning scheme; natural language processing techniques; text corpus; topic categories; Association rules; Databases; Filtering; Humans; Itemsets; Natural language processing; Performance analysis; Topic detection; clustering
Language :
English
Journal_Title :
Knowledge and Data Engineering, IEEE Transactions on
Publisher :
IEEE
ISSN :
1041-4347
Type :
jour
DOI :
10.1109/TKDE.2004.32
Filename :
1318580