  • DocumentCode
    1048168
  • Title
    TopCat: data mining for topic identification in a text corpus
  • Author
    Clifton, Chris; Cooley, Robert; Rennie, Jason
  • Author_Institution
    Dept. of Comput. Sci., Purdue Univ., West Lafayette, IN, USA
  • Volume
    16
  • Issue
    8
  • fYear
    2004
  • Firstpage
    949
  • Lastpage
    964
  • Abstract
    TopCat (topic categories) is a technique for identifying topics that recur in articles in a text corpus. Natural language processing techniques are used to identify key entities in individual articles, allowing us to represent an article as a set of items. This allows us to view the problem in a database/data mining context: identifying related groups of items. We present a novel method for identifying related items based on traditional data mining techniques. Frequent itemsets are generated from the groups of items, followed by clusters formed with a hypergraph partitioning scheme. We present an evaluation against a manually categorized ground-truth news corpus; it shows this technique is effective in identifying topics in collections of news articles.
  • Keywords
    data mining; information retrieval; natural languages; pattern clustering; text analysis; very large databases; TopCat; database; frequent item sets; hypergraph partitioning scheme; natural language processing techniques; text corpus; topic categories; Association rules; Databases; Filtering; Humans; Itemsets; Natural language processing; Performance analysis; Topic detection; clustering
  • fLanguage
    English
  • Journal_Title
    Knowledge and Data Engineering, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    1041-4347
  • Type
    jour
  • DOI
    10.1109/TKDE.2004.32
  • Filename
    1318580