DocumentCode :
2830473
Title :
Topic Detection by Clustering Keywords
Author :
Wartena, Christian ; Brussee, Rogier
Author_Institution :
Telematica Inst., Enschede
fYear :
2008
fDate :
1-5 Sept. 2008
Firstpage :
54
Lastpage :
58
Abstract :
We consider topic detection without any prior knowledge of category structure or possible categories. Keywords are extracted and clustered based on different similarity measures using the induced k-bisecting clustering algorithm. Evaluation on Wikipedia articles shows that clusters of keywords correlate strongly with the Wikipedia categories of the articles. In addition, we find that a distance measure based on the Jensen-Shannon divergence of probability distributions outperforms the cosine similarity. In particular, a newly proposed term distribution taking co-occurrence of terms into account gives best results.
Keywords :
information analysis; pattern clustering; statistical distributions; Jensen-Shannon divergence; Wikipedia articles evaluation; cosine similarity; induced k-bisecting clustering algorithm; keywords clustering; keywords extraction; probability distributions; topic detection; Clustering algorithms; Data mining; Databases; Expert systems; Humans; Machine learning; Machine learning algorithms; Probability distribution; Text categorization; Wikipedia; Clustering; Datamining; Jensen Shannon Divergence; Keywords; Natural Language Processing; Topic detection;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Database and Expert Systems Application, 2008. DEXA '08. 19th International Workshop on
Conference_Location :
Turin
ISSN :
1529-4188
Print_ISBN :
978-0-7695-3299-8
Type :
conf
DOI :
10.1109/DEXA.2008.120
Filename :
4624691