DocumentCode :
3166291
Title :
Using Burstiness to Improve Clustering of Topics in News Streams
Author :
He, Qi ; Chang, Kuiyu ; Lim, Ee-Peng
Author_Institution :
Nanyang Technol. Univ., Nanyang Avenue
fYear :
2007
fDate :
28-31 Oct. 2007
Firstpage :
493
Lastpage :
498
Abstract :
Specialists who analyze online news have a hard time separating the wheat from the chaff. Moreover, automatic data-mining techniques like clustering of news streams into topical groups can fully recover the underlying true class labels of data if and only if all classes are well separated. In reality, especially for news streams, this is clearly not the case. The question to ask is thus this: if we cannot recover the full C classes by clustering, what is the largest K < C clusters we can find that best resemble the K underlying classes? Using the intuition that bursty topics are more likely to correspond to important events that are of interest to analysts, we propose several new bursty vector space models (B-VSM) for representing a news document. B-VSM takes into account the burstiness (across the full corpus and whole duration) of each constituent word in a document at the time of publication. We benchmarked our B-VSM against the classical TFIDF-VSM on the task of clustering a collection of news stream articles with known topic labels. Experimental results show that B-VSM was able to find the burstiest clusters/topics. Further, it also significantly improved the recall and precision for the top K clusters/topics.
Keywords :
data mining; document handling; information resources; media streaming; automatic data mining; burstiness; bursty topics; bursty vector space model; news document representation; news stream article; news stream clustering; online news analysis; topic label; topics clustering; Clustering methods; Data engineering; Data mining; Functional analysis; Helium; Nominations and elections; Organizing; Telecommunication traffic;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Data Mining, 2007. ICDM 2007. Seventh IEEE International Conference on
Conference_Location :
Omaha, NE
ISSN :
1550-4786
Print_ISBN :
978-0-7695-3018-5
Type :
conf
DOI :
10.1109/ICDM.2007.17
Filename :
4470279