Title :
Extracting unknown words from Sina Weibo via data clustering
Author :
Lei, Kai ; Zhang, WeiYang ; Zhang, Kai ; Xu, Kuai
Author_Institution :
Institute of Big Data Technologies, Shenzhen Key Lab for Cloud Computing Technology & Applications, School of Electronics and Computer Engineering(SECE), Peking University, China
Abstract :
Sina Weibo, a Twitter-like microblogging site attracting over 240 million monthly active users to tweet, retweet, and comment, has rapidly become one of the most popular social media sites in China. As many users create new and innovative words on their tweets and comments, it is necessary to extract these emerging words, which do not exist in today´s Chinese vocabulary or dictionary. Towards this end, this paper proposes a novel method based on data clustering of Weibo users and tweets for extracting unknown words from Weibo tweets and comments. Specifically, relying on the similarity of the users who post the tweets, we apply a hierarchical clustering to divide Weibo data into distinct groups, e.g., sports, news stories, movies, before extraction. Comparing with the method of unclustered Weibo data, our experimental results have successfully demonstrated the benefits of the proposed data clustering scheme for improving the recall and accuracy of extracting unknown Chinese words from tweets and comments.
Keywords :
Clustering algorithms; Clustering methods; Context; Couplings; Data mining; Entropy; Social network services;
Conference_Titel :
Communications (ICC), 2015 IEEE International Conference on
Conference_Location :
London, United Kingdom
DOI :
10.1109/ICC.2015.7248483