DocumentCode :
2091088
Title :
Hierarchical Clustering of Large-Scale Short Conversations Based on Domain Ontology
Author :
Wang, Yongheng ; Guo, Bo
Author_Institution :
Inf. Syst. & Manage. Sch., Nat. Univ. of Defense Technol., Changsha, China
Volume :
1
fYear :
2008
fDate :
20-22 Dec. 2008
Firstpage :
126
Lastpage :
130
Abstract :
With the rapid development of the Internet and communication technology, huge volumes of data have accumulated. Short text, such as chat-room conversations and email, is common in such data. Clustering these short documents is useful for revealing the structure of the data or for supporting other data mining applications. However, most current clustering algorithms cannot achieve acceptable accuracy because keywords appear with low frequency in short documents. It is also difficult to process high-dimensional text data in very large databases. In this paper, we propose a hierarchical clustering algorithm that uses a domain ontology to improve clustering accuracy. The algorithm is also parallel and frequent-concept based, which makes it scalable to very large, high-dimensional text data. Our experimental study shows that this algorithm is more accurate than other hierarchical clustering algorithms when clustering short conversations. Furthermore, the algorithm scales well and can be used to process very large data sets.
Keywords :
data mining; electronic mail; ontologies (artificial intelligence); pattern clustering; text analysis; Internet; chatting room; data mining; domain ontology; email; hierarchical clustering; large-scale short conversations; text data; very large databases; Buildings; Clustering algorithms; Communications technology; Data mining; Databases; Frequency; Internet; Large-scale systems; Ontologies; Scalability; Domain Ontology; Hierarchical Clustering; Short Conversations;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Computer Science and Computational Technology, 2008. ISCSCT '08. International Symposium on
Conference_Location :
Shanghai
Print_ISBN :
978-1-4244-3746-7
Type :
conf
DOI :
10.1109/ISCSCT.2008.210
Filename :
4731390