DocumentCode :
676935
Title :
Near real-time thematic clustering of web documents and other internet contents
Author :
Pusztay, Adrian ; Szuley, Janos ; Laki, Sandor
Author_Institution :
Dept. of Phys. of Complex Syst., Eotvos Lorand Univ., Budapest, Hungary
fYear :
2013
fDate :
2-5 Dec. 2013
Firstpage :
307
Lastpage :
312
Abstract :
In the past decade, the Internet has radically changed our lives, enabling us to obtain information in almost real time on everything we are interested in (disasters, political decisions, ordinary events, etc.). Downloading a web page with a browser, instant messaging, and file sharing generate a huge amount of network traffic that carries valuable information on the topics of greatest interest to individual users, user groups, or society as a whole. However, analyzing this huge amount of unstructured textual data poses many challenges, especially when the data cannot be stored off-line and real-time clustering is needed. In this paper, we propose a framework for real-time clustering of textual content from different Internet sources, which we call documents, including posts on Twitter and Facebook, blogs, web sites, and other textual content. To support real-time processing, we extend the spherical on-line K-means clustering algorithm with heuristic improvements: an adaptive dimension reduction technique keeps the dimension of the document space at a reasonable level, and the algorithm gains the ability to open new clusters and remove old ones according to actual demand. The performance of our improved algorithm, called ASKM (Adaptive Streaming K-Means), has been analyzed on a ground truth data set based on the catalog of the Open Directory Project. Furthermore, we consider a much more realistic scenario in which only some parts of the Internet documents are available because of practical limitations of traffic capturing, resulting in incomplete textual documents to be clustered. We show that the proposed method achieves reasonably good accuracy even in this practical case.
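Illustrative sketch (not part of the original record, and not the authors' ASKM implementation): the abstract builds on spherical on-line K-means with clusters that are opened and removed on demand. The Python below shows one plausible shape of that idea under stated assumptions. The class name, the similarity threshold for opening a cluster, the learning rate, and the idle-time rule for removing a cluster are all hypothetical choices, and the adaptive dimension reduction step is omitted because the abstract does not describe it.

```python
import math


def normalize(vec):
    """Scale a sparse term-weight dict to unit Euclidean length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm > 0 else {}


def cosine(a, b):
    """Dot product of two unit-length sparse vectors (their cosine similarity)."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter vector
    return sum(w * b.get(t, 0.0) for t, w in a.items())


class StreamingSphericalKMeans:
    """Hypothetical one-pass spherical K-means with cluster open/remove rules."""

    def __init__(self, open_threshold=0.3, learning_rate=0.1, max_idle=1000):
        # All three parameters are assumptions made for this sketch.
        self.open_threshold = open_threshold  # open a cluster below this similarity
        self.learning_rate = learning_rate    # step size of the centroid update
        self.max_idle = max_idle              # drop clusters unused for this many steps
        self.centroids = {}                   # cluster id -> unit-length sparse vector
        self.last_used = {}                   # cluster id -> step of last assignment
        self.step = 0
        self.next_id = 0

    def add(self, term_weights):
        """Assign one document (term -> weight dict) and return its cluster id."""
        self.step += 1
        x = normalize(term_weights)

        # Find the centroid with the highest cosine similarity.
        best_id, best_sim = None, -1.0
        for cid, centroid in self.centroids.items():
            sim = cosine(x, centroid)
            if sim > best_sim:
                best_id, best_sim = cid, sim

        if best_id is None or best_sim < self.open_threshold:
            # No sufficiently similar cluster exists: open a new one.
            best_id = self.next_id
            self.next_id += 1
            self.centroids[best_id] = x
        else:
            # Spherical on-line update: move toward x, then re-project to unit length.
            moved = dict(self.centroids[best_id])
            for term, weight in x.items():
                moved[term] = moved.get(term, 0.0) + self.learning_rate * weight
            self.centroids[best_id] = normalize(moved)
        self.last_used[best_id] = self.step

        # Remove clusters that have not received a document recently.
        stale = [cid for cid, used in self.last_used.items()
                 if self.step - used > self.max_idle]
        for cid in stale:
            del self.centroids[cid]
            del self.last_used[cid]
        return best_id
```

In this sketch, each incoming document would be converted to a term-weight dictionary (for example, raw term counts) and passed to `add`, which returns the identifier of the cluster it joined or opened. Documents are processed once and never stored, which matches the streaming constraint the abstract describes.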
Keywords :
Internet; content management; pattern clustering; social networking (online); text analysis; ASKM; Facebook; Internet contents; Internet document; Open Directory Project; Twitter; Web documents; Web page downloading; Web sites; adaptive dimension reduction technique; adaptive streaming K-means algorithm; blogs; browser; document space dimension; file sharing; instant messaging; near real-time thematic clustering; network traffic; off-line clustering; real-time processing; real-time textual content clustering; spherical on-line K-means clustering algorithm; traffic capturing; unstructured textual data; Blogs; Electronic publishing; Encyclopedias; Internet; Irrigation; Vectors;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Cognitive Infocommunications (CogInfoCom), 2013 IEEE 4th International Conference on
Conference_Location :
Budapest
Print_ISBN :
978-1-4799-1543-9
Type :
conf
DOI :
10.1109/CogInfoCom.2013.6719262
Filename :
6719262