DocumentCode
3686549
Title
Document clustering based on time series
Author
Liviu Sebastian Matei;Ştefan Trăuşan-Matu
Author_Institution
University Politehnica of Bucharest, Faculty of Automatic Control and Computer Science, Bucharest, Romania
fYear
2015
Firstpage
128
Lastpage
133
Abstract
This paper presents a novel document clustering algorithm that represents each document as a time series of words. Document clustering groups documents according to chosen criteria, which matters increasingly as the number of available articles grows. Representing a document as a time series rather than with the vector model makes it possible to compute the distance between documents with dynamic time warping. This representation, together with the dynamic time warping algorithm, forms the basis for computing document similarity and for clustering the documents. The clustering algorithm used is hierarchical clustering. The method is applied to the named entities and to the parts of speech of the words that compose the documents. The test data is the Reuters corpus of newspaper articles.
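The pipeline described in the abstract (sequence representation, dynamic time warping distance, hierarchical clustering) can be illustrated with a minimal sketch. The abstract does not state the local cost function, the linkage criterion, or the stopping rule, so the choices below are assumptions made only for illustration: a 0/1 mismatch cost between symbols, average linkage, and a fixed target number of clusters. The names `dtw_distance`, `agglomerative`, and the toy part-of-speech sequences are hypothetical and are not taken from the paper.

```python
from itertools import combinations


def dtw_distance(a, b, cost=lambda x, y: 0.0 if x == y else 1.0):
    """Dynamic time warping distance between two symbol sequences.

    The 0/1 mismatch cost is an assumption; the paper's cost is not given.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # d[i][j] holds the minimal cumulative cost of aligning a[:i] with b[:j].
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = cost(a[i - 1], b[j - 1])
            # Extend the cheapest of the three admissible predecessor cells.
            d[i][j] = step + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]


def agglomerative(items, distance, num_clusters):
    """Bottom-up hierarchical clustering, returning lists of item indices.

    Average linkage and a fixed cluster count are assumptions.
    """
    clusters = [[i] for i in range(len(items))]
    # Precompute all pairwise distances once, keyed by (smaller, larger) index.
    dist = {
        (i, j): distance(items[i], items[j])
        for i, j in combinations(range(len(items)), 2)
    }

    def linkage(c1, c2):
        total = sum(dist[min(i, j), max(i, j)] for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > num_clusters:
        # Merge the pair of clusters with the smallest average distance.
        x, y = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: linkage(clusters[p[0]], clusters[p[1]]),
        )
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]  # y > x, so index x is unaffected
    return clusters


# Toy documents written as part-of-speech sequences (illustrative data only).
docs = [
    ["DET", "NOUN", "VERB", "DET", "NOUN"],
    ["DET", "ADJ", "NOUN", "VERB", "DET", "NOUN"],
    ["VERB", "ADV", "VERB", "ADV"],
    ["VERB", "ADV", "ADV", "VERB", "ADV"],
]
groups = agglomerative(docs, dtw_distance, num_clusters=2)
```

Because warping lets one symbol align with several consecutive symbols in the other sequence, documents of different lengths can be compared directly, which a fixed-length vector model handles only through a bag-of-words projection that discards word order.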
Keywords
"Time series analysis","Clustering algorithms","Speech","Heuristic algorithms","Signal processing algorithms","Computational modeling","Algorithm design and analysis"
Publisher
IEEE
Conference_Titel
System Theory, Control and Computing (ICSTCC), 2015 19th International Conference on
Type
conf
DOI
10.1109/ICSTCC.2015.7321281
Filename
7321281