• DocumentCode
    3686549
  • Title
    Document clustering based on time series
  • Author
    Liviu Sebastian Matei;Ştefan Trăuşan-Matu
  • Author_Institution
    University Politehnica of Bucharest, Faculty of Automatic Control and Computer Science, Bucharest, Romania
  • fYear
    2015
  • Firstpage
    128
  • Lastpage
    133
  • Abstract
    This paper presents a novel document clustering algorithm that represents each document as a time series of words. Document clustering groups documents according to chosen criteria, which is increasingly important given the large number of articles now available. Representing a document as a time series rather than with the vector model makes it possible to compute the distance between documents with dynamic time warping. This representation, together with the dynamic time warping algorithm, forms the foundation for computing document similarity and for clustering the documents. The clustering algorithm used is hierarchical clustering. The method is applied to the named entities and to the parts of speech of the words that compose the documents. The test data is the Reuters corpus of newspaper articles.
  • Keywords
    "Time series analysis","Clustering algorithms","Speech","Heuristic algorithms","Signal processing algorithms","Computational modeling","Algorithm design and analysis"
  • Publisher
    IEEE
  • Conference_Titel
    System Theory, Control and Computing (ICSTCC), 2015 19th International Conference on
  • Type
    conf
  • DOI
    10.1109/ICSTCC.2015.7321281
  • Filename
    7321281
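
The pipeline outlined in the abstract (documents as sequences, dynamic time warping as the distance, hierarchical clustering on top) can be illustrated with a minimal sketch. This is not the authors' implementation. The 0/1 mismatch cost, the average-linkage rule, the function names `dtw_distance` and `agglomerative`, and the toy part-of-speech sequences in `docs` are all assumptions made for illustration; the abstract does not specify them.

```python
from itertools import combinations


def dtw_distance(a, b, cost=lambda x, y: 0.0 if x == y else 1.0):
    """Dynamic time warping distance between two symbol sequences.

    Assumption: a 0/1 mismatch cost between symbols (not from the paper).
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # d[i][j] = minimal cumulative cost of aligning a[:i] with b[:j]
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = cost(a[i - 1], b[j - 1])
            # Extend the cheapest of the three admissible predecessor cells.
            d[i][j] = step + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]


def agglomerative(items, k, dist=dtw_distance):
    """Bottom-up hierarchical clustering, merging until k clusters remain.

    Assumption: average linkage (the abstract does not name a linkage rule).
    Returns clusters as sorted lists of item indices.
    """
    clusters = [[i] for i in range(len(items))]
    # Precompute all pairwise distances once.
    pair = {(i, j): dist(items[i], items[j])
            for i, j in combinations(range(len(items)), 2)}

    def linkage(c1, c2):
        total = sum(pair[(min(x, y), max(x, y))] for x in c1 for y in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > k:
        # Find the closest pair of clusters (p < q by construction).
        p, q = min(combinations(range(len(clusters)), 2),
                   key=lambda pq: linkage(clusters[pq[0]], clusters[pq[1]]))
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return [sorted(c) for c in clusters]


# Hypothetical toy "documents", each a sequence of part-of-speech tags.
docs = [
    ["DT", "NN", "VB", "DT", "NN"],
    ["DT", "JJ", "NN", "VB", "DT", "NN"],
    ["NNP", "NNP", "VB", "CD", "NN"],
    ["NNP", "VB", "CD", "NN"],
]
clusters = agglomerative(docs, 2)
```

Because dynamic time warping lets one symbol align with several consecutive symbols in the other sequence, documents of different lengths can be compared directly, which a fixed-length vector model does not allow without discarding word order.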