• DocumentCode
    3286406
  • Title

    Towards effective processing of large text collections

  • Author

    Szymanski, Janusz ; Krawczyk, Harald

  • Author_Institution
    Dept. of Electron., Telecommun. & Inf., Gdansk Univ. of Technol., Gdańsk, Poland
  • fYear
    2012
  • fDate
    18-20 Sept. 2012
  • Firstpage
    265
  • Lastpage
    270
  • Abstract
    In this article we describe an approach to the parallel implementation of elementary operations for textual data categorization. In the experiments we evaluate the parallel computation of similarity matrices and of the k-means algorithm. The test datasets were prepared as graphs created from Wikipedia articles connected by links. When creating the clustering data packages, we compute eigenvector-eigenvalue pairs for visualization of the datasets. We describe the method used to evaluate clustering quality. Finally, we discuss the results achieved and point out some improvements and perspectives for future development.
  • Keywords
    Web sites; data visualisation; eigenvalues and eigenfunctions; matrix algebra; pattern clustering; text analysis; Wikipedia articles; clustering data packages; clustering quality; dataset visualizations; eigenvalues; eigenvectors; elementary operations; graphs; k-means algorithm; large text collections; parallel computations; similarity matrices; textual data categorization; PCA; documents categorization; text clustering;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Innovative Computing Technology (INTECH), 2012 Second International Conference on
  • Conference_Location
    Casablanca
  • Print_ISBN
    978-1-4673-2678-0
  • Type

    conf

  • DOI
    10.1109/INTECH.2012.6457784
  • Filename
    6457784