• DocumentCode
    3036332
  • Title
    A similarity-based soft clustering algorithm for documents
  • Author
    Lin, King-Ip; Kondadadi, Ravikumar
  • Author_Institution
    Dept. of Math. Sci., Memphis Univ., Memphis, TN, USA
  • fYear
    2001
  • fDate
    21-21 April 2001
  • Firstpage
    40
  • Lastpage
    47
  • Abstract
    Document clustering is an important tool for applications such as Web search engines. Clustering documents gives the user a good overall view of the information contained in their documents. However, existing algorithms have drawbacks: hard clustering algorithms (where each document belongs to exactly one cluster) cannot detect the multiple themes of a document, while soft clustering algorithms (where each document can belong to multiple clusters) are usually inefficient. We propose SISC (similarity-based soft clustering), an efficient soft clustering algorithm that requires only a given similarity measure and uses randomization to make the clustering efficient. Comparison with existing hard clustering algorithms like K-means and its variants shows that SISC is both effective and efficient.
  • Keywords
    data mining; document handling; pattern clustering; very large databases; K-means clustering; SISC; Web search engines; document clustering; randomization; similarity measure; similarity-based soft clustering; Animals; Clustering algorithms; Keyword search; Parameter estimation; Search engines; Web pages; Web search; Web sites
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Database Systems for Advanced Applications, 2001. Proceedings. Seventh International Conference on
  • Conference_Location
    Hong Kong, China
  • Print_ISBN
    0-7695-0996-7
  • Type
    conf
  • DOI
    10.1109/DASFAA.2001.916362
  • Filename
    916362