• DocumentCode
    589261
  • Title
    Scalable Overlapping Co-clustering of Word-Document Data
  • Author
    Franca, F.O.D.
  • Author_Institution
    Center of Math., Comput. & Cognition (CMCC), Fed. Univ. of ABC (UFABC), Santo Andre, Brazil
  • Volume
    1
  • fYear
    2012
  • fDate
    12-15 Dec. 2012
  • Firstpage
    464
  • Lastpage
    467
  • Abstract
    Text clustering is used in a variety of applications, such as content-based recommendation, categorization, summarization, information retrieval, and automatic topic extraction. Since most pairs of documents share only a small percentage of words, the dataset representation tends to be very sparse, hence the need for a similarity metric capable of partially matching a set of features. The technique known as co-clustering is capable of finding several clusters inside a dataset, with each cluster composed of just a subset of the object and feature sets. In word-document data, this can be useful for identifying clusters of documents pertaining to the same topic even though they share just a small fraction of words. In this paper, a scalable co-clustering algorithm is proposed that uses the locality-sensitive hashing technique to find co-clusters of documents. The proposed algorithm is tested against other co-clustering and traditional algorithms on well-known datasets. The results show that this algorithm is capable of finding clusters more accurately than other approaches while maintaining linear complexity.
  • Keywords
    data structures; pattern clustering; text analysis; dataset representation; locality-sensitive hashing technique; scalable overlapping coclustering; text clustering; word-document data clustering; Accuracy; Clustering algorithms; Complexity theory; Feature extraction; Machine learning; Mutual information; Text mining; co-clustering; hashing
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Machine Learning and Applications (ICMLA), 2012 11th International Conference on
  • Conference_Location
    Boca Raton, FL
  • Print_ISBN
    978-1-4673-4651-1
  • Type
    conf
  • DOI
    10.1109/ICMLA.2012.84
  • Filename
    6406666