• DocumentCode
    2457709
  • Title
    On Text Clustering with Side Information
  • Author
    Aggarwal, Charu C.; Zhao, Yuchen; Yu, Philip S.
  • Author_Institution
    IBM T. J. Watson Res. Center, Hawthorne, NY, USA
  • fYear
    2012
  • fDate
    1-5 April 2012
  • Firstpage
    894
  • Lastpage
    904
  • Abstract
    Text clustering has become an increasingly important problem in recent years because of the tremendous amount of unstructured data that is available in various forms in online forums such as the web, social networks, and other information networks. In most cases, the data is not available purely in text form: a great deal of side information accompanies the text documents. Such side information may be of different kinds, such as the links in the document, user-access behavior from web logs, or other non-textual attributes embedded in the text document. Such attributes may contain a tremendous amount of information for clustering purposes. However, the relative importance of this side information may be difficult to estimate, especially when some of it is noisy. In such cases, it can be risky to incorporate side information into the clustering process, because it can either improve the quality of the representation for clustering or add noise to the process. Therefore, we need a principled way to perform the clustering process so as to maximize the advantages of using this side information. In this paper, we design an algorithm that combines classical partitioning algorithms with probabilistic models in order to create an effective clustering approach. We present experimental results on a number of real data sets to illustrate the advantages of such an approach.
  • Keywords
    Internet; pattern clustering; probability; social networking (online); text analysis; Web logs; classical partitioning algorithms; information networks; nontextual attributes; online forums; probabilistic models; side information; social networks; text clustering; text documents; unstructured data; user-access behavior; Approximation methods; Clustering algorithms; Coherence; Context; Noise measurement; Partitioning algorithms; Probabilistic logic
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Data Engineering (ICDE), 2012 IEEE 28th International Conference on
  • Conference_Location
    Washington, DC
  • ISSN
    1063-6382
  • Print_ISBN
    978-1-4673-0042-1
  • Type
    conf
  • DOI
    10.1109/ICDE.2012.111
  • Filename
    6228142