• DocumentCode
    3207031
  • Title
    Affinity-based similarity measure for Web document clustering
  • Author
    Shyu, Mei-Ling ; Chen, Shu-Ching ; Chen, Min ; Rubin, Stuart H.
  • Author_Institution
    Dept. of Electr. & Comput. Eng., Miami Univ., Coral Gables, FL, USA
  • fYear
    2004
  • fDate
    8-10 Nov. 2004
  • Firstpage
    247
  • Lastpage
    252
  • Abstract
    Compared to regular documents, the major distinguishing characteristic of Web documents is their dynamic hyper-structure. Thus, in addition to the terms or keywords used for regular document clustering, Web document clustering can incorporate dynamic information such as hyperlinks and the access patterns extracted from user query logs. In this paper, we extend the concept of document clustering to Web document clustering by introducing an affinity-based similarity measure, which uses user access patterns to determine the similarities among Web documents via a probabilistic model. Several comparison experiments are conducted on a real data set, and the experimental results demonstrate that the proposed similarity measure outperforms the cosine coefficient and the Euclidean distance method under different document clustering algorithms.
  • Keywords
    Internet; data mining; document handling; information retrieval; Euclidean distance method; Web document clustering; affinity-based similarity measure; cosine coefficient; document retrieval; hyperlinks; probabilistic model; user access patterns; user query logs; Clustering algorithms; Distributed computing; Information systems; Laboratories; Military computing; Multimedia systems; Particle measurements; Systems engineering and theory; Uniform resource locators; World Wide Web;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Information Reuse and Integration, 2004. IRI 2004. Proceedings of the 2004 IEEE International Conference on
  • Print_ISBN
    0-7803-8819-4
  • Type
    conf
  • DOI
    10.1109/IRI.2004.1431469
  • Filename
    1431469