• DocumentCode
    2678331
  • Title

    K-Means for Search Results Clustering Using URL and Tag Contents

  • Author

    Poomagal, S. ; Hamsapriya, T.

  • Author_Institution
    Dept. of Comput. & Inf. Sci., PSG Coll. of Technol., Coimbatore, India
  • fYear
    2011
  • fDate
    20-22 July 2011
  • Firstpage
    1
  • Lastpage
    7
  • Abstract
    The growing volume of the web floods search results with large collections of web documents, making it difficult for users to find the documents they need. Clustering addresses this by organizing search results for easier browsing: it is the process of combining similar web documents into groups. For web page clustering, terms (features) can be extracted from different parts of a web page. Giansalvatore, Salvatore and Alessandro extracted terms from the entire web page for clustering, whereas Stanisław Osiński et al. considered terms only from snippets. This paper introduces a new method that extracts terms from the URL, Title tag and Meta tag to produce clusters of web documents. These parts of a web page are selected because they contain the keywords that occur in the page. The clustering algorithm used in this paper is K-means. The proposed clustering method is compared with snippet-based clustering in terms of intra-cluster distance and inter-cluster distance.
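    The pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the stop list, the raw term-frequency weighting, the first-k centroid initialization, the choice of the keywords/description Meta tags, the sample pages, and the exact definitions of intra-cluster and inter-cluster distance are all assumptions made for the example.

```python
import math
import re
from collections import Counter
from html.parser import HTMLParser
from itertools import combinations
from urllib.parse import urlparse

STOP = {"www", "com", "org", "net", "http", "https", "html"}  # hypothetical stop list


class TitleMetaParser(HTMLParser):
    """Collects <title> text and the content of keywords/description <meta> tags."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self.in_title = True
        elif tag == "meta" and (attrs.get("name") or "").lower() in ("keywords", "description"):
            self.chunks.append(attrs.get("content") or "")

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.chunks.append(data)


def extract_terms(url, html):
    """Return lowercase terms taken from the URL, Title tag and Meta tags only."""
    parser = TitleMetaParser()
    parser.feed(html)
    parsed = urlparse(url)
    text = " ".join([parsed.netloc, parsed.path] + parser.chunks).lower()
    return [t for t in re.findall(r"[a-z]+", text) if t not in STOP and len(t) > 2]


def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def kmeans(vectors, k, max_iter=100):
    """Plain K-means; the first k vectors seed the centroids (assumed, for determinism)."""
    centroids = [list(v) for v in vectors[:k]]
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda c: dist(v, centroids[c])) for v in vectors]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = mean(members)
    return labels, centroids


# Hypothetical search results: (URL, HTML) pairs invented for this example.
pages = [
    ("http://example.com/python-tutorial",
     '<title>Python Tutorial</title><meta name="keywords" content="python, programming">'),
    ("http://example.com/jaguar-cars",
     '<title>Jaguar Cars</title><meta name="keywords" content="jaguar, cars">'),
    ("http://example.com/learn-python", "<title>Learn Python Programming</title>"),
    ("http://example.com/jaguar-dealer", "<title>Jaguar Cars Dealer</title>"),
]

counts = [Counter(extract_terms(url, html)) for url, html in pages]
vocab = sorted(set().union(*counts))
vectors = [[c[t] for t in vocab] for c in counts]  # raw term-frequency vectors

labels, centroids = kmeans(vectors, k=2)
# Assumed definitions: mean point-to-centroid distance, mean centroid-to-centroid distance.
intra = sum(dist(v, centroids[lab]) for v, lab in zip(vectors, labels)) / len(vectors)
inter = mean([[dist(a, b)] for a, b in combinations(centroids, 2)])[0]
print(labels, intra, inter)
```

    Under these assumed definitions, a better clustering has a lower intra-cluster distance and a higher inter-cluster distance. The same two measures could be computed on snippet-derived vectors to reproduce the kind of comparison the abstract mentions.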
  • Keywords
    Web sites; document handling; feature extraction; information retrieval; pattern clustering; search problems; URL; Web documents; Web page; feature extraction; k-means clustering; meta tag; search result clustering; snippet based clustering; tag content; title tag; Clustering algorithms; Ear; Feature extraction; Frequency measurement; Partitioning algorithms; Search engines; Web pages;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Process Automation, Control and Computing (PACC), 2011 International Conference on
  • Conference_Location
    Coimbatore
  • Print_ISBN
    978-1-61284-765-8
  • Type

    conf

  • DOI
    10.1109/PACC.2011.5978906
  • Filename
    5978906