DocumentCode :
2850899
Title :
The anatomy of a hierarchical clustering engine for Web-page, news and book snippets
Author :
Ferragina, Paolo ; Gullí, Antonio
Author_Institution :
Dipt. di Informatica, Università di Pisa, Italy
fYear :
2004
fDate :
1-4 Nov. 2004
Firstpage :
395
Lastpage :
398
Abstract :
In this paper, we investigate the Web snippet hierarchical clustering problem in its full extent by devising an algorithmic solution, and a software prototype called SnakeT (accessible at http://roquefort.di.unipi.it/), that: (1) draws the snippets from 16 Web search engines, the Amazon collection of books a9.com, the news of Google News and the blogs of Blogline; (2) builds the clusters on-the-fly (ephemeral clustering (Maarek et al., 2000)) in response to a user query without adopting any predefined organization in categories; (3) labels the clusters with sentences of variable length, drawn from the snippets and possibly missing some terms, provided they are not too many; (4) uses some ranking functions which exploit two knowledge bases properly built by our engine at preprocessing time for the sentence selection and cluster-assignment process; (5) organizes the clusters into a hierarchy, and assigns to the nodes intelligible sentences in order to allow post-navigation for query refinement. Our clustering algorithm may let the clusters overlap at different levels of the hierarchy.
Keywords :
Web sites; information retrieval; knowledge based systems; pattern clustering; search engines; Amazon collection; Blogline; Google News; SnakeT; Web page; Web search engines; book snippets; cluster assignment; clusters on-the-fly; ephemeral clustering; hierarchical clustering engine; knowledge bases; query refinement; ranking functions; sentences selection; Anatomy; Blogs; Books; Clustering algorithms; Data mining; Search engines; Software algorithms; Software architecture; Surges; Web pages;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Data Mining, 2004. ICDM '04. Fourth IEEE International Conference on
Print_ISBN :
0-7695-2142-8
Type :
conf
DOI :
10.1109/ICDM.2004.10027
Filename :
1410319