DocumentCode :
2730092
Title :
Organizing Hidden-Web Databases by Clustering Visible Web Documents
Author :
Barbosa, Luciano ; Freire, Juliana ; Silva, Altigran
Author_Institution :
Utah Univ., USA
fYear :
2007
fDate :
15-20 April 2007
Firstpage :
326
Lastpage :
335
Abstract :
In this paper, we address the problem of organizing hidden-Web databases. Given a heterogeneous set of Web forms that serve as entry points to hidden-Web databases, our goal is to cluster the forms according to the database domains to which they belong. We propose a new clustering approach that models Web forms as a set of hyperlinked objects and considers visible information in the form context, both within and in the neighborhood of forms, as the basis for similarity comparison. Since the clustering is performed over features that can be automatically extracted, the process is scalable. In addition, because it uses a rich set of metadata, our approach is able to handle a wide range of forms, including content-rich forms that contain multiple attributes as well as simple keyword-based search interfaces. An experimental evaluation over real Web data shows that our strategy generates high-quality clusters, measured in terms of both entropy and F-measure. This indicates that our approach provides an effective and general solution to the problem of organizing hidden-Web databases.
Keywords :
Internet; database management systems; document handling; pattern clustering; Web document clustering; Web forms; hidden-Web database; hyperlinked objects; Context modeling; Crawlers; Data mining; Entropy; Humans; Information retrieval; Large-scale systems; Organizing; Probes; Spatial databases
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Data Engineering, 2007. ICDE 2007. IEEE 23rd International Conference on
Conference_Location :
Istanbul
Print_ISBN :
1-4244-0802-4
Type :
conf
DOI :
10.1109/ICDE.2007.367878
Filename :
4221681