DocumentCode :
1931557
Title :
A framework for hierarchical clustering based indexing in search engines
Author :
Gupta, Parul ; Sharma, A.K.
Author_Institution :
Dept. of Comput. Eng., Y.M.C.A. Univ. of Sci. & Technol., Faridabad, India
fYear :
2010
fDate :
28-30 Oct. 2010
Firstpage :
372
Lastpage :
377
Abstract :
Efficient, fast access to the index is key to the performance of Web search engines (WSEs). To improve memory utilization and speed up query resolution, WSEs use inverted file (IF) indexes, which consist of an array of posting lists; each posting list is associated with a term and contains the term together with the identifiers of the documents that contain it. Because the document identifiers are stored in sorted order, each can be stored as the difference from its predecessor, which reduces the size of the index. This paper describes a clustering algorithm that partitions the set of documents into ordered clusters so that documents within the same cluster are similar and are assigned nearby document identifiers. The average difference between successive document identifiers is thereby minimized, which saves storage space. The paper then extends the algorithm to hierarchical clustering, in which similar clusters are merged into a mega cluster and similar mega clusters are in turn combined into a super cluster. These levels of clustering optimize the search process by directing it along a specific path from higher to lower levels, that is, from super clusters to mega clusters, then to clusters, and finally to individual documents, so that the user gets the best-matching results in the least possible time.
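The gap-encoding idea in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: the cluster assignment below is supplied by hand, and all names (`to_gaps`, `from_gaps`, `assign_ids_by_cluster`, `remap_index`, `average_gap`) and the toy index are assumptions made for illustration only. It shows how renumbering documents so that similar ones receive adjacent identifiers lowers the average gap in the posting lists.

```python
from typing import Dict, List


def to_gaps(postings: List[int]) -> List[int]:
    """Keep the first docID, then store differences between successive docIDs."""
    if not postings:
        return []
    return [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]


def from_gaps(gaps: List[int]) -> List[int]:
    """Invert to_gaps by taking a running sum."""
    ids, total = [], 0
    for gap in gaps:
        total += gap
        ids.append(total)
    return ids


def assign_ids_by_cluster(clusters: List[List[int]]) -> Dict[int, int]:
    """Give consecutive new docIDs to documents of the same cluster (clusters assumed given)."""
    new_id, next_id = {}, 1
    for cluster in clusters:
        for doc in cluster:
            new_id[doc] = next_id
            next_id += 1
    return new_id


def remap_index(index: Dict[str, List[int]], new_id: Dict[int, int]) -> Dict[str, List[int]]:
    """Rewrite every posting list with the new docIDs, keeping each list sorted."""
    return {term: sorted(new_id[d] for d in plist) for term, plist in index.items()}


def average_gap(index: Dict[str, List[int]]) -> float:
    """Mean difference between successive docIDs over all posting lists."""
    gaps = [g for plist in index.values() for g in to_gaps(plist)[1:]]
    return sum(gaps) / len(gaps) if gaps else 0.0


# Toy inverted file: term -> sorted docIDs (hypothetical data).
index = {"grid": [1, 3, 5], "index": [2, 4, 6]}
# Hand-made clusters standing in for the output of a clustering step.
clusters = [[1, 3, 5], [2, 4, 6]]

new_id = assign_ids_by_cluster(clusters)
reordered = remap_index(index, new_id)

print(average_gap(index))      # 2.0
print(average_gap(reordered))  # 1.0
```

Smaller gaps need fewer bits under a variable-length integer code, which is where the storage saving would come from; the choice of code is outside the scope of this sketch.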
Keywords :
document handling; indexing; pattern clustering; search engines; storage management; Web search engines; document identifiers; hierarchical clustering; indexing; inverted file; memory utilization; query resolution; Conferences; Grid computing; Document Identifiers Assignment; Hierarchical Clustering; Index compression; Inverted files;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Parallel Distributed and Grid Computing (PDGC), 2010 1st International Conference on
Conference_Location :
Solan
Print_ISBN :
978-1-4244-7675-6
Type :
conf
DOI :
10.1109/PDGC.2010.5679966
Filename :
5679966