Title :
Self-Similarity Metric for Index Pruning in Conceptual Vector Space Models
Author :
Bonino, Dario ; Corno, Fulvio
Author_Institution :
Politecnico di Torino, Turin
Abstract :
One of the critical issues in search engines is the size of search indexes: as the number of documents handled by an engine increases, the search must preserve its efficiency despite the growth of indexing structures. A widely accepted solution to this problem is the adoption of smaller, or pruned, indexes that increase retrieval speed while keeping search quality as high as possible. This paper extends the notion of pruned index to semantic search systems based on conceptual vector space models and proposes a new self-similarity metric for index pruning. A conceptual vector space model represents documents as vectors in an n-dimensional space where each dimension corresponds to an ontology concept. The pruning algorithm proposed in this paper acts on the basis of document self-similarity, preserving only the most significant components of a document's conceptual vector. Unlike many previously proposed algorithms, the self-similarity metric is based only on local information and does not require recomputing the whole pruned index when new documents are added, i.e., it can be used on-line, possibly combined with other off-line pruning policies. The proposed metric is tested against two benchmark sets, related respectively to Siderurgy (250 documents annotated with respect to the e-Class ontology) and Disability (2500 documents annotated with respect to the Passepartout ontology). Results show that the compression ratio achieved by this technique is satisfactory (50%), while ranking similarity with results coming from non-pruned indexes remains sufficiently high (80%), thus preserving the quality of the provided results.
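The abstract does not give the exact form of the self-similarity metric, so the following is only an illustrative sketch of the general idea, not the authors' algorithm. It assumes that self-similarity is the cosine similarity between a document's pruned conceptual vector and its own full vector, and that components are kept from heaviest to lightest until that similarity reaches a threshold. The function name `prune_by_self_similarity`, the `threshold` parameter, and the concept labels are hypothetical.

```python
import math


def prune_by_self_similarity(vector, threshold=0.8):
    """Prune one document's conceptual vector using only that vector.

    ASSUMPTION: "self-similarity" is taken here to be the cosine between
    the pruned vector and the full vector; the paper's actual metric may
    differ. `vector` maps ontology concept -> weight.
    """
    norm_sq = sum(w * w for w in vector.values())
    if norm_sq == 0:
        return {}

    pruned = {}
    kept_sq = 0.0
    # Visit components from the largest absolute weight to the smallest.
    for concept, weight in sorted(
        vector.items(), key=lambda item: abs(item[1]), reverse=True
    ):
        if weight == 0:
            break  # remaining components contribute nothing
        pruned[concept] = weight
        kept_sq += weight * weight
        # For a vector restricted to a subset of its own components,
        # cosine(pruned, full) simplifies to sqrt(kept_sq / norm_sq).
        if math.sqrt(kept_sq / norm_sq) >= threshold:
            break
    return pruned


# Hypothetical document vector over three ontology concepts.
doc = {"steel": 4.0, "alloy": 3.0, "furnace": 0.5}
print(prune_by_self_similarity(doc, threshold=0.9))
```

Because the sketch reads only the document's own vector, a newly added document can be pruned without touching the rest of the index, which is the on-line property the abstract emphasizes.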
Keywords :
indexing; information retrieval; ontologies (artificial intelligence); search engines; conceptual vector space models; document conceptual vector; document representation; document self-similarity; e-Class ontology; index pruning; search indexes; search quality; self-similarity metric; semantic search systems; Benchmark testing; Databases; Expert systems; Extraterrestrial measurements; Frequency; Indexes; Ontologies
Conference_Title :
Database and Expert Systems Application, 2008. DEXA '08. 19th International Workshop on
Conference_Location :
Turin
Print_ISBN :
978-0-7695-3299-8
DOI :
10.1109/DEXA.2008.27