DocumentCode :
238144
Title :
Finding similar items using LSH and Bloom Filter
Author :
Chauhan, Sachendra Singh ; Batra, Shalini
Author_Institution :
Comput. Sci. & Eng., Thapar Univ., Patiala, India
fYear :
2014
fDate :
8-10 May 2014
Firstpage :
1662
Lastpage :
1666
Abstract :
Similarity search of text documents can be reduced to Approximate nearest Neighbor Search by converting text documents into sets by using Shingling. In the proposed scheme conversion of a document into set is done by using k-shingle and similarity between sets, is calculated using Jaccard similarity. Characteristic matrix, created by searching shingles in each document is generated using the shingles and hash functions. Since the complexity of the matrix is significantly high when shingle set is very large, a technique called `minhashing´ is used to reduce the size of the matrix. Search time can be drastically reduced if characteristics matrix are created by using Bloom Filter, a probabilistic data structure that will reduce O(n) search time to constant time. Minhashing creates a signature matrix which is very less in size as compared with characteristics matrix but gives almost same result. Further, finding the similarity among all pairs of column of signature matrix is still a big problem because comparing n columns take O(n2) time. The time of comparing columns for similarity is reduced by using another technique, Locality Sensitive Hashing. The proposed scheme has been successfully implemented and comparative analysis of standard methodology and proposed methodology has been provided.
Keywords :
computational complexity; data structures; pattern matching; text analysis; Bloom filter; Jaccard similarity; LSH; approximate nearest neighbor search; characteristics matrix; k-shingle; locality sensitive hashing; minhashing; probabilistic data structure; search time reduction; shingling; signature matrix; text documents similarity search; Algorithm design and analysis; Approximation algorithms; Computers; Conferences; Filtering algorithms; Filtering theory; Nearest neighbor searches; Bloom Filter; Characteristic Matrix; Locality Sensitive Hashing; Signature Matrix;
fLanguage :
English
Publisher :
ieee
Conference_Titel :
Advanced Communication Control and Computing Technologies (ICACCCT), 2014 International Conference on
Conference_Location :
Ramanathapuram
Print_ISBN :
978-1-4799-3913-8
Type :
conf
DOI :
10.1109/ICACCCT.2014.7019390
Filename :
7019390
Link To Document :
بازگشت