Title :
A Novel Adjustable Matrix Bloom Filter-Based Copy Detection System for Digital Libraries
Author :
Geravand, Shahabeddin ; Ahmadi, Mahmood
Author_Institution :
Dept. of Comput. Eng., Islamic Azad Univ. of Arak, Arak, Iran
Date :
Aug. 31 2011-Sept. 2 2011
Abstract :
With the increasing volume of online literature on the Internet and the ease of finding and downloading data, dishonest use of others' findings, known as plagiarism, is becoming steadily worse. A copy detection system is therefore needed to address this problem efficiently. Most current systems focus on a single goal: estimating similarity with the highest possible accuracy, i.e. 100%. In some real applications, however, it can be useful to take other factors into account, such as query speed, memory usage, and security of content, at the cost of reducing accuracy by a few percent. In this paper, we propose an innovative adjustable copy-paste detection system that allows these factors to be tuned according to the application requirements. The core of our design is a new extension of Bloom filters, called the Matrix Bloom Filter (MBF), which provides the adjustability of the system. A matrix Bloom filter is defined as a bit matrix in which each entry can only be set or reset. It is used to efficiently maintain all documents of a library. To the best of our knowledge, this is the first work to use the idea behind Bloom filters to solve the copy-paste detection problem while ensuring the privacy of document content, and also the first work aiming to provide this adjustable property. The experimental results show that our proposed approach provides three main improvements: speeding up the query operation by up to 2.7 times, reducing the memory required, and securing document content, while allowing an adjustable trade-off among all of these factors.
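The abstract does not give the construction in detail, so the following is only a minimal sketch of how a bit matrix with one Bloom-filter row per document might be used for copy detection. Everything named here is an assumption made for illustration and is not taken from the paper: the class MatrixBloomFilter, the helper chunks, the word 3-gram chunking, the salted SHA-256 hashing, and the parameters bits_per_row and num_hashes. The cosine score on bit vectors is one plausible reading of the "cosine similarity measure" keyword.

```python
import hashlib
import math


def chunks(text, size=3):
    """Split text into overlapping word n-grams (an assumed chunking scheme)."""
    words = text.lower().split()
    if len(words) < size:
        return [" ".join(words)] if words else []
    return [" ".join(words[i:i + size]) for i in range(len(words) - size + 1)]


class MatrixBloomFilter:
    """Bit matrix with one Bloom-filter row per document (illustrative only)."""

    def __init__(self, bits_per_row=1024, num_hashes=3):
        self.m = bits_per_row   # row width: more bits, fewer false matches
        self.k = num_hashes     # hash functions applied to each chunk
        self.rows = []          # each row is a Python int used as a bit vector

    def _positions(self, chunk):
        # Derive k bit positions from salted SHA-256 digests (an assumption).
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{chunk}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def _encode(self, text):
        # Bits are only ever set here; only hashed bits are kept, not the text.
        row = 0
        for chunk in chunks(text):
            for pos in self._positions(chunk):
                row |= 1 << pos
        return row

    def add_document(self, text):
        """Append a new row for the document and return its row index."""
        self.rows.append(self._encode(text))
        return len(self.rows) - 1

    def query(self, text):
        """Return one cosine score per stored row, computed on bit vectors."""
        q = self._encode(text)
        q_ones = bin(q).count("1")
        scores = []
        for row in self.rows:
            r_ones = bin(row).count("1")
            common = bin(q & row).count("1")
            denom = math.sqrt(q_ones * r_ones)
            scores.append(common / denom if denom else 0.0)
        return scores


mbf = MatrixBloomFilter()
mbf.add_document("the quick brown fox jumps over the lazy dog")
mbf.add_document("bloom filters trade accuracy for memory and speed")
scores = mbf.query("the quick brown fox jumps over the lazy dog")
```

In this sketch, bits_per_row and num_hashes stand in for the adjustable trade-off: a narrower row uses less memory and compares faster but produces more chance bit collisions, which lowers accuracy. Because each row stores only hashed bit positions, the original text is not kept in the matrix, which is one way the privacy claim could be realized.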
Keywords :
data structures; digital libraries; matrix algebra; Internet; MBF; adjustable matrix Bloom filter based copy detection system; copy paste detection system; data downloading; memory usage; Accuracy; Copper; Databases; Estimation; Memory management; Plagiarism; Security; chunking; copy-paste detection; cosine similarity measure; hash function; matrix Bloom filter
Conference_Title :
Computer and Information Technology (CIT), 2011 IEEE 11th International Conference on
Conference_Location :
Pafos
Print_ISBN :
978-1-4577-0383-6
Electronic_ISBN :
978-0-7695-4388-8
DOI :
10.1109/CIT.2011.61