Title :
Detection of Verbatim or Partial Duplication from Multiple Source Documents Using Data Mining Techniques and Case-Based Reasoning Methodologies
Author :
Chaudhuri, Chitrita ; Chaudhuri, Atal
Author_Institution :
Dept. of Comput. Sci. & Eng., Jadavpur Univ., Kolkata, India
Abstract :
This paper specifies a Case-Based Reasoning strategy for correctly classifying and storing electronic text material and for preventing its duplication. Preserving complete source documents in order to check the similarity between them poses a daunting spatial and computational burden for researchers in this area. The problem is partially solved by applying preprocessing steps that substantially reduce the volume of data to be handled. The volume of a text document is reduced by applying a stemming algorithm and by eliminating stop words, using text-mining measures such as TF-IDF. A third technique extracts keywords and stores them in a properly indexed base. These keywords then serve a dual purpose: they support Lazy Learning classification for automatic subject-wise archiving, and they form the relevant word sequences used to detect plagiarism with Association Rule-mining techniques.
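The preprocessing steps named in the abstract (stop-word elimination, stemming, and TF-IDF-based keyword extraction) can be illustrated with a minimal sketch. This is not the authors' implementation: the stop-word list, the suffix-stripping `crude_stem` function (a stand-in for a real stemming algorithm), and the names `preprocess` and `tfidf_keywords` are all assumptions made for illustration only.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch).
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "are", "to", "in", "for", "by", "on", "from"}


def crude_stem(word):
    """Strip a few common English suffixes; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so very short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, tokenize on letters, drop stop words, then stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]


def tfidf_keywords(docs, top_k=3):
    """Return the top_k highest-scoring TF-IDF terms for each document."""
    token_lists = [preprocess(d) for d in docs]
    n_docs = len(token_lists)

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))

    keywords = []
    for tokens in token_lists:
        counts = Counter(tokens)
        total = len(tokens) or 1
        scores = {
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
        # Sort by descending score, breaking ties alphabetically.
        ranked = sorted(scores, key=lambda term: (-scores[term], term))
        keywords.append(ranked[:top_k])
    return keywords
```

Under these assumptions, the keyword lists returned by `tfidf_keywords` would be the compact representation stored in the indexed base in place of the full documents; terms shared by every document score zero and therefore rank last.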
Keywords :
case-based reasoning; data mining; information retrieval; learning (artificial intelligence); pattern classification; reproduction (copying); text analysis; TF-IDF; association rule mining; automatic subject wise archiving; case based reasoning; data handling; electronic text material; keyword extraction; lazy learning classification; multiple source document; partial duplication; plagiarism detection; similarity check; stemming algorithm; stopword elimination; text document; text mining measure; verbatim detection; word sequences; Algorithm design and analysis; Classification algorithms; Cognition; Frequency conversion; Plagiarism; Time frequency analysis; Association Rule-mining techniques; Case-Based Reasoning strategies
Conference_Title :
Emerging Applications of Information Technology (EAIT), 2011 Second International Conference on
Conference_Location :
Kolkata
Print_ISBN :
978-1-4244-9683-9
DOI :
10.1109/EAIT.2011.31