Title :
A new technique for detecting similar documents based on term co-occurrence and conceptual property of the text
Author :
Zamanifar, Azadeh ; Minaei-Bidgoli, Behrouz ; Kashefi, Omid
Author_Institution :
Comput. Eng. Dept., Iran Univ. of Sci. & Technol., Tehran
Abstract :
The importance of detecting similar documents grows rapidly as the amount of information increases exponentially. This paper presents a new technique for identifying similar documents. It combines statistical properties of documents with Persian linguistic features. The proposed technique is best suited to detecting similar documents in specific fields. The method is built on lexical chains of important words and on the term co-occurrence property of the text. It prevents irrelevant documents from being identified as similar because of the polysemy of words. It also considers word order when identifying similar documents. If a document covers more than one subject, each subject can be found, and similar documents can be detected for each topic of the text. Our results show improved performance compared to existing word-based methods such as LSI and VSM.
Keywords :
object detection; statistical analysis; text analysis; Persian linguistic features; conceptual property; similar document detection; statistical properties; term co-occurrence; Data mining; Databases; Frequency; Indexing; Information retrieval; Large scale integration; Ontologies; Psychology; Statistical analysis; Writing
Conference_Title :
Digital Information Management, 2008. ICDIM 2008. Third International Conference on
Conference_Location :
London
Print_ISBN :
978-1-4244-2916-5
Electronic_ISBN :
978-1-4244-2917-2
DOI :
10.1109/ICDIM.2008.4746732