DocumentCode
2337229
Title
A new technique for detecting similar documents based on term co-occurrence and conceptual property of the text
Author
Zamanifar, Azadeh ; Minaei-Bidgoli, Behrouz ; Kashefi, Omid
Author_Institution
Comput. Eng. Dept., Iran Univ. of Sci. & Technol., Tehran
fYear
2008
fDate
13-16 Nov. 2008
Firstpage
526
Lastpage
531
Abstract
The importance of detecting similar documents grows rapidly as the amount of information increases exponentially. This paper presents a new technique for identifying similar documents that combines statistical properties of documents with Persian linguistic features. The technique is mainly suited to detecting similar documents within specific fields. It is built on lexical chains of important words and on the term co-occurrence property of the text. It prevents irrelevant documents from being identified as similar because of word polysemy, and it takes word order into account when identifying similar documents. If a document covers more than one subject, this can also be detected, so that similar documents can be found for each topic of the text. Our results show improved performance compared to existing word-based methods such as LSI and VSM.
Keywords
object detection; statistical analysis; text analysis; Persian linguistic features; conceptual property; similar document detection; statistical properties; term co-occurrence; Data mining; Databases; Frequency; Indexing; Information retrieval; Large scale integration; Ontologies; Psychology; Statistical analysis; Writing
fLanguage
English
Publisher
ieee
Conference_Titel
Digital Information Management, 2008. ICDIM 2008. Third International Conference on
Conference_Location
London
Print_ISBN
978-1-4244-2916-5
Electronic_ISBN
978-1-4244-2917-2
Type
conf
DOI
10.1109/ICDIM.2008.4746732
Filename
4746732