• DocumentCode
    2337229
  • Title
    A new technique for detecting similar documents based on term co-occurrence and conceptual property of the text
  • Author
    Zamanifar, Azadeh ; Minaei-Bidgoli, Behrouz ; Kashefi, Omid
  • Author_Institution
    Comput. Eng. Dept., Iran Univ. of Sci. & Technol., Tehran
  • fYear
    2008
  • fDate
    13-16 Nov. 2008
  • Firstpage
    526
  • Lastpage
    531
  • Abstract
    The importance of detecting similar documents grows rapidly as the amount of information increases exponentially. This paper presents a new technique for identifying similar documents that combines statistical properties of documents with Persian linguistic features. The technique is best suited to detecting similar documents within specific fields. It is built on lexical chains of important words and on the term co-occurrence property of the text. This prevents irrelevant documents from being identified as similar because of the polysemy of words. It also takes word order into account when identifying similar documents. If a document covers more than one subject, this can also be detected, and similar documents can be found for each of its topics. Our results show improved performance compared to existing word-based methods such as LSI and VSM.
  • Keywords
    object detection; statistical analysis; text analysis; Persian linguistic features; conceptual property; similar document detection; statistical properties; term co-occurrence; Data mining; Databases; Frequency; Indexing; Information retrieval; Large scale integration; Ontologies; Psychology; Statistical analysis; Writing;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Digital Information Management, 2008. ICDIM 2008. Third International Conference on
  • Conference_Location
    London
  • Print_ISBN
    978-1-4244-2916-5
  • Electronic_ISBN
    978-1-4244-2917-2
  • Type
    conf
  • DOI
    10.1109/ICDIM.2008.4746732
  • Filename
    4746732