• DocumentCode
    607338
  • Title

    Text similarity calculations using text and syntactical structures

  • Author

    Elhadi, M.T.

  • Author_Institution
    Dept. of Comput. Sci., Univ. of Zawia, Zawia, Libya
  • fYear
    2012
  • fDate
    3-5 Dec. 2012
  • Firstpage
    715
  • Lastpage
    719
  • Abstract
    This paper reports on experiments performed to investigate the use of syntactical structures of sentences as the basis of similarity calculation between two text documents. Sentences of the documents are converted into ordered Part of Speech (POS) tags, which are then fed to a Longest Common Subsequence (LCS) algorithm to determine the size and count of the LCSs found when comparing the documents sentence by sentence. In the first stage, the syntactical features of the text are used as a structural representation of the document's text. This representation also serves as a text reduction that improves the efficiency of the LCS algorithm during comparison. In the second stage, documents that score well in the first stage, as measured by an accumulative score that is a function of the number of LCSs, are subjected to further comparison using the actual sentences (content words) in a sentence-by-sentence fashion, producing a final measure of similarity based on common words (accumulated over the whole file) and the total number of LCSs from the first stage. Experiments on two different corpora show the utility of the proposed procedure in calculating similarities between written documents.
  • Keywords
    text analysis; LCS algorithm; POS tags; common words; document sentence; longest common subsequence algorithm; part of speech tags; structural representation; syntactical structures; text documents; text reduction; text similarity calculations; text structures; written documents
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Computing and Convergence Technology (ICCCT), 2012 7th International Conference on
  • Conference_Location
    Seoul
  • Print_ISBN
    978-1-4673-0894-6
  • Type

    conf

  • Filename
    6530427
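
The two-stage comparison described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the author's implementation: the abstract does not give the scoring function, so the function names, the `min_len` threshold, and the best-match pairing of sentences are all hypothetical choices. Input is assumed to be already split into sentences and POS-tagged, so no particular tagger is implied.

```python
from typing import List, Sequence, Tuple


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence of two token sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def structural_score(
    doc_a_tags: List[List[str]],
    doc_b_tags: List[List[str]],
    min_len: int = 3,  # hypothetical threshold, not taken from the paper
) -> Tuple[int, int]:
    """Stage one (assumed form): compare POS-tag sequences sentence by sentence.

    For each sentence in A, take its best LCS against any sentence in B.
    Returns (count of LCSs of at least min_len, summed size of those LCSs).
    """
    count, total_size = 0, 0
    for sent_a in doc_a_tags:
        best = max((lcs_length(sent_a, sent_b) for sent_b in doc_b_tags), default=0)
        if best >= min_len:
            count += 1
            total_size += best
    return count, total_size


def common_word_count(
    doc_a_words: List[List[str]], doc_b_words: List[List[str]]
) -> int:
    """Stage two (assumed form): accumulate shared words over the whole file.

    For each sentence in A, count the distinct words it shares with its
    best-matching sentence in B, then sum over all sentences of A.
    """
    total = 0
    for sent_a in doc_a_words:
        words_a = set(sent_a)
        total += max((len(words_a & set(sent_b)) for sent_b in doc_b_words), default=0)
    return total
```

In this sketch, stage one works only on tag sequences, which are drawn from a small vocabulary and so act as the text reduction the abstract mentions; stage two would be run only on document pairs whose structural score is high enough, and how the two results are combined into a final similarity measure is left open here because the abstract does not specify it.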