• DocumentCode
    713933
  • Title
    Arabic document similarity analysis using n-grams and singular value decomposition
  • Author
    Hussein, Ashraf S.
  • Author_Institution
    Fac. of Inf. Technol. & Comput., Arab Open Univ., Safat, Kuwait
  • fYear
    2015
  • fDate
    13-15 May 2015
  • Firstpage
    445
  • Lastpage
    455
  • Abstract
    Computerized methods for document similarity estimation (or plagiarism detection) in natural languages, developed over the last two decades, have focused on English in particular and on a few other languages such as German and Chinese. Several language-independent methods exist, but their accuracy is not satisfactory, especially for morphologically complex languages such as Arabic. This paper proposes an innovative content-based method for document similarity analysis dedicated to the Arabic language, in order to bridge the existing gap in such software solutions. The proposed method is based on modeling the relation between documents and their n-gram phrases. These phrases are generated from the normalized text, exploiting Arabic morphological analysis and lexical lookup. Possible morphological ambiguity is resolved by applying Part-of-Speech (PoS) tagging to the examined documents. Text indexing and stop-word removal are performed using a new method based on morphological analysis of the text. The TF-IDF model of the examined documents is constructed using a heuristic-based pairwise matching algorithm that considers lexical and syntactic changes. Then, the hidden associations between the unique n-gram phrases and their documents are investigated using Latent Semantic Analysis (LSA). Next, the pairwise document subset and similarity measures are derived from the Singular Value Decomposition (SVD) computations. The performance of the proposed method was confirmed through experiments with various data sets, exhibiting promising capabilities in estimating literal and some types of intelligent similarities. Finally, the results of the proposed method were compared with those of Plagiarism-Checker-X, and the proposed method outperformed Plagiarism-Checker-X, especially for the intelligent similarity cases with syntactic changes.
  • Keywords
    estimation theory; indexing; natural language processing; singular value decomposition; text analysis; Arabic document similarity analysis; Arabic language; Arabic morphology analysis; Chinese language; English language; German language; Heuristic based pairwise matching algorithm; LSA; PoS tagging; SVD computation; computerized method; content-based method; document similarity estimation; language-independent method; latent semantic analysis; lexical change; lexical lookup; morphological ambiguity; n-gram; natural language; normalized text; pairwise document subset; part-of-speech tagging; plagiarism detection; plagiarism-checker-X; similarity measures; singular value decomposition; software solution; stop-words removal; syntactic change; text indexing; text morphological analysis; Estimation; Indexing; Matrix decomposition; Plagiarism; Semantics; Text analysis; Latent Semantic Analysis; Singular Value Decomposition; natural language processing; plagiarism check; similarity estimation; text mining; text-reuse;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Title
    Research Challenges in Information Science (RCIS), 2015 IEEE 9th International Conference on
  • Conference_Location
    Athens
  • Type
    conf
  • DOI
    10.1109/RCIS.2015.7128906
  • Filename
    7128906
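
The abstract describes a pipeline that goes from n-gram phrases to a TF-IDF model, then to LSA via SVD, and finally to pairwise similarity. The sketch below illustrates only that general shape and is not the paper's implementation. Whitespace tokenization stands in for the Arabic normalization, morphological analysis, PoS tagging, and heuristic pairwise matching, none of which are reproduced here. The function names (`word_ngrams`, `tfidf_matrix`, `lsa_similarity`), the smoothed IDF formula, the rank `k`, and the sample documents are all assumptions made for illustration.

```python
import math
from collections import Counter

import numpy as np


def word_ngrams(text, n=2):
    """Split on whitespace and return the list of word n-grams (assumed tokenizer)."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tfidf_matrix(docs, n=2):
    """Build a (n-gram x document) TF-IDF matrix; the smoothed IDF is an assumption."""
    counts = [Counter(word_ngrams(d, n)) for d in docs]
    vocab = sorted(set().union(*counts))
    n_docs = len(docs)
    A = np.zeros((len(vocab), n_docs))
    for i, term in enumerate(vocab):
        df = sum(1 for c in counts if term in c)          # document frequency
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0     # smoothed variant
        for j, c in enumerate(counts):
            A[i, j] = c[term] * idf
    return A, vocab


def lsa_similarity(A, k=2):
    """Project documents into a rank-k latent space via SVD and return cosine similarities."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    k = min(k, len(s))
    doc_vecs = (s[:k, None] * Vt[:k]).T                   # one row per document
    norms = np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    norms[norms == 0] = 1.0                               # avoid division by zero
    unit = doc_vecs / norms
    return unit @ unit.T


# Hypothetical sample documents (English placeholders, not the paper's data sets).
docs = [
    "the cat sat on the mat",
    "the cat sat on the rug",
    "stock prices fell sharply today",
]
A, vocab = tfidf_matrix(docs, n=2)
sim = lsa_similarity(A, k=2)
print(np.round(sim, 3))
```

In this toy setup the first two documents share most of their bigrams and so score much higher against each other than against the third, which shares none. Choosing `k` smaller than the number of documents is what lets the latent space merge near-duplicates; the value used here is arbitrary.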