Arabic document similarity analysis using n-grams and singular value decomposition

Author

Hussein, Ashraf S.

Author_Institution

Fac. of Inf. Technol. & Comput., Arab Open Univ., Safat, Kuwait

fYear

2015

fDate

13-15 May 2015

Firstpage

445

Lastpage

455

Abstract

Computerized methods for document similarity estimation (or plagiarism detection) in natural languages, developed over the last two decades, have focused on English in particular and on some other languages such as German and Chinese. Several language-independent methods also exist, but their accuracy is not satisfactory, especially for morphologically complex languages such as Arabic. This paper proposes an innovative content-based method for document similarity analysis dedicated to Arabic, in order to bridge the existing gap in such software solutions. The proposed method models the relation between documents and their n-gram phrases. These phrases are generated from the normalized text, exploiting Arabic morphological analysis and lexical lookup. Possible morphological ambiguity is resolved by applying Part-of-Speech (PoS) tagging to the examined documents. Text indexing and stop-word removal are performed using a new method based on morphological analysis of the text. The TF-IDF model of the examined documents is constructed using a heuristic-based pairwise matching algorithm that accounts for lexical and syntactic changes. The hidden associations between the unique n-gram phrases and their documents are then investigated using Latent Semantic Analysis (LSA). Next, the pairwise document subset and similarity measures are derived from the Singular Value Decomposition (SVD) computations. The performance of the proposed method was confirmed through experiments with various data sets, showing promising capabilities in estimating literal and some types of intelligent similarities. Finally, the results of the proposed method were compared with those of Plagiarism-Checker-X; the proposed method outperformed Plagiarism-Checker-X, especially in the intelligent similarity cases with syntactic changes.
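The core of the pipeline described in the abstract (n-gram phrases, a TF-IDF weighted phrase-by-document matrix, and similarity measures derived from an SVD) can be illustrated with a minimal sketch. This is not the author's implementation. It assumes plain whitespace tokenization and omits the Arabic morphological analysis, PoS tagging, stop-word removal, and heuristic pairwise matching that the paper relies on. The function names (`word_ngrams`, `tfidf_matrix`, `lsa_similarity`), the smoothed IDF formula, the cosine measure, and the sample sentences are all assumptions made for illustration.

```python
from collections import Counter

import numpy as np


def word_ngrams(tokens, n):
    """Return the word n-grams of a token list as space-joined phrases."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tfidf_matrix(docs, n=2):
    """Build a phrase-by-document TF-IDF matrix (rows: n-grams, columns: documents)."""
    # Whitespace tokenization is an assumption; the paper normalizes text first.
    counts = [Counter(word_ngrams(d.split(), n)) for d in docs]
    vocab = sorted(set().union(*counts))
    index = {gram: i for i, gram in enumerate(vocab)}
    tf = np.zeros((len(vocab), len(docs)))
    for j, doc_counts in enumerate(counts):
        for gram, k in doc_counts.items():
            tf[index[gram], j] = k
    df = np.count_nonzero(tf, axis=1)
    # Smoothed IDF; the exact weighting used in the paper is not stated here.
    idf = np.log((1 + len(docs)) / (1 + df)) + 1.0
    return tf * idf[:, None], vocab


def lsa_similarity(matrix, rank):
    """Cosine similarity between documents in a rank-limited SVD (LSA) space."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    k = min(rank, len(s))
    doc_vecs = (s[:k, None] * vt[:k, :]).T  # one row per document
    norms = np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for empty documents
    unit = doc_vecs / norms
    return unit @ unit.T


# Hypothetical sample documents: the first two differ in a single word.
docs = [
    "ذهب الولد إلى المدرسة صباح اليوم",
    "ذهب الولد إلى المدرسة مساء اليوم",
    "يحب الطالب قراءة الكتب العلمية",
]
matrix, vocab = tfidf_matrix(docs, n=2)
sim = lsa_similarity(matrix, rank=2)
print(np.round(sim, 3))
```

In this sketch the first two documents share three of their five bigrams, so their entry in `sim` should be higher than either document's entry against the third, which shares none. The `rank` argument controls how many singular values are kept; choosing it is a tuning decision that the abstract does not specify.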

Keywords

estimation theory; indexing; natural language processing; singular value decomposition; text analysis; Arabic document similarity analysis; Arabic language; Arabic morphology analysis; Chinese language; English language; German language; Heuristic based pairwise matching algorithm; LSA; PoS tagging; SVD computation; computerized method; content-based method; document similarity estimation; language-independent method; latent semantic analysis; lexical change; lexical lookup; morphological ambiguity; n-gram; natural language; normalized text; pairwise document subset; part-of-speech tagging; plagiarism detection; plagiarism-checker-X; similarity measures; software solution; stop-words removal; syntactic change; text indexing; text morphological analysis; Estimation; Indexing; Matrix decomposition; Plagiarism; Semantics; Text analysis; Latent Semantic Analysis; Singular Value Decomposition; plagiarism check; similarity estimation; text mining; text-reuse

fLanguage

English

Publisher

IEEE

Conference_Titel

Research Challenges in Information Science (RCIS), 2015 IEEE 9th International Conference on

Conference_Location

Athens

Type

conf

DOI

10.1109/RCIS.2015.7128906

Filename

7128906

Link To Document

https://search.isc.ac/dl/search/defaultta.aspx?DTC=49&DC=713933