DocumentCode :
713933
Title :
Arabic document similarity analysis using n-grams and singular value decomposition
Author :
Hussein, Ashraf S.
Author_Institution :
Fac. of Inf. Technol. & Comput., Arab Open Univ., Safat, Kuwait
fYear :
2015
fDate :
13-15 May 2015
Firstpage :
445
Lastpage :
455
Abstract :
Computerized methods for document similarity estimation (or plagiarism detection) in natural languages, developed over the last two decades, have focused on English in particular and on a few other languages such as German and Chinese. Several language-independent methods also exist, but their accuracy is not satisfactory, especially for morphologically complex languages such as Arabic. This paper proposes a content-based method for document similarity analysis dedicated to the Arabic language, in order to bridge the existing gap in such software solutions. The proposed method models the relation between documents and their n-gram phrases. These phrases are generated from the normalized text, exploiting Arabic morphological analysis and lexical lookup. Possible morphological ambiguity is resolved by applying Part-of-Speech (PoS) tagging to the examined documents. Text indexing and stop-word removal are performed with a new method based on morphological analysis of the text. The TF-IDF model of the examined documents is constructed using a heuristic-based pairwise matching algorithm that accounts for lexical and syntactic changes. Then, the hidden associations between the unique n-gram phrases and their documents are investigated using Latent Semantic Analysis (LSA). Next, the pairwise document subset and similarity measures are derived from the Singular Value Decomposition (SVD) computations. The performance of the proposed method was confirmed through experiments with various data sets, which showed promising capabilities in estimating literal similarity and some types of intelligent similarity. Finally, the results of the proposed method were compared with those of Plagiarism-Checker-X, and the proposed method outperformed Plagiarism-Checker-X, especially in the intelligent similarity cases involving syntactic changes.
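As a rough illustration of the front end of the pipeline described in the abstract, the sketch below builds word n-grams, weights them with TF-IDF, and compares documents by cosine similarity. It is not the paper's method: the function names (`word_ngrams`, `tfidf_vectors`, `cosine`), the smoothed IDF formula, the whitespace tokenizer, and the toy English sentences are all assumptions made for illustration. Arabic normalization, morphological analysis, PoS tagging, the heuristic pairwise matching, and the LSA/SVD projection are not modeled here.

```python
import math
from collections import Counter


def word_ngrams(text, n=2):
    """Return the word n-grams (as tuples) of a whitespace-tokenized text."""
    tokens = text.split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tfidf_vectors(docs, n=2):
    """Build one sparse TF-IDF vector (dict: n-gram -> weight) per document."""
    counts = [Counter(word_ngrams(doc, n)) for doc in docs]
    # Document frequency: in how many documents each n-gram occurs.
    df = Counter(gram for c in counts for gram in c)
    n_docs = len(docs)
    vectors = []
    for c in counts:
        total = sum(c.values()) or 1
        vec = {}
        for gram, tf in c.items():
            # Smoothed IDF (an assumed variant), so that n-grams shared by
            # every document still receive a nonzero weight.
            idf = math.log((1 + n_docs) / (1 + df[gram])) + 1.0
            vec[gram] = (tf / total) * idf
        vectors.append(vec)
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(gram, 0.0) for gram, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy documents (hypothetical data, English placeholders for readability).
docs = [
    "the cat sat on the mat",
    "the cat sat on the rug",
    "stock prices fell sharply today",
]
vectors = tfidf_vectors(docs, n=2)
for i in range(len(docs)):
    for j in range(i + 1, len(docs)):
        print(i, j, round(cosine(vectors[i], vectors[j]), 3))
```

In this sketch the first two documents share most of their bigrams and therefore score well above the unrelated third one. In an LSA setting, the same document-by-n-gram weights would be arranged as a matrix and reduced with a truncated SVD before similarities are computed.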
Keywords :
estimation theory; indexing; natural language processing; singular value decomposition; text analysis; Arabic document similarity analysis; Arabic language; Arabic morphology analysis; Chinese language; English language; German language; Heuristic based pairwise matching algorithm; LSA; PoS tagging; SVD computation; computerized method; content-based method; document similarity estimation; language-independent method; latent semantic analysis; lexical change; lexical lookup; morphological ambiguity; n-gram; natural language; normalized text; pairwise document subset; part-of-speech tagging; plagiarism detection; plagiarism-checker-X; similarity measures; singular value decomposition; software solution; stop-words removal; syntactic change; text indexing; text morphological analysis; Estimation; Indexing; Matrix decomposition; Plagiarism; Semantics; Text analysis; Latent Semantic Analysis; Singular Value Decomposition; natural language processing; plagiarism check; similarity estimation; text mining; text-reuse;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Research Challenges in Information Science (RCIS), 2015 IEEE 9th International Conference on
Conference_Location :
Athens
Type :
conf
DOI :
10.1109/RCIS.2015.7128906
Filename :
7128906