Title :
Multi-lingual author identification and linguistic feature extraction — A machine learning approach
Author :
Alam, Hassan ; Kumar, Ajit
Author_Institution :
BCL Technol., San Jose, CA, USA
Abstract :
Internet-based services have emerged as one of the most effective platforms for expressing and exchanging views. Most of these services allow anonymous postings. Lately, it has been observed that anonymous postings have been responsible for instigating violence or causing panic. Some studies have attempted to identify the authors of such blogs, but they mostly target English postings. Current author identification systems do not employ rich morphological features for languages such as Arabic (Modern Standard Arabic). In this study we develop a novel semantic feature to aid an author identification system for Arabic. To fully exploit the rich morphology of Arabic, we use parse trees as features. The overall approach uses language-specific NLP parsers, lexicons, semantic processing, thematic role assignment, semantic heuristics, and machine learning techniques to rapidly train systems for the subtleties of Arabic mentioned above. Our system identifies authors on the basis of stylistic and linguistic similarities between the author's existing works and the unidentified text, which takes the form of online blogs and articles. We use a support vector machine (SVM) to identify authors based on these novel features. Our approach yields an accuracy of 98% on Arabic blogs related to law and order and terrorism.
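The general idea of SVM-based author identification described in the abstract can be illustrated with a minimal sketch. It is not the authors' system: the parse-tree and semantic features are replaced here by character trigram counts (a common, language-independent stylistic feature), the classifier is a small linear SVM trained by Pegasos-style subgradient descent, the task is reduced to two authors, and the toy texts and all function names (`char_ngrams`, `train_linear_svm`, `predict`) are hypothetical.

```python
from collections import Counter
import math


def char_ngrams(text, n=3):
    """Return L2-normalised character n-gram counts as a sparse dict.

    Assumption: character trigrams stand in for the paper's parse-tree
    and semantic features, which are not specified in the abstract.
    """
    text = " ".join(text.lower().split())
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    norm = math.sqrt(sum(v * v for v in counts.values())) or 1.0
    return {g: v / norm for g, v in counts.items()}


def train_linear_svm(samples, labels, lam=0.01, epochs=200):
    """Train a linear SVM (no bias term) on the hinge loss.

    Uses Pegasos-style subgradient steps with learning rate 1 / (lam * t).
    Labels must be +1 or -1; samples are sparse feature dicts.
    """
    w = {}
    t = 0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            t += 1
            eta = 1.0 / (lam * t)
            margin = y * sum(w.get(g, 0.0) * v for g, v in x.items())
            # Regularisation step: shrink every weight.
            scale = 1.0 - eta * lam
            for g in w:
                w[g] *= scale
            # Hinge-loss step: only samples inside the margin move w.
            if margin < 1.0:
                for g, v in x.items():
                    w[g] = w.get(g, 0.0) + eta * y * v
    return w


def predict(w, x):
    """Return +1 or -1 according to the sign of the linear score."""
    score = sum(w.get(g, 0.0) * v for g, v in x.items())
    return 1 if score >= 0 else -1


# Hypothetical toy corpus: label +1 is "author A", label -1 is "author B".
texts = [
    "I reckon the harvest shall be plentiful this year, I reckon.",
    "I reckon the river shall rise before the autumn, I reckon.",
    "lol gonna grab pizza 2nite!!! cya",
    "lol gonna watch footy 2nite!!! cya",
]
labels = [1, 1, -1, -1]
samples = [char_ngrams(t) for t in texts]
w = train_linear_svm(samples, labels)

# Attribute two unseen texts to one of the two known authors.
for unseen in ("I reckon the winter shall be long, I reckon.",
               "lol gonna sleep early 2nite!!! cya"):
    print(predict(w, char_ngrams(unseen)), unseen)
```

Because the features are plain character n-grams, the same sketch runs unchanged on Arabic text; a real system would need far more text per author, a multi-class scheme for more than two authors, and richer features of the kind the abstract describes.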
Keywords :
Web sites; feature extraction; learning (artificial intelligence); natural language processing; support vector machines; Arabic blogs; Arabic languages; English postings; Internet based services; SVM; anonymous postings; language-specific NLP parsers; lexicons; linguistic feature extraction; linguistic similarities; machine learning approach; morphological features; multilingual author identification; natural language processing; parse tree; semantic feature; semantic heuristics; semantic processing; stylistic similarities; support vector machine; thematic role assignment; Blogs; Feature extraction; Labeling; Pragmatics; Semantics; Support vector machines; Syntactics; Author Identification; Feature Extraction; NLP; Semantic Features; Support Vector Machine;
Conference_Title :
Technologies for Homeland Security (HST), 2013 IEEE International Conference on
Conference_Location :
Waltham, MA
Print_ISBN :
978-1-4799-3963-3
DOI :
10.1109/THS.2013.6699035