Title :
Extracting N-gram terms collocation from tagged Arabic corpus
Author :
Alromima, Waseem ; Moawad, Ibrahim F. ; Elgohary, Rania ; Aref, Mostafa
Author_Institution :
Fac. of Comput. & Inf. Sci., Ain Shams Univ., Cairo, Egypt
Abstract :
Information Extraction (IE) is one of the most important Natural Language Processing (NLP) applications; it extracts information such as Named Entities (NE) and term collocations from a corpus. A collocation is a sequence of terms that co-occur in the corpus. Arabic Information Extraction faces many problems because of the complexity and ambiguity of Arabic grammar. In linguistics research, a corpus annotated with Part-of-Speech Tagging (POST) is generally the more effective one. In this paper, we propose a prototype that extracts collocations of N-gram words (2- to 6-grams) from the Arabic Quran corpus based on sequences of POST tags. The approach extracts N-gram collocations by matching an input structural pattern of the Arabic language against the part-of-speech tags of the Quran corpus. The system lets users select a sequence of tags (2 to 6 tags) and the scope of the corpus source (the whole Quran corpus or a specific Surah). To show how the system benefits linguistic research, a set of experiments was conducted in different scenarios.
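The tag-sequence matching described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `extract_collocations`, the `(word, tag)` token representation, the tag labels, and the toy corpus are all assumptions made for illustration. The sketch slides a window of the pattern's length over a tagged token list and keeps every window whose tags equal the requested pattern.

```python
from typing import List, Tuple

# Assumed representation: each token is a (word, POS tag) pair.
Token = Tuple[str, str]


def extract_collocations(tagged: List[Token], pattern: List[str]) -> List[Tuple[str, ...]]:
    """Return every word n-gram whose tag sequence equals `pattern`.

    Hypothetical helper; the 2..6 length limit mirrors the range
    stated in the abstract.
    """
    n = len(pattern)
    if not 2 <= n <= 6:
        raise ValueError("pattern length must be between 2 and 6")

    matches = []
    # Slide a window of n tokens across the corpus.
    for i in range(len(tagged) - n + 1):
        window = tagged[i:i + n]
        # Keep the window only if every tag matches the pattern in order.
        if all(tag == wanted for (_, tag), wanted in zip(window, pattern)):
            matches.append(tuple(word for word, _ in window))
    return matches


# Toy data with placeholder words and tags (not taken from the Quran corpus).
toy_corpus = [
    ("w1", "N"), ("w2", "ADJ"), ("w3", "V"),
    ("w4", "N"), ("w5", "ADJ"), ("w6", "P"),
]

bigrams = extract_collocations(toy_corpus, ["N", "ADJ"])
trigrams = extract_collocations(toy_corpus, ["N", "ADJ", "V"])
```

Restricting the scope to a single Surah would, under the same assumptions, amount to passing only that Surah's tagged tokens as `tagged`.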
Keywords :
computational linguistics; grammars; natural language processing; Arabic Quran corpus; Arabic grammar; Arabic information extraction; Arabic language; N-gram terms collocation; N-gram words; NLP application; POST; corpus source; linguistics research; named-entity; natural language processing application; part of speech tagging; specific Surah; tagged Arabic corpus; Data mining; Educational institutions; Natural language processing; Pattern matching; Pragmatics; Speech; Tagging; Arabic Phrases; Computational linguistics; Information Extraction; Part-of-Speech Tagging (POST); n-gram;
Conference_Titel :
Informatics and Systems (INFOS), 2014 9th International Conference on
Conference_Location :
Cairo
Print_ISBN :
978-977-403-689-7
DOI :
10.1109/INFOS.2014.7036700