Title :
Stochastic Arabic hybrid diacritizer
Author :
Rashwan, Mohsen ; Attia, Mohamed ; Abdou, Sherif ; Rafea, Ahmed
Author_Institution :
Dept. of Electron. & Electr. Commun., Cairo Univ., Cairo, Egypt
Abstract :
This paper introduces a two-layer stochastic system that automatically diacritizes raw Arabic text. The first layer determines the most likely diacritics by choosing the sequence of full-form Arabic word diacritizations with maximum marginal probability, using an A* lattice search algorithm and m-gram probability estimation. When full-form words are out-of-vocabulary (OOV), the system falls back on a second layer, which factorizes each Arabic word into its possible morphological constituents (prefix, root, pattern, and suffix) and then uses m-gram probability estimation and A* lattice search to select among the possible factorizations and obtain the most likely diacritization sequence. While the second layer has better coverage of possible Arabic forms, the first layer yields better disambiguation results for the same size of training corpora, especially for inferring syntactic (case-ending) diacritics. The presented hybrid system combines the advantages of both layers. The paper details the workings of both layers and the architecture of the hybrid system. We compare the proposed system with that of Habash et al., the best-performing system to our knowledge, using their training and testing corpus. The word error rates are 5.5% for morphological diacritization and 9.4% for syntactic diacritization with the system of Habash et al., versus 3.1% for morphological diacritization and 9.4% for syntactic diacritization with our system.
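Illustrative sketch :
The two-layer fallback described in the abstract can be pictured with a small toy example. The code below is only a sketch under stated assumptions, not the authors' implementation: it uses a bigram model in place of a general m-gram model, scores a path by its joint sequence probability, and replaces the morphological factorization of the second layer with a crude prefix-stripping stand-in. All names (FULL_FORMS, PREFIXES, BIGRAM_P, diacritize) and all romanized word forms and probabilities are hypothetical.

```python
import heapq
import math

# Layer 1 (toy): undiacritized word -> full-form diacritized candidates.
# Romanized placeholder entries, assumed for illustration only.
FULL_FORMS = {
    "ktb": ["kataba", "kutub"],
    "Alwld": ["Alwaladu", "Alwalada"],
}

# Toy prefix table used by the stand-in for layer 2 (assumption).
PREFIXES = {"w": "wa"}

# Hypothetical bigram probabilities over diacritized forms.
BIGRAM_P = {
    ("<s>", "kataba"): 0.6,
    ("<s>", "kutub"): 0.4,
    ("kataba", "Alwaladu"): 0.7,
    ("kataba", "Alwalada"): 0.1,
    ("kutub", "Alwaladu"): 0.2,
    ("kutub", "Alwalada"): 0.3,
}
UNSEEN_P = 1e-4  # flat back-off probability for unseen bigrams (assumption)


def layer2_candidates(word):
    """Crude stand-in for factorization: split off a known prefix and
    reuse the diacritized candidates of the remaining stem."""
    out = []
    for raw, diac in PREFIXES.items():
        stem = word[len(raw):]
        if word.startswith(raw) and stem in FULL_FORMS:
            out.extend(diac + form for form in FULL_FORMS[stem])
    return out or [word]  # nothing matched: leave the word undiacritized


def candidates(word):
    """Use layer 1 when the full form is known, otherwise fall back to layer 2."""
    return FULL_FORMS.get(word) or layer2_candidates(word)


def cost(prev, cur):
    """Negative log-probability of the bigram (prev, cur)."""
    return -math.log(BIGRAM_P.get((prev, cur), UNSEEN_P))


def diacritize(words):
    """A* search over the candidate lattice for the lowest-cost path."""
    lattice = [candidates(w) for w in words]
    n = len(lattice)
    # h[i] is a lower bound on the cost of completing a path from position i:
    # the sum of the cheapest arc entering each remaining position.
    h = [0.0] * (n + 1)
    for i in range(n - 1, -1, -1):
        prevs = lattice[i - 1] if i > 0 else ["<s>"]
        h[i] = h[i + 1] + min(cost(p, c) for p in prevs for c in lattice[i])
    # Queue entries: (g + h, g, position, last chosen form, path so far).
    frontier = [(h[0], 0.0, 0, "<s>", ())]
    closed = set()
    while frontier:
        _, g, i, last, path = heapq.heappop(frontier)
        if i == n:
            return list(path)
        if (i, last) in closed:
            continue
        closed.add((i, last))
        for form in lattice[i]:
            g2 = g + cost(last, form)
            heapq.heappush(frontier, (g2 + h[i + 1], g2, i + 1, form, path + (form,)))
    return []
```

With these toy numbers, diacritize(["ktb", "Alwld"]) returns ["kataba", "Alwaladu"], the path with joint probability 0.6 x 0.7. An out-of-vocabulary input such as "wktb" is routed through layer2_candidates, mirroring the way the hybrid system uses the second layer only when the first has no full-form entry.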
Keywords :
learning (artificial intelligence); natural language processing; probability; search problems; stochastic processes; text analysis; A* lattice search algorithm; hybrid system; m-gram probability estimation; machine learning; maximum marginal probability; morphological constituent; morphological diacritization; out-of-vocabulary; stochastic Arabic hybrid diacritizer; syntactic diacritization; text analysis; Computer science; Lattices; Morphology; Speech synthesis; Stochastic processes; Stochastic systems; System testing; Tagging; Training data; Vocabulary;
Conference_Title :
Natural Language Processing and Knowledge Engineering, 2009. NLP-KE 2009. International Conference on
Conference_Location :
Dalian
Print_ISBN :
978-1-4244-4538-7
Electronic_ISBN :
978-1-4244-4540-0
DOI :
10.1109/NLPKE.2009.5313742