Regional Information Center for Science and Technology (RICEST) - Deep Learning Framework with Confused Sub-Set Resolution Architecture for Automatic Arabic Diacritization

DocumentCode :

82072

Title :

Deep Learning Framework with Confused Sub-Set Resolution Architecture for Automatic Arabic Diacritization

Author :

Rashwan, Mohsen A. A. ; Al Sallab, Ahmad A. ; Raafat, Hazem M. ; Rafea, Ahmed

Author_Institution :

Eng. Co. for the Dev. of Comput. Syst. (RDI), Giza, Egypt

Volume :

Issue :

fYear :

2015

fDate :

March 2015

Firstpage :

505

Lastpage :

516

Abstract :

The Arabic language belongs to a group of languages that require diacritization over their characters. Modern Standard Arabic (MSA) transcripts omit the diacritics, which are essential for many machine learning tasks such as Text-To-Speech (TTS) systems. In this work, Arabic diacritics restoration is tackled under a deep learning framework that includes the Confused Sub-set Resolution (CSR) method to improve classification accuracy, in addition to an Arabic Part-of-Speech (PoS) tagging framework using deep neural nets. Special focus is given to syntactic diacritization, which still suffers from low accuracy, as indicated in prior work. The framework is evaluated against state-of-the-art systems reported in the literature on challenging datasets collected from different domains. Standard datasets such as the LDC Arabic Tree Bank are used in addition to custom ones we have made available online so that other researchers can replicate these results. Results show significant improvement of the proposed techniques over other approaches, reducing the syntactic classification error to 9.9% and the morphological classification error to 3%, compared to 12.7% and 3.8% for the best reported results in the literature, a relative error reduction of about 22% over the best reported systems.
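
The error figures quoted in the abstract can be checked with a short calculation. The sketch below assumes that the 22% figure refers to the relative reduction in error rate, that is, (baseline - proposed) / baseline; the abstract does not state the formula, and the function name `relative_reduction` is illustrative only.

```python
def relative_reduction(baseline: float, proposed: float) -> float:
    """Return the fractional drop in error going from baseline to proposed."""
    return (baseline - proposed) / baseline


# Error rates (in percent) as quoted in the abstract.
# Assumption: "22%" means relative reduction, not an absolute difference.
syntactic = relative_reduction(12.7, 9.9)
morphological = relative_reduction(3.8, 3.0)

print(f"syntactic: {syntactic:.1%}")          # syntactic: 22.0%
print(f"morphological: {morphological:.1%}")  # morphological: 21.1%
```

Under this reading, the syntactic figure matches the quoted 22%, while the morphological figure comes out slightly lower, at about 21%.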

Keywords :

learning (artificial intelligence); natural language processing; neural nets; speech synthesis; Arabic diacritics restoration; Arabic language; Arabic part-of-speech tagging; PoS tagging; TTS system; automatic Arabic diacritization; confused subset resolution architecture; deep learning; deep neural nets; machine learning; modern standard Arabic transcript; syntactic diacritization; text-to-speech system; Accuracy; Context; Feature extraction; Standards; Syntactics; Training; Vectors; Arabic diacritization; classifier design; deep networks; part-of-speech (PoS) tagging;

fLanguage :

English

Journal_Title :

Audio, Speech, and Language Processing, IEEE/ACM Transactions on

Publisher :

IEEE

ISSN :

2329-9290

Type :

jour

DOI :

10.1109/TASLP.2015.2395255

Filename :

7050392

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=82072