Automatic building of Arabic multi dialect text corpora by bootstrapping dialect words

Author

Almeman, K. ; Lee, Minhung

Author_Institution

Sch. of Comput. Sci., Univ. of Birmingham, Birmingham, UK

fYear

2013

Firstpage

1

Lastpage

6

Abstract

The principal objective of this work is to build multi-dialect Arabic text corpora using a web corpus as a resource. A survey has been conducted to categorise distinct words and phrases that are used in one specific dialect only and not in other dialects; these words are then used to download a text corpus for that dialect. From this experiment we obtained 48M tokens from different Arabic dialects. These dialects were categorised into four main groups: Gulf, Levantine, Egyptian and North African, yielding 14.5M, 10.4M, 13M and 10.1M tokens respectively. The total number of distinct types across all the corpora is 2M. In this paper we describe how the corpora were constructed using these distinct words.
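
The idea summarised in the abstract, assigning a text to a dialect when it contains words used in that dialect only, can be illustrated with a minimal sketch. This is not the authors' implementation: the seed lists are placeholder strings rather than words from the survey, whitespace tokenization is an assumption, and the names `SEEDS`, `classify`, `build_corpora` and `corpus_stats` are hypothetical.

```python
from collections import Counter

# Hypothetical seed lists: placeholder strings, NOT the words from the survey.
SEEDS = {
    "gulf": {"gulf_word_a", "gulf_word_b"},
    "levantine": {"lev_word_a"},
    "egyptian": {"egy_word_a"},
    "north_african": {"naf_word_a"},
}


def tokenize(text):
    """Split on whitespace (an assumption; real Arabic text needs more care)."""
    return text.split()


def classify(text, seeds=SEEDS):
    """Return the one dialect whose seed words occur in text, else None."""
    tokens = set(tokenize(text))
    hits = [dialect for dialect, words in seeds.items() if tokens & words]
    # Keep a text only when exactly one dialect matches.
    return hits[0] if len(hits) == 1 else None


def build_corpora(documents, seeds=SEEDS):
    """Group documents into one list per dialect, dropping ambiguous ones."""
    corpora = {dialect: [] for dialect in seeds}
    for doc in documents:
        dialect = classify(doc, seeds)
        if dialect is not None:
            corpora[dialect].append(doc)
    return corpora


def corpus_stats(docs):
    """Return (number of tokens, number of distinct types) for a corpus."""
    counts = Counter(tok for doc in docs for tok in tokenize(doc))
    return sum(counts.values()), len(counts)
```

Calling `build_corpora` on a list of downloaded texts and then `corpus_stats` on each dialect's list would give per-dialect token and type counts of the kind reported in the abstract. Discarding texts that match more than one dialect is a design choice of this sketch, not something the abstract states.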

Keywords

Internet; building management systems; natural language processing; resource allocation; text analysis; word processing; Egyptian dialects; Gulf dialects; Levantine dialects; North African dialects; Web corpus; automatic Arabic multidialect text corpora building; bootstrapping dialect words; distinct words categorization; specific dialect text corpus; Africa; Context; Encoding; Estimation; Feature extraction; Syntactics; Web pages; Automatic Building; Multi Dialect; Text Corpora;

fLanguage

English

Publisher

IEEE

Conference_Titel

Communications, Signal Processing, and their Applications (ICCSPA), 2013 1st International Conference on

Conference_Location

Sharjah

Print_ISBN

978-1-4673-2820-3

Type

conf

DOI

10.1109/ICCSPA.2013.6487247

Filename

6487247