DocumentCode :
1697366
Title :
Automatic building of Arabic multi dialect text corpora by bootstrapping dialect words
Author :
Almeman, K. ; Lee, Minhung
Author_Institution :
Sch. of Comput. Sci., Univ. of Birmingham, Birmingham, UK
fYear :
2013
Firstpage :
1
Lastpage :
6
Abstract :
The principal objective of this work is to build multi-dialect Arabic text corpora using a web corpus as a resource. A survey was conducted to categorise distinct words and phrases that are common to one specific dialect only and not used in other dialects, so that a text corpus for that specific dialect could be downloaded. From this experiment we obtained 48M tokens from different Arabic dialects. These dialects were categorised into four main dialects: Gulf, Levantine, Egyptian and North African, yielding 14.5M, 10.4M, 13M and 10.1M tokens respectively. The total number of distinct types across all the corpora is 2M. In this paper we describe how the corpora were constructed using these distinct words.
Keywords :
Internet; building management systems; natural language processing; resource allocation; text analysis; word processing; Egyptian dialects; Gulf dialects; Levantine dialects; North African dialects; Web corpus; automatic Arabic multidialect text corpora building; bootstrapping dialect words; distinct words categorization; specific dialect text corpus; Africa; Context; Encoding; Estimation; Feature extraction; Syntactics; Web pages; Automatic Building; Multi Dialect; Text Corpora;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Communications, Signal Processing, and their Applications (ICCSPA), 2013 1st International Conference on
Conference_Location :
Sharjah
Print_ISBN :
978-1-4673-2820-3
Type :
conf
DOI :
10.1109/ICCSPA.2013.6487247
Filename :
6487247