DocumentCode
1697366
Title
Automatic building of Arabic multi dialect text corpora by bootstrapping dialect words
Author
Almeman, K. ; Lee, Minhung
Author_Institution
Sch. of Comput. Sci., Univ. of Birmingham, Birmingham, UK
fYear
2013
Firstpage
1
Lastpage
6
Abstract
The principal objective of this work is to build multi-dialect Arabic text corpora using a web corpus as a resource. A survey was conducted to categorise distinct words and phrases that are used in one specific dialect only and not in other dialects, so that they could be used to download a text corpus for that dialect. From this experiment we obtained 48M tokens from different Arabic dialects. These dialects were categorised into four main groups: Gulf, Levantine, Egyptian and North African, yielding 14.5M, 10.4M, 13M and 10.1M tokens respectively. The total number of distinct types across all the corpora is 2M. In this paper we describe how the corpora were constructed using these distinct words.
Keywords
Internet; building management systems; natural language processing; resource allocation; text analysis; word processing; Egyptian dialects; Gulf dialects; Levantine dialects; North African dialects; Web corpus; automatic Arabic multidialect text corpora building; bootstrapping dialect words; distinct words categorization; specific dialect text corpus; Africa; Context; Encoding; Estimation; Feature extraction; Syntactics; Web pages; Automatic Building; Multi Dialect; Text Corpora;
fLanguage
English
Publisher
IEEE
Conference_Titel
Communications, Signal Processing, and their Applications (ICCSPA), 2013 1st International Conference on
Conference_Location
Sharjah
Print_ISBN
978-1-4673-2820-3
Type
conf
DOI
10.1109/ICCSPA.2013.6487247
Filename
6487247