• DocumentCode
    1697366
  • Title
    Automatic building of Arabic multi dialect text corpora by bootstrapping dialect words
  • Author
    Almeman, K. ; Lee, Minhung
  • Author_Institution
    Sch. of Comput. Sci., Univ. of Birmingham, Birmingham, UK
  • fYear
    2013
  • Firstpage
    1
  • Lastpage
    6
  • Abstract
    The principal objective of this work is to build multi-dialect Arabic text corpora using a web corpus as a resource. A survey was conducted to identify distinct words and phrases that are used in one specific dialect only and not in the others, so that text in that dialect could be downloaded as a dialect-specific corpus. From this experiment we obtained 48M tokens from different Arabic dialects. These dialects were categorised into four main groups: Gulf, Levantine, Egyptian and North African, yielding 14.5M, 10.4M, 13M and 10.1M tokens respectively. Across all the corpora there are 2M distinct types. In this paper we describe how the corpora were constructed using these distinct words.
  • Keywords
    Internet; building management systems; natural language processing; resource allocation; text analysis; word processing; Egyptian dialects; Gulf dialects; Levantine dialects; North African dialects; Web corpus; automatic Arabic multidialect text corpora building; bootstrapping dialect words; distinct words categorization; specific dialect text corpus; Africa; Context; Encoding; Estimation; Feature extraction; Syntactics; Web pages; Automatic Building; Multi Dialect; Text Corpora
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Communications, Signal Processing, and their Applications (ICCSPA), 2013 1st International Conference on
  • Conference_Location
    Sharjah
  • Print_ISBN
    978-1-4673-2820-3
  • Type
    conf
  • DOI
    10.1109/ICCSPA.2013.6487247
  • Filename
    6487247