• DocumentCode
    1908463
  • Title

    New Language Resources for Arabic: Corpus Containing More Than Two Million Words and a Corpus Processing Tool

  • Author

    Al-Thubaity, Abdulmohsen ; Khan, Mahrukh ; Al-Mazrua, Manal ; Al-Mousa, Maram

  • Author_Institution
    Comput. Res. Inst., King Abdulaziz City for Sci. & Technol., Riyadh, Saudi Arabia
  • fYear
    2013
  • fDate
    17-19 Aug. 2013
  • Firstpage
    67
  • Lastpage
    70
  • Abstract
    Arabic is a resource-poor language relative to other languages with a similar number of speakers. This situation negatively affects corpus-based linguistic studies in Arabic and, to a lesser extent, Arabic language processing. This paper presents a brief overview of recent freely available Arabic corpora and corpora processing tools, and it examines some of the issues that may be preventing Arabic linguists from using them. These issues reveal the need for new language resources to enrich and foster Arabic corpus-based studies. Accordingly, this paper introduces the design of a new Arabic corpus that includes modern standard Arabic varieties based on newspapers from all Arab countries and that comprises more than two million words. It also describes the main features of a corpus processing tool specifically designed for Arabic, called "Khawas غواص" ("diver" in English). Khawas provides more features than any other freely available corpus processing tool for Arabic, including n-gram frequency and concordance, collocations, and statistical comparison of two corpora. Finally, we outline modifications and improvements that could be made in future work.
  • Keywords
    linguistics; natural language processing; publishing; statistical analysis; text analysis; Arab countries; Arabic corpora processing tools; Arabic corpus-based studies; Arabic language processing; Arabic linguists; Khawas; collocations; concordance; corpus-based linguistic studies; language resources; n-gram frequency; newspapers; resource-poor language; standard Arabic varieties; statistical comparison; Availability; Communities; Educational institutions; Internet; Pragmatics; Text categorization; Writing; Arabic concordance; Arabic corpora; N-grams; collocation; corpora comparison
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Asian Language Processing (IALP), 2013 International Conference on
  • Conference_Location
    Urumqi
  • Type

    conf

  • DOI
    10.1109/IALP.2013.21
  • Filename
    6646005