• DocumentCode
    657579
  • Title
    Developing an Arabic corpus for event mining
  • Author
    Alasfour, Abdel Alnasser A. ; Trausan-Matu, Stefan
  • Author_Institution
    Comput. Sci. Dept., Politeh. Univ. of Bucharest, Bucharest, Romania
  • fYear
    2013
  • fDate
    11-13 Oct. 2013
  • Firstpage
    21
  • Lastpage
    28
  • Abstract
    Arabic Natural Language Processing (A-NLP) has recently begun to attract more interest. Corpora in general have become a dependable resource for language engineering, including information retrieval, machine translation, and other natural-language-related disciplines. As a result, many Arabic corpora have been developed, and most of them are available online to linguistics researchers. Examples include the Agence France-Presse (AFP) corpus, an Arabic newswire corpus developed by the Linguistic Data Consortium (LDC) [1,8], and the Quranic Arabic Corpus, organized by the University of Leeds [5]. Any objective research in NLP requires a corpus that covers most of the language's patterns across various domains [21]. Over the years, however, new jargon has emerged across the Arabic-speaking states. In this paper, Modern Standard Arabic is used to avoid region-specific Arabic language patterns [1]. The Organization of Islamic Cooperation (OIC) is selected as the main data source. The OIC is the second-largest inter-governmental organization after the United Nations, comprising 57 member states on four continents. Some data is also taken from the International Islamic News Agency (IINA), the information arm of the OIC, which operates as an electronic newspaper with electronic categorization of news documents. In the future, this corpus will become part of a parallel (Arabic-English) corpus. For that reason, we selected sites that provide parallel documents in both Arabic and English.
  • Keywords
    data mining; linguistics; natural language processing; A-NLP; AFP corpus; Agence France-Presse; Arabic corpora; Arabic language patterns; Arabic natural language processing; Arabic newswire; Arabic speaking states; Arabic-English corpus; English language; IINA; International Islamic News Agency; LDC; Linguistic Data Consortium; OIC; Organization of Islamic Cooperation; Quranic Arabic corpus; United Nations; University of Leeds; electronic categorization; electronic newspaper; event mining; inter-governmental organization; jargons; language engineering; modern standard Arabic; news documents; parallel corpus; parallel multilingual document; Educational institutions; HTML; Internet; Natural language processing; Pragmatics; Standards organizations; Web pages; A-NLP; Corpus; Event Mining; Extraction;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    System Theory, Control and Computing (ICSTCC), 2013 17th International Conference
  • Conference_Location
    Sinaia
  • Print_ISBN
    978-1-4799-2227-7
  • Type
    conf
  • DOI
    10.1109/ICSTCC.2013.6688930
  • Filename
    6688930