• DocumentCode
    65912
  • Title

    Aligned-Parallel-Corpora Based Semi-Supervised Learning for Arabic Mention Detection

  • Author

    Zitouni, Imed ; Benajiba, Yassine

  • Author_Institution
    Microsoft, Redmond, WA, USA
  • Volume
    22
  • Issue
    2
  • fYear
    2014
  • fDate
    Feb. 2014
  • Firstpage
    314
  • Lastpage
    324
  • Abstract
    In the last two decades, significant effort has been put into annotating linguistic resources in several languages. Despite this valiant effort, there are still many languages that have only small amounts of such resources. The goal of this article is to present and investigate a method of propagating information (specifically mentions) from a resource-rich language such as English into a relatively less-resourced language such as Arabic. We also compare this approach to its equivalent counterpart using monolingual resources. Part of the investigation is to quantify the contribution of propagating information in different conditions, based on the availability of resources in the target language. Experiments on the language pair Arabic-English show that one can achieve relatively decent performance by propagating information from a language with richer resources such as English into Arabic alone (no resources or models in the source language Arabic). Furthermore, results show that propagated features from English do help improve the Arabic system performance even when used in conjunction with all feature types built from the source language. Experiments also show that using propagated features in conjunction with lexically-derived features only (as can be obtained directly from a mention-annotated corpus) brings the system performance to the level obtained in the target language by using features derived from many linguistic resources, therefore improving the system when such resources are not available.
  • Keywords
    learning (artificial intelligence); natural language processing; aligned parallel corpora based semisupervised learning; mention detection; monolingual resources; resource rich language; source language; Entropy; Feature extraction; IEEE transactions; Pragmatics; Semisupervised learning; Speech; Speech processing; Information extraction; cross-lingual NLP; machine learning; mention detection; natural language processing
  • fLanguage
    English
  • Journal_Title
    Audio, Speech, and Language Processing, IEEE/ACM Transactions on
  • Publisher
    IEEE
  • ISSN
    2329-9290
  • Type

    jour

  • DOI
    10.1109/TASLP.2013.2287055
  • Filename
    6646259