• DocumentCode
    3695288
  • Title
    ALIF: A dataset for Arabic embedded text recognition in TV broadcast
  • Author
    Sonia Yousfi;Sid-Ahmed Berrani;Christophe Garcia
  • Author_Institution
    Orange Labs - France Telecom, 35510 Cesson-Sé
  • fYear
    2015
  • Firstpage
    1221
  • Lastpage
    1225
  • Abstract
    This paper presents ALIF, a dataset for Arabic embedded text recognition in TV broadcast. The dataset is publicly available for non-commercial use. It is composed of a large number of manually annotated text images extracted from Arabic TV broadcasts. It is the first public dataset dedicated to the development and evaluation of Arabic video OCR techniques. Text images in the dataset are highly variable in terms of text characteristics (fonts, sizes, colors…) and acquisition conditions (background complexity, low resolution, non-uniform luminosity and contrast…). Moreover, a substantial part of the dataset is finely annotated, i.e. the text in an image is segmented into characters, PAWs (pieces of Arabic words) and words, and each segment is labeled. The dataset can hence be used for both segmentation-based and segmentation-free text recognition techniques. To illustrate how the ALIF dataset can be used, the results of an evaluation study that we conducted on different Arabic text recognition techniques are also presented.
  • Keywords
    "Optical character recognition software","Artificial intelligence","Iron","Metadata","FAA","Chlorine","Artificial neural networks"
  • Publisher
    IEEE
  • Conference_Titel
    Document Analysis and Recognition (ICDAR), 2015 13th International Conference on
  • Type
    conf
  • DOI
    10.1109/ICDAR.2015.7333958
  • Filename
    7333958