• DocumentCode
    3482215
  • Title

    Automatic processing of Arabic text

  • Author

    Osman, Ziad ; Hamandi, L. ; Zantout, Rached ; Sibai, Fadi N.

  • Author_Institution
    Electr. Eng., Beirut Arab Univ., Beirut, Lebanon
  • fYear
    2009
  • fDate
    15-17 Dec. 2009
  • Firstpage
    140
  • Lastpage
    144
  • Abstract
    Automatic recognition of printed and handwritten documents remains an active area of research. Arabic is one of the languages that present special problems. Arabic is cursive and therefore requires a segmentation process to determine the boundaries of each character. Arabic characters can consist of multiple disconnected parts. Dots and diacritics are used in many Arabic characters and can appear above or below the main body of the character. In Arabic, the same letter has up to four different forms, depending on where it appears in the word and on the letters adjacent to it. In this paper, a novel approach to recognizing Arabic script documents is described. The method starts with preprocessing, which involves binarization, noise reduction, and thinning. The text is then segmented into separate lines. Characters are then segmented by determining bifurcation points that are near the baseline. Segmented characters are then compared to prestored templates to identify the best match. The template comparisons are based on central moments, Hu moments, and invariant moments. The method is shown to work satisfactorily for scanned printed Arabic text. The paper concludes with a discussion of the drawbacks of the method and a description of possible solutions.
  • Keywords
    document image processing; feature extraction; image segmentation; optical character recognition; text analysis; Arabic characters; Arabic script documents; Arabic text processing; Hu moments; binarization; central moments; handwritten document recognition; invariant moments; noise reduction; printed document recognition; text segmentation; thinning; Bifurcation; Character recognition; Design engineering; Educational institutions; Feature extraction; Handwriting recognition; Noise reduction; Office automation; Optical character recognition software; Text recognition; Arabic; Feature Extraction; Optical Character Recognition;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Innovations in Information Technology, 2009. IIT '09. International Conference on
  • Conference_Location
    Al Ain
  • Print_ISBN
    978-1-4244-5698-7
  • Type

    conf

  • DOI
    10.1109/IIT.2009.5413793
  • Filename
    5413793
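
The template comparison mentioned in the abstract (central moments and Hu moments) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `central_moment`, `hu_first_two`, and `best_match`, the use of only the first two Hu invariants, and the Euclidean distance between feature vectors are all assumptions made for illustration. Images are assumed to be binary, given as lists of rows of 0/1.

```python
from math import hypot


def central_moment(img, p, q):
    """Central moment mu_pq of a binary image (list of rows of 0/1)."""
    # Collect the (x, y) coordinates of all foreground pixels.
    pts = [(x, y) for y, row in enumerate(img) for x, v in enumerate(row) if v]
    m00 = len(pts)
    if m00 == 0:
        raise ValueError("image has no foreground pixels")
    # Centroid; subtracting it makes the moment translation invariant.
    cx = sum(x for x, _ in pts) / m00
    cy = sum(y for _, y in pts) / m00
    return sum((x - cx) ** p * (y - cy) ** q for x, y in pts)


def hu_first_two(img):
    """First two Hu invariants (phi1, phi2) of a binary image."""
    mu00 = central_moment(img, 0, 0)

    def eta(p, q):
        # Scale-normalized central moment.
        return central_moment(img, p, q) / mu00 ** (1 + (p + q) / 2)

    n20, n02, n11 = eta(2, 0), eta(0, 2), eta(1, 1)
    phi1 = n20 + n02
    phi2 = (n20 - n02) ** 2 + 4 * n11 ** 2
    return (phi1, phi2)


def best_match(glyph, templates):
    """Return the name of the template whose features are closest.

    `templates` maps a hypothetical label to a binary template image.
    Euclidean distance on (phi1, phi2) is an assumed similarity measure.
    """
    g1, g2 = hu_first_two(glyph)

    def distance(name):
        t1, t2 = hu_first_two(templates[name])
        return hypot(g1 - t1, g2 - t2)

    return min(templates, key=distance)


if __name__ == "__main__":
    # Toy shapes standing in for prestored character templates.
    templates = {
        "bar": [[1], [1], [1], [1]],
        "square": [[1, 1], [1, 1]],
    }
    # A shifted, 90-degree-rotated bar should still match "bar".
    glyph = [[0, 0, 0, 0, 0], [0, 1, 1, 1, 1]]
    print(best_match(glyph, templates))
```

Because the features are built from centered and normalized moments, the shifted and rotated bar in the usage example yields the same two invariants as the upright bar template. A practical recognizer would likely use more invariants and a larger template set than this sketch does.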