Title :
High Performance Layout Analysis of Arabic and Urdu Document Images
Author :
Bukhari, Syed Saqib ; Shafait, Faisal ; Breuel, Thomas M.
Author_Institution :
Tech. Univ. of Kaiserslautern, Kaiserslautern, Germany
Abstract :
Text-line extraction and reading order determination are important steps in optical character recognition (OCR) systems. Research in OCR of Arabic script documents has primarily focused on character recognition, and therefore most researchers use primitive methods such as projection profile analysis for text-line extraction. Although projection methods achieve good accuracy on clean, skew-corrected documents, their performance drops under challenging conditions (border noise, skew, complex layouts). This paper presents a robust layout analysis system for extracting text-lines in reading order from scanned Arabic script document images written in different languages (Arabic, Urdu, Persian) and styles (Naskh, Nastaliq). The presented system is based on a suitable combination of well-established techniques for analyzing Latin script documents that have proven to be robust against different types of document image degradations. Evaluation of the presented system on Arabic and Urdu document image datasets, which consist of a variety of complex single- and multi-column layouts, shows high accuracies for text and non-text segmentation, text-line extraction, and reading order determination.
Keywords :
document image processing; image segmentation; natural language processing; optical character recognition; text analysis; Latin script document analysis; Urdu document images; document image degradations; high performance layout analysis; multicolumn layouts; optical character recognition systems; projection profile analysis; reading order determination; robust layout analysis system; scanned Arabic script document images; skew-corrected documents; text line extraction; text segmentation; Accuracy; Image resolution; Image segmentation; Layout; Morphology; Performance evaluation; Text analysis; Document Layout Analysis; Reading Order Determination; Text Image Segmentation; Text-Line Segmentation;
Conference_Title :
Document Analysis and Recognition (ICDAR), 2011 International Conference on
Conference_Location :
Beijing
Print_ISBN :
978-1-4577-1350-7
Electronic_ISBN :
1520-5363
DOI :
10.1109/ICDAR.2011.257