Title :
Layout Analysis of Urdu Document Images
Author :
Shafait, Faisal ; Adnan-ul-Hasan ; Keysers, Daniel ; Breuel, Thomas M.
Author_Institution :
German Res. Center for Artificial Intelligence, Kaiserslautern
Abstract :
Layout analysis is a key component of an OCR system. In this paper, we present a layout analysis system for extracting text-lines in reading order from Urdu document images. For this purpose, we evaluate an existing system for Roman script text on Urdu documents and describe its methods and the main changes necessary to adapt it to Urdu script. The main changes are: 1) the text-line model for Roman script is modified to fit Urdu text, and 2) the reading order of an Urdu document is defined. The method is applied to a collection of scanned Urdu documents from various books, magazines, and newspapers. The results show high text-line detection accuracy on scanned images of Urdu prose and poetry books and magazines. The algorithm also works reasonably well on newspaper images. We also identify directions for future work that may further improve the accuracy of the system.
Keywords :
document image processing; natural language processing; text analysis; OCR system; Roman script; Urdu document images; layout analysis; text-line model; text-lines extraction; Algorithm design and analysis; Books; Character recognition; Image analysis; Image segmentation; Layout; Noise robustness; Optical character recognition software; Pattern analysis; Text analysis
Conference_Title :
Multitopic Conference, 2006. INMIC '06. IEEE
Conference_Location :
Islamabad
Print_ISBN :
1-4244-0795-8
Electronic_ISBN :
1-4244-0795-8
DOI :
10.1109/INMIC.2006.358180