DocumentCode :
153367
Title :
Adapting Tesseract for Complex Scripts: An Example for Urdu Nastalique
Author :
Akram, Qurat Ul Ain ; Hussain, Sarmad ; Niazi, Aneeta ; Anjum, Umair ; Irfan, Faheem
Author_Institution :
Center for Language Eng., Univ. of Eng. & Technol., Lahore, Pakistan
fYear :
2014
fDate :
7-10 April 2014
Firstpage :
191
Lastpage :
195
Abstract :
The Tesseract engine supports multilingual text recognition. However, recognizing cursive scripts with Tesseract is challenging. In this paper, the Tesseract engine is analyzed and modified to recognize the Nastalique writing style for the Urdu language, a highly complex, cursive writing style of the Arabic script. The original Tesseract system achieves accuracies of 65.59% and 65.84% for font sizes 14 and 16, respectively, whereas the modified system, with a reduced search space, achieves 97.87% and 97.71%, respectively. Recognition time also improves, from an average of 170 milliseconds (ms) to an average of 84 ms for Nastalique document images.
Keywords :
document image processing; handwritten character recognition; natural language processing; Arabic script; Nastalique document images; Nastalique writing style; Tesseract engine; Urdu language; complex scripts; cursive scripts recognition; multilingual text recognition; Accuracy; Character recognition; Engines; Optical character recognition software; Shape; Text recognition; Writing; Nastalique; OCR; Tesseract; Urdu;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Document Analysis Systems (DAS), 2014 11th IAPR International Workshop on
Conference_Location :
Tours
Print_ISBN :
978-1-4799-3243-6
Type :
conf
DOI :
10.1109/DAS.2014.45
Filename :
6830996