Title :
AUT-PFT: A real world printed Farsi text image dataset
Author :
Torabzadeh, Saeed ; Safabaksh, Reza
Author_Institution :
Comput. Eng. Dept., Amirkabir Univ. of Technol., Tehran, Iran
Abstract :
A comprehensive database of printed Farsi text is an essential resource for research in Farsi text recognition. Although some printed Arabic databases exist, they do not have all the features needed for Farsi or Arabic text recognition research. In this paper, we introduce a comprehensive printed Farsi text database called AUT-PFT. The purpose of this database is to provide a large-scale, real-world, multi-font, multi-size corpus for training Farsi or Arabic text recognition systems. The database consists of 10,000 generated words. These words use 127 unique glyphs, chosen so that the glyphs occur with approximately uniform frequency. The words are rendered in 10 widely used Farsi fonts and 4 different font sizes. To introduce real-world noise, all generated images were printed and scanned. Ground truth data are also provided and, unlike other databases, include detailed information about the document text at the glyph level.
Keywords :
character recognition; document image processing; image recognition; natural languages; AUT-PFT; Arabic printed database; Arabic text recognition; Farsi text recognition; glyph level; multifont corpus; multisize corpus; printed Farsi text image dataset; Computers; Databases; Noise; Optical character recognition software; Text recognition; Training; XML; Farsi printed text; database; ground truth
Conference_Title :
Artificial Intelligence and Signal Processing (AISP), 2015 International Symposium on
Conference_Location :
Mashhad
Print_ISBN :
978-1-4799-8817-4
DOI :
10.1109/AISP.2015.7123490