An annotated Urdu corpus of handwritten text image and benchmarking of corpus

Author

Choudhary, Prateek ; Nain, N.

Author_Institution

Nat. Inst. of Technol. Manipur/Comput. Sci. & Eng., Imphal, India

fYear

2014

fDate

26-30 May 2014

Firstpage

1159

Lastpage

1164

Abstract

Linguistic research on a language requires a large corpus that captures the features of that language, such as grammatical information, writing style, and syntax. A corpus provides a platform for investigating a natural language. Compared with other languages, very little research has been done on Urdu because of its segmentation difficulties and complex character shapes. Very little editable printed text is available in Urdu; most of the data exists only in graphical or picture format. To increase Natural Language Processing research on Urdu, there is a need for a large database containing a wide range of variation in annotated Urdu handwritten as well as printed text. In this work we propose a large database of Urdu text comprising 1000 handwritten text images written by 500 different writers. Each image would contain four to six lines of Urdu text with 60-80 words per line, so the estimated total would be around 0.35 million words. Words would be selected from six different categories so that the maximum number of distinct words can be included. The corpus would be annotated for both line and word segmentation, where a word may be an individual character or component. The corpus would serve as a benchmark for quantitative analysis of handwritten text recognition techniques for Urdu, such as text line extraction, word segmentation, and character recognition, and for linguistic research on part of speech, writer identification, dictionaries, etc.
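The word-count estimate in the abstract can be checked with a quick calculation. The sketch below is illustrative only: the image, line, and word figures come from the abstract, while the function name and the use of range midpoints (5 lines per image, 70 words per line) are assumptions made here, not something the abstract states.

```python
# Figures quoted in the abstract.
NUM_IMAGES = 1000
LINES_PER_IMAGE = (4, 6)    # four to six lines per image
WORDS_PER_LINE = (60, 80)   # 60-80 words per line


def estimate_words(num_images, lines_range, words_range):
    """Return (low, mid, high) word-count estimates.

    The midpoint estimate assumes the average image sits at the
    middle of both ranges; this is an assumption for illustration.
    """
    low = num_images * lines_range[0] * words_range[0]
    high = num_images * lines_range[1] * words_range[1]
    mid_lines = sum(lines_range) / 2
    mid_words = sum(words_range) / 2
    mid = num_images * mid_lines * mid_words
    return low, mid, high


low, mid, high = estimate_words(NUM_IMAGES, LINES_PER_IMAGE, WORDS_PER_LINE)
print(f"low={low}, mid={mid:.0f}, high={high}")
```

Under these assumptions the midpoint works out to 350,000 words, which matches the figure of around 0.35 million, with the full range running from 240,000 to 480,000.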

Keywords

dictionaries; document image processing; feature extraction; handwritten character recognition; image segmentation; linguistics; natural language processing; optical character recognition; text analysis; visual databases; Urdu language; annotated Urdu corpus benchmarking; character recognition; character shape; dictionary; grammatical information; handwritten text image; handwritten text recognition technique; linguistics related research; part-of-speech; syntax; text line extraction; word segmentation; writer identification; writing style; Benchmark testing; Computers; Databases; Labeling; Pragmatics; Syntactics; Corpus annotation; Corpus creation; Urdu corpus; Urdu handwritten image database

fLanguage

English

Publisher

IEEE

Conference_Titel

Information and Communication Technology, Electronics and Microelectronics (MIPRO), 2014 37th International Convention on

Conference_Location

Opatija

Print_ISBN

978-953-233-081-6

Type

conf

DOI

10.1109/MIPRO.2014.6859743

Filename

6859743