DocumentCode :
185582
Title :
An annotated Urdu corpus of handwritten text image and benchmarking of corpus
Author :
Choudhary, Prateek ; Nain, N.
Author_Institution :
Nat. Inst. of Technol. Manipur/Comput. Sci. & Eng., Imphal, India
fYear :
2014
fDate :
26-30 May 2014
Firstpage :
1159
Lastpage :
1164
Abstract :
For linguistic research on a language, there is always a need for a large corpus that captures the features of the language, such as grammatical information, writing style, and syntax. A corpus provides a platform for investigating a natural language. Compared with other languages, very little research has been done on Urdu because of its segmentation difficulties and complex character shapes. Very little editable printed text is available in Urdu; most of the data is available only in graphical or image format. To increase Natural Language Processing research on Urdu, there is a need for a large database that covers a range of variation in annotated Urdu handwritten as well as printed text. In this work we propose a large database of Urdu text comprising 1000 handwritten text images written by 500 different writers. Each image would contain four to six lines of Urdu text with 60-80 words per line; the estimated total number of words would be around 0.35 million. Words would be selected from six different categories so that the maximum number of distinct words can be included. The corpus would be annotated for line as well as word segmentation, where a word may be an individual character or component. The corpus would serve as a benchmark for quantitative analysis of handwritten text recognition techniques for Urdu, such as text line extraction, word segmentation, and character recognition, and for linguistic research in part of speech, writer identification, dictionaries, etc.
Keywords :
dictionaries; document image processing; feature extraction; handwritten character recognition; image segmentation; linguistics; natural language processing; optical character recognition; text analysis; visual databases; Urdu language; annotated Urdu corpus benchmarking; character recognition; character shape; dictionary; grammatical information; handwritten text image; handwritten text recognition technique; linguistics related research; natural language processing; part-of-speech; syntax; text line extraction; word segmentation; writer identification; writing style; Benchmark testing; Computers; Databases; Image segmentation; Labeling; Pragmatics; Syntactics; Corpus annotation; Corpus creation; Urdu corpus; Urdu handwritten image database; Urdu language;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Information and Communication Technology, Electronics and Microelectronics (MIPRO), 2014 37th International Convention on
Conference_Location :
Opatija
Print_ISBN :
978-953-233-081-6
Type :
conf
DOI :
10.1109/MIPRO.2014.6859743
Filename :
6859743