• DocumentCode
    185582
  • Title

    An annotated Urdu corpus of handwritten text image and benchmarking of corpus

  • Author

    Choudhary, Prateek ; Nain, N.

  • Author_Institution
    Nat. Inst. of Technol. Manipur/Comput. Sci. & Eng., Imphal, India
  • fYear
    2014
  • fDate
    26-30 May 2014
  • Firstpage
    1159
  • Lastpage
    1164
  • Abstract
    Linguistic research on a language requires a large corpus that captures all of its features, such as grammatical information, writing style and syntax. A corpus provides a platform for investigating a natural language. Compared with other languages, very little research has been done on Urdu, owing to its segmentation difficulties and complex character shapes. Very little editable printed text is available in Urdu; most of the data is available only in graphical or picture format. To increase Natural Language Processing research on Urdu, there is a need for a large database that covers a range of variation in annotated Urdu handwritten as well as printed text. In this work we propose a large database of Urdu text comprising 1000 handwritten text images written by 500 different writers. Each image would contain four to six lines of Urdu text with 60-80 words per line; the estimated total number of words would be around 0.35 million. Words would be selected from six different categories so that the maximum number of distinct words is included. The corpus would be annotated for line as well as word segmentation, where a word may be an individual character or component. The corpus would serve as a benchmark for quantitative analysis of Handwritten Text Recognition techniques for Urdu, such as text line extraction, word segmentation and character recognition, and for linguistic research in areas such as part of speech, writer identification and dictionaries.
  • Keywords
    dictionaries; document image processing; feature extraction; handwritten character recognition; image segmentation; linguistics; natural language processing; optical character recognition; text analysis; visual databases; Urdu language; annotated Urdu corpus benchmarking; character recognition; character shape; dictionary; grammatical information; handwritten text image; handwritten text recognition technique; linguistics related research; natural language processing; part-of-speech; syntax; text line extraction; word segmentation; writer identification; writing style; Benchmark testing; Computers; Databases; Image segmentation; Labeling; Pragmatics; Syntactics; Corpus annotation; Corpus creation; Urdu corpus; Urdu handwritten image database; Urdu language;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Information and Communication Technology, Electronics and Microelectronics (MIPRO), 2014 37th International Convention on
  • Conference_Location
    Opatija
  • Print_ISBN
    978-953-233-081-6
  • Type

    conf

  • DOI
    10.1109/MIPRO.2014.6859743
  • Filename
    6859743