• DocumentCode
    3742161
  • Title
    A Framework for Compilation of Multi-lingual Handwritten Database: Four Levels XML Ground-Truth
  • Author
    Prakash Choudhary;Neeta Nain;Manindra Nehra
  • Author_Institution
    Dept. of Comput. Sci. &
  • fYear
    2015
  • Firstpage
    649
  • Lastpage
    654
  • Abstract
    This paper presents a semi-automatic framework for annotating document images of multi-lingual handwritten text. There is a significant need for a structure that can annotate the coordinate segmentation information of the text in a handwritten document image and so provide a platform for evaluating OCR algorithms. We describe an XML-based four-level annotation of handwritten text images that contains the ground-truth information of the script text in Unicode format. To support the collection of large amounts of data for linguistic researchers, the structure stores and annotates text at four levels (Image, Lines, Words, and Characters), which aids the benchmarking of various OCR systems. The structure is well suited to compiling an annotated handwritten corpus in a systematic and scientific way: it stores the labelling (markup) information of the script text in Unicode, in an XML file format that encapsulates the bounding-box pixel information of each level in a collaborative manner. The annotations also support various quantitative and statistical corpus approaches to linguistic analysis.
  • Keywords
    "Databases","XML","Image segmentation","Graphical user interfaces","Pragmatics","Data collection","Optical character recognition software"
  • Publisher
    IEEE
  • Conference_Titel
    Signal-Image Technology & Internet-Based Systems (SITIS), 2015 11th International Conference on
  • Type
    conf
  • DOI
    10.1109/SITIS.2015.100
  • Filename
    7400632
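
The four-level annotation described in the abstract (Image, Lines, Words, Characters, each with bounding-box pixel information and Unicode ground truth) can be pictured with a minimal sketch. The record does not give the paper's actual schema, so the element names (`Image`, `Line`, `Word`, `Character`), the attribute names (`x`, `y`, `w`, `h`, `text`, `name`), the helper functions, and the sample file name and coordinates below are all hypothetical and chosen only for illustration.

```python
import xml.etree.ElementTree as ET


def add_level(parent, tag, bbox, text=None):
    """Create an element with a pixel bounding box (x, y, width, height).

    Attribute names are assumptions; the paper's real schema is not shown
    in this record.
    """
    x, y, w, h = bbox
    attrs = {"x": str(x), "y": str(y), "w": str(w), "h": str(h)}
    if parent is None:
        node = ET.Element(tag, attrs)
    else:
        node = ET.SubElement(parent, tag, attrs)
    if text is not None:
        # Ground-truth transcription stored as Unicode text.
        node.set("text", text)
    return node


def build_ground_truth(image_name, image_bbox, lines):
    """Nest the four levels: Image > Line > Word > Character."""
    root = add_level(None, "Image", image_bbox)
    root.set("name", image_name)
    for line_bbox, words in lines:
        line = add_level(root, "Line", line_bbox)
        for word_text, word_bbox, chars in words:
            word = add_level(line, "Word", word_bbox, word_text)
            for char_text, char_bbox in chars:
                add_level(word, "Character", char_bbox, char_text)
    return root


# Hypothetical sample: one line holding one two-character Devanagari word.
sample = build_ground_truth(
    "page_001.png",
    (0, 0, 800, 600),
    [
        (
            (10, 20, 300, 40),
            [
                (
                    "नम",
                    (10, 20, 60, 40),
                    [("न", (10, 20, 30, 40)), ("म", (40, 20, 30, 40))],
                )
            ],
        )
    ],
)

# Serialize to a Unicode string; non-ASCII characters are kept as-is.
xml_text = ET.tostring(sample, encoding="unicode")
```

Nesting each level inside its parent means a word's characters can be looked up with a simple path query such as `sample.findall("./Line/Word/Character")`, which is one plausible way such a file could serve per-level OCR evaluation.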