DocumentCode
3742161
Title
A Framework for Compilation of Multi-lingual Handwritten Database: Four Levels XML Ground-Truth
Author
Prakash Choudhary;Neeta Nain;Manindra Nehra
Author_Institution
Dept. of Comput. Sci. &
fYear
2015
Firstpage
649
Lastpage
654
Abstract
In this paper, we present a semi-automatic framework for annotating multi-lingual handwritten text document images. There is a significant need for a structure that can annotate the coordinate segmentation information of the text present in a handwritten document image, providing a platform for OCR algorithm evaluation. We describe an XML-based four-level annotation of handwritten text images that contains the ground-truth information of the script text image in Unicode format. To collect large amounts of data for linguistic researchers, the structure provides a way to store and annotate at four different levels: Image, Lines, Words and Characters, which aids the benchmarking of various OCRs. The structure would be the best source for compiling an annotated handwritten corpus in a systematic and scientific way, storing the labelling (markup) information of image script texts in Unicode and in an XML file format that encapsulates the bounding-box pixel information of each level in a collaborative manner. The structure provides useful results based on the annotation for various quantitative and statistical corpus approaches to linguistic analysis.
Keywords
"Databases","XML","Image segmentation","Graphical user interfaces","Pragmatics","Data collection","Optical character recognition software"
Publisher
ieee
Conference_Titel
Signal-Image Technology & Internet-Based Systems (SITIS), 2015 11th International Conference on
Type
conf
DOI
10.1109/SITIS.2015.100
Filename
7400632