DocumentCode :
3742161
Title :
A Framework for Compilation of Multi-lingual Handwritten Database: Four Levels XML Ground-Truth
Author :
Prakash Choudhary;Neeta Nain;Manindra Nehra
Author_Institution :
Dept. of Comput. Sci. &
fYear :
2015
Firstpage :
649
Lastpage :
654
Abstract :
In this paper, we present a semi-automatic framework for annotating multi-lingual handwritten text document images. There is a significant need for a structure that can annotate the coordinate segmentation information of the text present in a handwritten document image, providing a platform for OCR algorithm evaluation. We describe an XML-based four-level annotation of handwritten text images that contains the ground-truth information of the script text in Unicode format. To collect the large amount of data needed by linguistic researchers, the structure provides a way to store and annotate at four different levels: Image, Lines, Words and Characters, which aids the benchmarking of various OCRs. The structure would be the best source for compiling an annotated handwritten corpus in a systematic and scientific way: it stores the labelling (markup) information of the image script text in Unicode, in an XML file format that encapsulates the bounding-box pixel information of each level in a collaborative manner. The structure provides useful results based on the annotation for various quantitative and statistical corpus approaches to linguistic analysis.
Keywords :
"Databases","XML","Image segmentation","Graphical user interfaces","Pragmatics","Data collection","Optical character recognition software"
Publisher :
IEEE
Conference_Titel :
Signal-Image Technology & Internet-Based Systems (SITIS), 2015 11th International Conference on
Type :
conf
DOI :
10.1109/SITIS.2015.100
Filename :
7400632