DocumentCode :
3695228
Title :
The ENP image and ground truth dataset of historical newspapers
Author :
Christian Clausner;Christos Papadopoulos;Stefan Pletschacher;Apostolos Antonacopoulos
Author_Institution :
Pattern Recognition and Image Analysis (PRImA) Research Lab, School of Computing, Science and Engineering, University of Salford, Greater Manchester, United Kingdom
fYear :
2015
Firstpage :
931
Lastpage :
935
Abstract :
This paper presents a research dataset of historical newspapers comprising over 500 page images, uniquely representative of European cultural heritage from the digitisation projects of 12 national and major European libraries, created within the scope of the large-scale digitisation Europeana Newspapers Project (ENP). Every image is accompanied by comprehensive ground truth (Unicode encoded full-text, layout information with precise region outlines, type labels, and reading order) in PAGE format and searchable metadata about document characteristics and artefacts. The first part of the paper describes the nature of the dataset, how it was built, and the challenges encountered. In the second part, a baseline for two state-of-the-art OCR systems (ABBYY FineReader Engine 11 and Tesseract 3.03) is given with regard to both text recognition and segmentation/layout analysis performance.
Keywords :
"Optical character recognition software","Europe","Engines"
Publisher :
IEEE
Conference_Titel :
Document Analysis and Recognition (ICDAR), 2015 13th International Conference on
Type :
conf
DOI :
10.1109/ICDAR.2015.7333898
Filename :
7333898