Title :
Exploiting Stroke Orientation for CRF Based Binarization of Historical Documents
Author :
Xujun Peng; Huaigu Cao; Kartick Subramanian; Ranga Prasad; Prem Natarajan
Author_Institution :
Raytheon BBN Technologies, Cambridge, MA, USA
Abstract :
We present a novel binarization method that is especially effective on historical documents with the following characteristics: (a) free-form cursive handwritten text with significant but consistent slant, (b) scanning artifacts that leave text and background pixels with non-uniform intensity even within the same page, and (c) a significant amount of bleed-through from the other side of the page. To tackle the problem of non-uniform text and background intensity, we use a thresholding algorithm that works equally well for regions of the page containing text and regions containing no text. We then combine this algorithm with a CRF-based framework that handles bleed-through using a novel approach, further improving the quality of binarization. We compare the proposed algorithm against other popular binarization algorithms both qualitatively, using examples, and quantitatively, using the word error rate (WER) obtained by performing optical character recognition (OCR) on the binarized text with the BBN Byblos offline handwritten text recognition (OHR) system.
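For readers unfamiliar with the evaluation metric, the following is a minimal sketch of how a word error rate is conventionally computed: the word-level edit distance (substitutions, deletions, insertions) between a reference transcription and the recognizer output, divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration; this is not the scoring code used with the BBN Byblos system.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Illustrative sketch only: tokenization is plain whitespace splitting,
    which is an assumption, not the scoring procedure from the paper.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)
```

Because insertions count as errors, the value can exceed 1.0 when the recognizer emits many spurious words; lower is better, and a binarization method that preserves strokes while suppressing bleed-through would be expected to lower it.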
Keywords :
document image processing; handwritten character recognition; history; image resolution; image segmentation; optical character recognition; statistical analysis; text detection; BBN Byblos OHR system; BBN Byblos offline handwritten text recognition system; CRF based binarization method; CRF-based framework; OCR; WER metric; artifact scanning; background pixel intensity; binarization quality; binarized text; bleeds; free-form cursive handwritten text; historical documents; nonuniform background intensity problem; nonuniform text intensity problem; optical character recognition; stroke orientation; text containing page regions; text pixel intensity; thresholding algorithm; word error rate metric; Feature extraction; Hidden Markov models; Ink; Measurement; Optical character recognition software; Text analysis;
Conference_Title :
Document Analysis and Recognition (ICDAR), 2013 12th International Conference on
Conference_Location :
Washington, DC
DOI :
10.1109/ICDAR.2013.207