Title :
Extracting information from handwritten content in census forms
Author :
Huaigu Cao ; Subramanian, Kartick ; Xujun Peng ; Jinying Chen ; Prasad, Ranga ; Natarajan, Prem
Author_Institution :
Raytheon BBN Technologies, Cambridge, MA, USA
Abstract :
In this paper, we describe our approach to extracting salient information from US census form images. These forms present several challenges, including variations in individual form templates, skew, writing device, and writing style. We describe an innovative registration algorithm, robust to scale variations, that segments the input image into cells. Following registration, the cell borders are removed with a shape-based rule-line removal algorithm so that the handwritten content of each cell can be extracted. Finally, the individual cell images are recognized using a hidden Markov model (HMM) OCR system with language models biased toward the type of information in the cell, such as person name, place name, number, marital status, gender, or race.
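The final step of the abstract, recognition with language models biased toward the field type, can be illustrated with a minimal sketch. It is not the authors' system: an HMM recognizer would normally apply the language model during decoding, whereas this example swaps in simple n-best rescoring. The field names, the toy lexicon probabilities, the `rescore` function, and its `lm_weight` and `floor` parameters are all hypothetical.

```python
import math

# Hypothetical per-field unigram lexicons with toy probabilities.
# None of these values come from the paper.
FIELD_LEXICONS = {
    "gender": {"M": 0.5, "F": 0.5},
    "marital_status": {"S": 0.4, "M": 0.4, "Wd": 0.15, "D": 0.05},
}


def rescore(field_type, hypotheses, lm_weight=1.0, floor=1e-6):
    """Return the text of the best hypothesis for one cell.

    hypotheses: list of (text, optical_log_prob) pairs, e.g. an n-best
    list from a recognizer. Each hypothesis is scored as its optical
    log-probability plus a weighted log-prior from the lexicon of the
    cell's field type. Words outside the lexicon get the small `floor`
    probability instead of zero, so they are penalized but not excluded.
    """
    lexicon = FIELD_LEXICONS[field_type]

    def total_score(hypothesis):
        text, optical_log_prob = hypothesis
        prior = lexicon.get(text, floor)
        return optical_log_prob + lm_weight * math.log(prior)

    best_text, _ = max(hypotheses, key=total_score)
    return best_text


if __name__ == "__main__":
    # The optical model alone slightly prefers "N", which is not a valid
    # gender entry; the field-specific prior should override that.
    print(rescore("gender", [("N", -1.0), ("M", -1.2)]))
```

The same optical hypotheses can resolve differently depending on the cell's field type, which is the point of biasing the language model per field.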
Keywords :
hidden Markov models; image registration; image segmentation; optical character recognition; HMM; OCR system; US census form images; gender; handwritten content extraction; hidden Markov model; individual cell images; individual form templates; innovative registration algorithm; input image segmentation; language models; marital status; numbers; person name; place name; race; salient information extraction; shape-based rule-line removal algorithm; skew; writing device; writing style; Data mining; Handwriting recognition; Hidden Markov models; Image recognition; Optical character recognition software; Writing;
Conference_Title :
Pattern Recognition (ICPR), 2012 21st International Conference on
Conference_Location :
Tsukuba
Print_ISBN :
978-1-4673-2216-4