Title :
The BBN Byblos Japanese OCR system
Author :
MacRostie, Ehry ; Natarajan, Premkumar ; Decerbo, Michael ; Prasad, Rohit
Author_Institution :
Dept. Speech & Language Process., BBN Technol., Cambridge, MA, USA
Abstract :
The BBN Byblos OCR system implements a script-independent methodology for OCR using hidden Markov models (HMMs). We have successfully ported the system to Arabic, Pashto, English, and Chinese. We discuss our effort in configuring the system to perform recognition of noisy machine-printed Japanese documents. The data for our experiments were taken from the University of Washington (UW-II) Japanese OCR corpus and the LDC Japanese Business News Supplement corpus. We evaluated the performance of a whole-character configuration in which each character was modeled using a separate HMM. As in the case of our Chinese OCR system [P. Natarajan et al., 2001], we also used a sub-character modeling approach [P. Natarajan et al., 2003] in which each Japanese character was spelled using a shared set of automatically generated sub-characters. We experimentally evaluated the performance of different sub-character clusters as well as different HMM topologies to identify the best overall system configuration. On a fair test using noisy/degraded images from the UW-II corpus, the best sub-character configuration resulted in a character error rate of 20.13%. On relatively cleaner data, consisting of scanned newspaper images, the system delivered an error rate of 7.85%. Using a whole-character configuration, the corresponding error rates were 11.94% and 4.55%, respectively.
Keywords :
hidden Markov models; natural languages; optical character recognition; BBN Byblos Japanese OCR system; HMM topologies; LDC Japanese Business News Supplement corpus; University of Washington Japanese OCR corpus; noisy machine printed Japanese document recognition; script-independent methodology; subcharacter modeling approach; Character generation; Character recognition; Error analysis; Feature extraction; Hidden Markov models; Image segmentation; Natural languages; Optical character recognition software; Pattern recognition; Speech processing
Conference_Title :
Pattern Recognition, 2004. ICPR 2004. Proceedings of the 17th International Conference on
Print_ISBN :
0-7695-2128-2
DOI :
10.1109/ICPR.2004.1334341