Title :
Detecting OOV Names in Arabic Handwritten Data
Author :
Jinying Chen ; Prasad, Ranga ; Huaigu Cao ; Natarajan, Prem
Author_Institution :
Dept. of Speech, Language & Multimedia, Raytheon BBN Technol., Cambridge, MA, USA
Abstract :
This paper presents a novel approach to detecting Arabic out-of-vocabulary (OOV) names in OCR'ed handwritten documents. In our approach, OOV names are searched for using approximate string matching on character consensus networks (cnets). The retrieved regions are reranked using novel features representing the quality of the match and the likelihood that the detected region is an OOV name. Our features that encode word boundary information into the approximate match algorithm significantly improve mean average precision (MAP) over the baseline system, by 12.2% absolute at rank cut-off 100 (48.2% vs. 36.0%) and 11.9% absolute at cut-off 1000 (47.0% vs. 35.1%). Discriminative reranking based on maximum entropy classification using novel features, such as the probability that a retrieved region is an OOV name (called OOV name probability) from a conditional random field model, further improves MAP by 2.3% absolute at cut-off 100 and 3.0% absolute at cut-off 1000. The improvements are consistent in the DET (Detection Error Tradeoff) curves. Our results show that character-cnet-based OOV name search clearly benefits from approximate matching with word boundary information and from the reranking algorithm. Our experiments also show that OOV name probability is very useful for reranking.
Keywords :
document image processing; handwritten character recognition; image matching; maximum likelihood estimation; object detection; Arabic OOV names detection; DET curve; MAP; OCR handwritten documents; OOV name probability; approximate string match; character consensus networks; conditional random field model; detected region likelihood; detection error tradeoff; feature reranking; maximum entropy classification; mean average precision; optical character recognition; out-of-vocabulary name detection; retrieved region probability; word boundary information; Data models; Feature extraction; Hidden Markov models; Lattices; Optical character recognition software; Speech; Training; OOV name detection; consensus network; handwritten; reranking; spotting;
Conference_Title :
Document Analysis and Recognition (ICDAR), 2013 12th International Conference on
Conference_Location :
Washington, DC
DOI :
10.1109/ICDAR.2013.200