DocumentCode :
3222350
Title :
Page layout analyser for multilingual Indian documents
Author :
Chaudhuri, A. Ray ; Mandal, A.K. ; Chaudhuri, B.B.
Author_Institution :
Comput. Vision & Pattern Recognition Unit, Indian Stat. Inst., Kolkata, India
fYear :
2002
fDate :
13-15 Dec. 2002
Firstpage :
24
Lastpage :
32
Abstract :
An advanced Optical Character Recognition (OCR) system is equipped with a page layout analyser module. It separates textual zones from non-textual zones. It identifies textual blocks in multicolumn documents and groups them into homogeneous regions in terms of geometric shape and spatial distribution. All existing OCR modules developed for various Indian scripts can handle only text-only, single-column documents. In this paper, a page layout analyser that uses common features present in most Indian scripts is introduced. A simple compatibility criterion that allows various degrees of homogeneity is defined. The page analyser is robust in the sense that it can distinguish text regions from non-textual entities such as images, rulers, and noise due to smudges and poor paper quality. Test results are shown for the two most popular Indian scripts, Devnagari (Hindi) and Bangla.
Keywords :
optical character recognition; Bangla; Devnagari; Hindi; advanced Optical Character Recognition system; compatibility criterion; multicolumn documents; multilingual Indian documents; page layout analyser; single-column documents; textual blocks; textual zones; Character recognition; Geometrical optics; Optical character recognition software; Optical sensors; Robustness; Shape; Testing;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Language Engineering Conference, 2002. Proceedings
Print_ISBN :
0-7695-1885-0
Type :
conf
DOI :
10.1109/LEC.2002.1182287
Filename :
1182287