Automatic identification of English, Chinese, Arabic, Devnagari and Bangla script line

Author

Pal, U. ; Chaudhuri, B.B.

Author_Institution

Comput. Vision & Pattern Recognition Unit, Indian Stat. Inst., Calcutta, India

fYear

2001

fDate

2001

Firstpage

790

Lastpage

794

Abstract

In general, a document page may contain several script forms. For optical character recognition (OCR) of such a page, the scripts must be separated before they are fed to their individual OCR systems. An automatic technique is proposed for identifying printed Roman, Chinese, Arabic, Devnagari and Bangla text lines in a single document. Shape-based features, statistical features and some features obtained from the concept of a water reservoir are used for script identification. The proposed scheme has an accuracy of about 97.33%.

Keywords

document image processing; feature extraction; natural languages; optical character recognition; Arabic; Bangla script; Chinese; Devnagari; English; OCR systems; automatic script line identification; automatic technique; document page; printed Roman text; printed text line identification; script forms; shape based features; statistical features; water reservoir; Computer vision; Fractals; Optical character recognition software; Optical devices; Pattern recognition; Probability; Reservoirs; Shape; Water resources; Water storage

fLanguage

English

Publisher

IEEE

Conference_Title

Document Analysis and Recognition, 2001. Proceedings. Sixth International Conference on

Conference_Location

Seattle, WA

Print_ISBN

0-7695-1263-1

Type

conf

DOI

10.1109/ICDAR.2001.953896

Filename

953896