DocumentCode
1583338
Title
Automatic identification of English, Chinese, Arabic, Devnagari and Bangla script line
Author
Pal, U. ; Chaudhuri, B.B.
Author_Institution
Comput. Vision & Pattern Recognition Unit, Indian Stat. Inst., Calcutta, India
fYear
2001
fDate
2001
Firstpage
790
Lastpage
794
Abstract
In a general situation, a document page may contain several script forms. For optical character recognition (OCR) of such a document page, it is necessary to separate the scripts before feeding them to their individual OCR systems. An automatic technique for the identification of printed Roman, Chinese, Arabic, Devnagari and Bangla text lines from a single document is proposed. Shape-based features, statistical features and some features obtained from the concept of a water reservoir are used for script identification. The proposed scheme has an accuracy of about 97.33%.
Keywords
document image processing; feature extraction; natural languages; optical character recognition; Arabic; Bangla script; Chinese; Devnagari; English; OCR systems; automatic script line identification; automatic technique; document page; optical character recognition; printed Roman text; printed text line identification; script forms; shape based features; statistical features; water reservoir; Computer vision; Fractals; Optical character recognition software; Optical devices; Pattern recognition; Probability; Reservoirs; Shape; Water resources; Water storage;
fLanguage
English
Publisher
IEEE
Conference_Titel
Document Analysis and Recognition, 2001. Proceedings. Sixth International Conference on
Conference_Location
Seattle, WA
Print_ISBN
0-7695-1263-1
Type
conf
DOI
10.1109/ICDAR.2001.953896
Filename
953896