Title :
Identification of Japanese and English Script from a Single Document Page
Author :
Chanda, S. ; Pal, U. ; Kimura, F.
Author_Institution :
Indian Stat. Inst., Kolkata
Abstract :
In Japanese documents, a single text line of a page may contain both Japanese and English script. For optical character recognition (OCR) of such a page, it is better to first identify the Japanese and English portions and then apply the OCR engine for each script to its identified portions, which gives higher OCR accuracy. This paper proposes an automatic technique for identifying Japanese and English script portions within a single line of a printed document page. To the best of our knowledge, this is the first work of its kind. The document is first segmented into lines, and the lines are then segmented into characters. In the proposed scheme, individual scripts are identified using a combination of features obtained from the structural shape of characters, pitch information, topological properties, the water reservoir concept, and others. In an experiment on 11304 characters, the proposed scheme achieved 98.79% identification accuracy.
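The line-then-character segmentation step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say how the segmentation is done, so the projection-profile approach below is an assumption chosen for illustration, not the authors' method. The function names (`runs`, `segment_lines`, `segment_characters`) and the toy `page` are hypothetical.

```python
# Illustrative sketch only: projection-profile segmentation is an assumed
# technique; the abstract does not specify the segmentation method.

def runs(profile):
    """Return (start, end) spans of consecutive non-zero entries; end is exclusive."""
    spans, start = [], None
    for i, value in enumerate(profile):
        if value and start is None:
            start = i                      # a run of ink begins
        elif not value and start is not None:
            spans.append((start, i))       # the run ended at the previous index
            start = None
    if start is not None:                  # run reaches the end of the profile
        spans.append((start, len(profile)))
    return spans


def segment_lines(image):
    """Split a binary image (list of rows, 1 = ink) into text-line row ranges."""
    row_profile = [sum(row) for row in image]
    return runs(row_profile)


def segment_characters(image, line):
    """Split one text line (a row range) into character column ranges."""
    top, bottom = line
    width = len(image[0]) if image else 0
    col_profile = [sum(image[r][c] for r in range(top, bottom))
                   for c in range(width)]
    return runs(col_profile)


# Hypothetical toy page: two text lines separated by a blank row.
page = [
    [0, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 1, 0, 1, 1],
    [1, 0, 0, 1, 0, 0, 1],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 0, 0],
]
lines = segment_lines(page)
chars = [segment_characters(page, line) for line in lines]
```

A plain column profile splits any character whose strokes are separated by a blank column, which is common in Japanese text, so a practical system would need an additional merging rule. Each resulting character span would then be passed to the script-identification features listed in the abstract.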
Keywords :
natural language processing; optical character recognition; text analysis; English script identification; Japanese document page; Japanese script identification; document segmentation; printed document page text line; Character recognition; Computer vision; Optical character recognition software; Reservoirs; Structural shapes; Support vector machine classification; Support vector machines; Testing; Text recognition; Water resources
Conference_Title :
Computer and Information Technology, 2007. CIT 2007. 7th IEEE International Conference on
Conference_Location :
Aizu-Wakamatsu, Fukushima
Print_ISBN :
978-0-7695-2983-7
DOI :
10.1109/CIT.2007.109