DocumentCode :
2347754
Title :
Realization of a high performance bilingual OCR system for Thai-English printed documents
Author :
Tangwongsan, Supachai ; Suvacharakulton, Buntida
Author_Institution :
Fac. of Inf. & Commun. Technol., Mahidol Univ., Bangkok, Thailand
fYear :
2010
fDate :
21-23 Aug. 2010
Firstpage :
1
Lastpage :
6
Abstract :
This paper presents a high-performance bilingual OCR system for printed Thai and English text. Given the complex nature of both the Thai and English languages, the first stage identifies the language of each zone by using geometric properties to differentiate them. The second stage performs character recognition using a feature extractor and a classifier. In feature extraction, the thinned character image is analyzed and categorized into groups. The classifier then recognizes characters in two steps: a coarse level, followed by a fine level guided by decision trees. To improve the result further, the final stage uses dictionary look-up to raise overall accuracy. For verification, the system was tested in a series of experiments on 141 pages of printed documents containing over 280,000 characters. The results show that, on average, the system achieves an accuracy of 100% on Thai monolingual documents, 98.18% on English monolingual documents, and 99.85% on bilingual documents. With dictionary look-up in the final stage, accuracy on bilingual documents improves to as much as 99.98%, as expected.
Keywords :
feature extraction; geometry; natural language processing; optical character recognition; pattern classification; Thai-English printed documents; bilingual OCR system; character recognition; classifier; decision trees; feature extraction; geometric properties; Accuracy; Character recognition; Decision trees; Dictionaries; Feature extraction; Optical character recognition software; Testing; Thai-English character recognition; bilingual OCR; dictionary look-up; language identification;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Natural Language Processing and Knowledge Engineering (NLP-KE), 2010 International Conference on
Conference_Location :
Beijing
Print_ISBN :
978-1-4244-6896-6
Type :
conf
DOI :
10.1109/NLPKE.2010.5587781
Filename :
5587781