DocumentCode :
3563324
Title :
Word Level Script and Language Identification for Unconstrained Handwritten Document Images
Author :
Prasanthkumar, P.V. ; Dileesh, E.D.
Author_Institution :
Comput. Sci. & Eng. Gov. Eng. Coll., Thrissur, India
fYear :
2014
Firstpage :
14
Lastpage :
18
Abstract :
Word-level script and language identification is the process of determining the script and language of each word in a printed or handwritten multi-script document. It is an essential part of a multilingual Optical Character Recognizer (OCR). Most OCR systems are designed for a single script, so they cannot convert a document written in more than one script. This paper describes a system that automatically separates, down to the word level, the scripts in an unconstrained handwritten document that mixes three Indian scripts with Roman-script English. The process starts by extracting text lines from the document and then separates the words in each text line using projection profiles. From each word, 15 connected-component and morphological features and 32 Gabor filter features are extracted to form a feature set. A total of 15699 words were separated from 215 documents for training and 4856 words from 64 documents for testing. Three classifiers, Support Vector Machine (SVM), Multilayer Perceptron (MLP), and K-Nearest Neighbors (KNN), are used to test the discriminating power of the feature set. The MLP classifier outperforms the others, with a cross-validation accuracy of 81.69% across the four scripts.
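The projection-profile step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a binarized image stored as a list of rows of 0/1 values (1 = ink), and the function names `runs_of_ink` and `segment_words` and the gap thresholds `line_gap` and `word_gap` are hypothetical, since the abstract gives no such details.

```python
def runs_of_ink(profile, min_gap=1):
    """Return (start, end) pairs (end exclusive) of runs where profile > 0.

    Runs separated by fewer than min_gap empty positions are merged.
    """
    runs = []
    start = None
    gap = 0
    for i, value in enumerate(profile):
        if value > 0:
            if start is None:
                start = i
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                # The run ended where the current gap began.
                runs.append((start, i - gap + 1))
                start = None
                gap = 0
    if start is not None:
        # Close a run that reaches the end, dropping any trailing gap.
        runs.append((start, len(profile) - gap))
    return runs


def segment_words(image, line_gap=1, word_gap=2):
    """Split a binary image into word boxes (top, bottom, left, right).

    Assumed thresholds: line_gap empty rows separate text lines,
    word_gap empty columns separate words within a line.
    """
    width = len(image[0]) if image else 0
    # Horizontal projection: ink count per row locates the text lines.
    row_profile = [sum(row) for row in image]
    boxes = []
    for top, bottom in runs_of_ink(row_profile, line_gap):
        # Vertical projection restricted to this line locates the words.
        col_profile = [
            sum(image[r][c] for r in range(top, bottom)) for c in range(width)
        ]
        for left, right in runs_of_ink(col_profile, word_gap):
            boxes.append((top, bottom, left, right))
    return boxes


page = [
    [0, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 1, 1, 0],
    [1, 0, 0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0, 0, 0],
]
word_boxes = segment_words(page)
```

On real unconstrained handwriting, skewed or touching lines would likely need more than fixed gap thresholds; the sketch only shows the basic idea of cutting at valleys in the row and column ink counts.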
Keywords :
authoring languages; document image processing; multilayer perceptrons; natural language processing; optical character recognition; support vector machines; Gabor filter feature; K-Nearest Neighbors; KNN classifiers; MLP; OCR; Roman script English; SVM; handwritten multiscript document; language identification; morphological features; multilayer perceptron; multilingual optical character recognizer; support vector machine; text line extraction; unconstrained handwritten document images; word level script; Accuracy; Conferences; Feature extraction; Optical character recognition software; Pattern recognition; Support vector machines; Testing; Handwritten documents; Optical Character Recognition; Word level script identification;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Eco-friendly Computing and Communication Systems (ICECCS), 2014 3rd International Conference on
Print_ISBN :
978-1-4799-7003-2
Type :
conf
DOI :
10.1109/Eco-friendly.2014.78
Filename :
7208958