Title :
Shape and Morphological Transformation Based Features for Language Identification in Indian Document Images
Author :
Hangarge, Mallikarjun ; Dhandra, B.V.
Author_Institution :
P.G.Dept. of Studies & Res. in Comput. Sci., Gulbarga Univ., Gulbarga
Abstract :
In this paper, a technique for language identification in document images is described that discriminates five major Indian languages: Hindi, Marathi, Sanskrit, Assamese and Bengali, which belong to the Devnagari and Bangla scripts. A text block of each language containing at least two text lines is selected and characterized using global and local features. Morphological transformations are used to decompose a text block in two directions at three levels to capture fine texture primitives. Shape features of connected components are used to retain the local properties of the text block. A combination of these features is then used to classify 500 text blocks of these languages with a binary decision tree and a KNN classifier. The proposed method differs from reported methods for non-Indian languages, which are based on shape coding of characters and words, and on document vectorization. This method directly captures word shapes without segmentation and is tolerant to variations in font style and size. The language identification results are encouraging.
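The pipeline outlined in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the abstract does not give the structuring elements, the exact shape features, or the classifier settings. The sketch assumes horizontal and vertical line elements of lengths 2, 3 and 4 as the "two directions at three levels", uses the fraction of foreground pixels surviving a morphological opening as the texture feature, takes mean bounding-box aspect ratio and fill ratio as the connected-component shape features, and sets k = 3 for the nearest-neighbour vote. All function names are hypothetical.

```python
from collections import Counter, deque
import math


def _line_op(img, length, horizontal, erode):
    """Erode or dilate a binary image (list of 0/1 rows) with a line element."""
    h, w = len(img), len(img[0])
    step = 1 if erode else -1  # dilation looks backwards, so erode + dilate is an opening
    out = [[0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            hits = []
            for k in range(length):
                y, x = (r, c + step * k) if horizontal else (r + step * k, c)
                hits.append(0 <= y < h and 0 <= x < w and img[y][x] == 1)
            out[r][c] = int(all(hits) if erode else any(hits))
    return out


def opening(img, length, horizontal):
    """Morphological opening: erosion followed by dilation with the same element."""
    eroded = _line_op(img, length, horizontal, True)
    return _line_op(eroded, length, horizontal, False)


def texture_features(img, lengths=(2, 3, 4)):
    """Assumed global features: share of foreground kept per direction and level."""
    total = sum(map(sum, img))
    feats = []
    for horizontal in (True, False):
        for length in lengths:
            kept = sum(map(sum, opening(img, length, horizontal)))
            feats.append(kept / total if total else 0.0)
    return feats


def component_boxes(img):
    """Return (width, height, pixel_count) for each 8-connected component."""
    h, w = len(img), len(img[0])
    seen = [[False] * w for _ in range(h)]
    comps = []
    for r in range(h):
        for c in range(w):
            if img[r][c] != 1 or seen[r][c]:
                continue
            seen[r][c] = True
            queue, pixels = deque([(r, c)]), []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and img[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            ys = [p[0] for p in pixels]
            xs = [p[1] for p in pixels]
            comps.append((max(xs) - min(xs) + 1, max(ys) - min(ys) + 1, len(pixels)))
    return comps


def shape_features(img):
    """Assumed local features: mean aspect ratio and mean bounding-box fill."""
    comps = component_boxes(img)
    if not comps:
        return [0.0, 0.0]
    aspect = sum(w / h for w, h, _ in comps) / len(comps)
    fill = sum(n / (w * h) for w, h, n in comps) / len(comps)
    return [aspect, fill]


def knn_predict(train, query, k=3):
    """Majority vote among the k training vectors closest in Euclidean distance."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


# Toy text block (1 = ink): one horizontal stroke and one vertical stroke.
block = [
    [1, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 1],
]
features = texture_features(block) + shape_features(block)
```

In practice the combined vector would be computed for each labelled text block and passed to `knn_predict` together with the vector of an unlabelled block; a binary decision tree, the other classifier named in the abstract, is not shown here.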
Keywords :
decision trees; document image processing; natural language processing; text analysis; Indian document images; KNN classifier; binary decision tree; document vectorization; language identification; morphological transformation; non-Indian languages; shape coding; shape features; text block; Character recognition; Classification tree analysis; Decision trees; Frequency; Image recognition; Image segmentation; Natural languages; Optical character recognition software; Shape; Text recognition; Morphological Transformation; Shape; document image; language identification;
Conference_Title :
Emerging Trends in Engineering and Technology, 2008. ICETET '08. First International Conference on
Conference_Location :
Nagpur, Maharashtra
Print_ISBN :
978-0-7695-3267-7
Electronic_ISBN :
978-0-7695-3267-7
DOI :
10.1109/ICETET.2008.177