Title :
Script and language identification from document images
Author :
Peake, G.S. ; Tan, T.N.
Author_Institution :
Dept. of Comput. Sci., Reading Univ., UK
Abstract :
In this paper we present a detailed review of current script and language identification techniques. The main criticism of the existing techniques is that most of them rely on either connected component analysis or character segmentation. We go on to present a new method for script identification, based on texture analysis, which does not require character segmentation. A uniform text block on which texture analysis can be performed is produced from a document image via simple processing. Multi-channel (Gabor) filters and grey level co-occurrence matrices are used in independent experiments to extract texture features. Test documents are classified against the features of training documents using the K-NN classifier. Initial results of over 95% accuracy on the classification of 105 test documents from 7 scripts are very promising. The method is robust to noise and to the presence of foreign characters or numerals, and can be applied to very small amounts of text.
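The co-occurrence variant of the pipeline described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`glcm`, `glcm_features`, `knn_classify`), the number of grey levels, the two pixel offsets, and the choice of energy and contrast as features are all assumptions made for illustration, since the abstract does not specify them. The input is assumed to be an already normalised text block quantised to integer grey levels.

```python
import numpy as np

def glcm(image, dx=1, dy=0, levels=4):
    """Normalised grey level co-occurrence matrix for one offset (dx, dy).

    Assumes integer pixel values in [0, levels) and non-negative offsets.
    """
    img = np.asarray(image)
    h, w = img.shape
    ref = img[:h - dy, :w - dx]   # reference pixels
    nbr = img[dy:, dx:]           # neighbours at the given offset
    counts = np.zeros((levels, levels), dtype=float)
    np.add.at(counts, (ref.ravel(), nbr.ravel()), 1.0)
    return counts / counts.sum()

def glcm_features(image, levels=4):
    """Energy and contrast for a horizontal and a vertical offset (assumed choice)."""
    feats = []
    for dx, dy in [(1, 0), (0, 1)]:
        p = glcm(image, dx, dy, levels)
        i, j = np.indices(p.shape)
        feats.append((p ** 2).sum())             # energy
        feats.append((p * (i - j) ** 2).sum())   # contrast
    return np.array(feats)

def knn_classify(train_feats, train_labels, query, k=1):
    """Majority vote among the k training vectors nearest to the query (Euclidean)."""
    dists = np.linalg.norm(np.asarray(train_feats) - query, axis=1)
    nearest = np.argsort(dists)[:k]
    labels, counts = np.unique(np.asarray(train_labels)[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Toy stand-ins for text blocks: two synthetic stripe textures, not real scripts.
vertical = np.tile([0, 3], (8, 4))
horizontal = vertical.T
train = [glcm_features(vertical), glcm_features(horizontal)]
query = glcm_features(np.tile([0, 3], (6, 3)))
print(knn_classify(train, ["vertical", "horizontal"], query))
```

The stripe images only show that the features separate textures with different directional structure; real use would require the text-block normalisation step and training documents per script, as the abstract describes.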
Keywords :
document image processing; feature extraction; image classification; image texture; Gabor filters; K-NN classifier; document images; extract texture features; grey level co-occurrence matrices; language identification; script identification; texture analysis; training documents; Computer science; Document image processing; Image segmentation; Image texture analysis; Independent component analysis; Natural languages; Optical character recognition software; Optical noise; Packaging; Testing;
Conference_Title :
Document Image Analysis, 1997. (DIA '97) Proceedings., Workshop on
Conference_Location :
San Juan
Print_ISBN :
0-8186-8055-5
DOI :
10.1109/DIA.1997.627086