DocumentCode
949174
Title
Script and Language Identification in Noisy and Degraded Document Images
Author
Lu, Shijian ; Tan, Chew Lim
Author_Institution
Nat. Univ. of Singapore, Singapore
Volume
30
Issue
1
fYear
2008
Firstpage
14
Lastpage
24
Abstract
This paper reports an identification technique that detects the scripts and languages of noisy and degraded document images. In the proposed technique, scripts and languages are identified through document vectorization, which converts each document image into a document vector that characterizes the shape and frequency of the character or word images it contains. Document images are vectorized by using vertical component cuts and character extremum points, both of which are tolerant to variation in text fonts and styles, noise, and various types of document degradation. For each script or language under study, a script or language template is first constructed through a training process. The scripts and languages of document images are then determined according to the distances between the converted document vectors and the preconstructed script and language templates. Experimental results show that the proposed technique is accurate, easy to extend, and tolerant to noise and various types of document degradation.
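The template-matching step described in the abstract can be illustrated with a minimal sketch. It assumes each document has already been reduced to a fixed-length count vector; the feature extraction itself (vertical component cuts and character extremum points) is not reproduced here. The function names, the toy vectors, the averaging used to build templates, and the choice of Euclidean distance are all assumptions made for illustration, not details taken from the paper.

```python
import math

def normalize(vec):
    """Scale a count vector so its entries sum to 1 (a frequency vector)."""
    total = sum(vec)
    if total == 0:
        return [0.0] * len(vec)
    return [v / total for v in vec]

def build_template(training_vectors):
    """Average the normalized training vectors of one script or language.

    Averaging is an assumption; the abstract only says templates come
    from a training process.
    """
    normalized = [normalize(v) for v in training_vectors]
    dim = len(normalized[0])
    return [sum(v[i] for v in normalized) / len(normalized) for i in range(dim)]

def distance(a, b):
    """Euclidean distance (assumed; the abstract does not name the metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def identify(doc_vector, templates):
    """Return the label of the template closest to the document vector."""
    doc = normalize(doc_vector)
    return min(templates, key=lambda label: distance(doc, templates[label]))

# Hypothetical toy data: three-bin count vectors for two made-up classes.
TRAINING = {
    "script_a": [[8, 1, 1], [7, 2, 1]],
    "script_b": [[1, 8, 1], [2, 6, 2]],
}
TEMPLATES = {label: build_template(vecs) for label, vecs in TRAINING.items()}

print(identify([9, 1, 0], TEMPLATES))
print(identify([1, 7, 2], TEMPLATES))
```

In this sketch, supporting a new script or language would only require building one more template from its training vectors, which is one plausible reading of the abstract's claim that the technique is easy to extend.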
Keywords
document image processing; text analysis; character extremum point; character images; document degradation; document images; document text; document vectorization; script-language identification; vertical component cuts; word images; Document analysis; association rules; classification; clustering; language identification; script identification; shape; Algorithms; Artificial Intelligence; Automatic Data Processing; Computer Simulation; Documentation; Image Enhancement; Image Interpretation, Computer-Assisted; Information Storage and Retrieval; Language; Models, Statistical; Natural Language Processing; Pattern Recognition, Automated; User-Computer Interface;
fLanguage
English
Journal_Title
Pattern Analysis and Machine Intelligence, IEEE Transactions on
Publisher
IEEE
ISSN
0162-8828
Type
jour
DOI
10.1109/TPAMI.2007.1158
Filename
4359308