Title :
Rule-based middle-level character detection for simplifying Thai document layout analysis
Author :
Yingsaeree, Chaiyakorn ; Kawtrakul, Asanee
Author_Institution :
Dept. of Comput. Eng., Kasetsart Univ., Bangkok, Thailand
Date :
29 Aug.-1 Sept. 2005
Abstract :
Although machine-printed Thai character recognition has been an intensive research area over the past decade, only a few results are available for Thai document layout analysis. Moreover, methods proposed for other languages cannot be applied directly to Thai documents, because Thai text has a unique characteristic: characters can be placed at four different levels. This paper proposes an approach that eliminates this characteristic by removing nonmiddle-level characters from the image, using heuristic rules derived from properties of the Thai language: nonmiddle-level characters are usually smaller than middle-level characters, and the gap between levels is smaller than the gap between two consecutive lines. Once these characters are removed, existing methods can be applied to Thai documents without modification. Experimental results show that the proposed method removes nonmiddle-level characters from 200 test images with 99.46% accuracy, even when an image contains various font sizes.
Keywords :
character recognition; document image processing; natural languages; Thai document layout analysis; Thai language; heuristic rules; rule-based middle-level character detection; Algorithm design and analysis; Image analysis; Image segmentation; Performance analysis; Publishing; Speech analysis; Testing; Text analysis
Conference_Title :
Document Analysis and Recognition, 2005. Proceedings. Eighth International Conference on
Print_ISBN :
0-7695-2420-6
DOI :
10.1109/ICDAR.2005.204