DocumentCode :
675617
Title :
A highly effective approach for document page layout extraction system
Author :
Tangwongsan, Supachai ; Boondireke, Cholticha
Author_Institution :
Fac. of Inf. & Commun. Technol., Mahidol Univ., Bangkok, Thailand
fYear :
2013
fDate :
17-19 Dec. 2013
Firstpage :
85
Lastpage :
90
Abstract :
In this paper, we propose a highly effective scheme for a document page layout extraction system as part of the character recognition process. The working model has three stages: document segmentation, document layout classification, and document reading order determination. In the first stage, a hybrid document segmentation method decomposes a page of the document image into a variety of blocks by combining diagonal white runs and vertical edge segmentation with a modified histogram projection. Next, features related to the geometric layout of the page are extracted by feature analysis and combined with a rule-based approach to classify block types and attributes. In the third stage, a highly efficient algorithm, block order sequencing search (BOSS), is introduced to determine the correct reading sequence of the blocks in the page. The model is tested on a large number of bilingual Thai and English documents with different geometric patterns, multiple columns, rows, fonts, and sizes. The results are promising, with an accuracy rate of 99.47% and an average speed of 2.887 seconds per page in the experiment.
Keywords :
character recognition; document image processing; feature extraction; image classification; image segmentation; BOSS; English languages; Thai languages; bilingual documents; block attributes; block order sequencing search; block types; character recognition process; diagonal white runs; document image decomposition; document layout classification; document page layout extraction system; document reading order determination; document segmentation; feature analysis; feature extraction; geometric layout; geometric patterns; modified histogram projection; rule-based approach; vertical edges segmentation; Character recognition; Feature extraction; Image edge detection; Image segmentation; Layout; Niobium; Sequential analysis; Pattern recognition; document segmentation; page layout classification; reading order determination;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Wavelet Active Media Technology and Information Processing (ICCWAMTIP), 2013 10th International Computer Conference on
Conference_Location :
Chengdu
Print_ISBN :
978-1-4799-2445-5
Type :
conf
DOI :
10.1109/ICCWAMTIP.2013.6716605
Filename :
6716605