DocumentCode :
2912950
Title :
Document Layout Analysis and Classification and Its Application in OCR
Author :
Gupta, Gaurav ; Niranjan, Shobhit ; Shrivastava, Ankit ; Sinha, Dr RMK
Author_Institution :
Indian Institute of Technology Kanpur
fYear :
2006
fDate :
16-20 Oct. 2006
Firstpage :
58
Lastpage :
58
Abstract :
Digitization of paper-bound documents is one of the foremost commercial interests worldwide. The first step in all such applications is transforming a paper-bound document into an electronic document by scanning it and then applying OCR to the image to generate textual information from it. In this paper we describe our work, which acts as a pre-processing stage for OCR applications. Automatic document layout extraction and segmentation is done using the spatial configuration of the various text/image segments, represented as bounding boxes; this segmented layout is then analyzed with certain heuristic tests, and each segment is assigned a label (title, authors, abstract, body, header, footer, etc.). This information is then passed to the OCR module through an XML interface, accelerating its performance by allowing it to label recognized text segments and to process only those parts of the document that contain text, which saves computation. Although the work was motivated by application to an automated machine translation system that preserves the overall document layout, it has a number of other applications, such as information retrieval and search. This information is also being used to classify technical documents into three categories, which can be extended to any number of classes based on spatial configuration heuristics.
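The pipeline the abstract outlines (bounding boxes in, heuristic labels out, XML handed to the OCR stage) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the `Segment` type, the `label_segment` and `layout_to_xml` helpers, the position thresholds, and the XML element and attribute names are all hypothetical.

```python
from dataclasses import dataclass
import xml.etree.ElementTree as ET


@dataclass
class Segment:
    """A bounding box in page pixel coordinates (hypothetical type)."""
    x: int
    y: int
    w: int
    h: int
    kind: str  # "text" or "image"


def label_segment(seg: Segment, page_height: int) -> str:
    """Assign a label from vertical position alone.

    The thresholds below are illustrative assumptions, not the
    heuristic tests used by the authors.
    """
    if seg.kind != "text":
        return "image"
    top = seg.y / page_height
    bottom = (seg.y + seg.h) / page_height
    if bottom <= 0.05:
        return "header"
    if top >= 0.95:
        return "footer"
    if top < 0.20:
        return "title"
    return "body"


def layout_to_xml(segments, page_width: int, page_height: int) -> str:
    """Serialize labeled segments to XML (the schema is an assumption)."""
    root = ET.Element("page", width=str(page_width), height=str(page_height))
    for seg in segments:
        ET.SubElement(
            root,
            "segment",
            label=label_segment(seg, page_height),
            x=str(seg.x),
            y=str(seg.y),
            w=str(seg.w),
            h=str(seg.h),
        )
    return ET.tostring(root, encoding="unicode")


def ocr_targets(xml_text: str):
    """Return only the segments an OCR stage would need to process."""
    root = ET.fromstring(xml_text)
    return [s for s in root if s.get("label") != "image"]
```

Under these assumptions, an OCR stage would read the XML, skip the segments labeled `image`, and carry each remaining label over to the text it recognizes, which is the source of the computational saving the abstract mentions.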
Keywords :
Acceleration; Automatic testing; Data mining; Image analysis; Image generation; Image segmentation; Optical character recognition software; Text analysis; Text recognition; XML;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Enterprise Distributed Object Computing Conference Workshops, 2006. EDOCW '06. 10th IEEE International
Conference_Location :
Hong Kong, China
Print_ISBN :
0-7695-2743-4
Type :
conf
DOI :
10.1109/EDOCW.2006.29
Filename :
4031317