• DocumentCode
    2912950
  • Title

    Document Layout Analysis and Classification and Its Application in OCR

  • Author

    Gupta, Gaurav ; Niranjan, Shobhit ; Shrivastava, Ankit ; Sinha, Dr RMK

  • Author_Institution
    Indian Institute of Technology Kanpur
  • fYear
    2006
  • fDate
    16-20 Oct. 2006
  • Firstpage
    58
  • Lastpage
    58
  • Abstract
    Digitization of paper-bound documents is one of the foremost commercial interests worldwide. The first step in all such applications is transforming a paper-bound document into an electronic document by scanning, then applying OCR to the image to generate textual information from the document image. In this paper we describe our work, which acts as a pre-processing stage for an OCR application. Automatic document layout extraction and segmentation is done using the spatial configuration of various text/image segments represented as bounding boxes; this segmented layout is then analyzed with certain heuristic tests, and each segment is assigned a label (title, authors, abstract, body, header, footer, etc.). This information is then passed on to the OCR module through an XML interface, accelerating its performance by allowing it to label recognized text segments and to identify only those parts of the document that contain text, resulting in savings in computation. Although the work has been motivated by application to an automated machine translation system that preserves the overall document layout, it has a number of other applications, such as information retrieval and search. This information is also being used to classify technical documents into three categories, which can be extended to any number of classes based on spatial configuration heuristics.
  • Keywords
    Acceleration; Automatic testing; Data mining; Image analysis; Image generation; Image segmentation; Optical character recognition software; Text analysis; Text recognition; XML;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Enterprise Distributed Object Computing Conference Workshops, 2006. EDOCW '06. 10th IEEE International
  • Conference_Location
    Hong Kong, China
  • Print_ISBN
    0-7695-2743-4
  • Type

    conf

  • DOI
    10.1109/EDOCW.2006.29
  • Filename
    4031317
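
The pipeline outlined in the abstract (segments as bounding boxes, heuristic labeling, XML hand-off to OCR) could look roughly like the sketch below. This is not the authors' implementation: the `Segment` type, the position thresholds in `label_segment`, and the XML element and attribute names are all assumptions chosen for illustration, and only a few of the labels named in the abstract are covered.

```python
from dataclasses import dataclass
import xml.etree.ElementTree as ET


@dataclass
class Segment:
    """A page region as a bounding box; origin is the top-left corner."""
    x: int
    y: int
    width: int
    height: int
    is_text: bool = True


def label_segment(seg, page_height):
    """Label a segment from its vertical position alone.

    The thresholds (5%, 20%, 95% of page height) are hypothetical;
    the paper does not state which heuristic tests it applies.
    """
    if not seg.is_text:
        return "image"
    top = seg.y / page_height
    bottom = (seg.y + seg.height) / page_height
    if bottom <= 0.05:
        return "header"
    if top >= 0.95:
        return "footer"
    if top < 0.20:
        return "title"
    return "body"


def layout_to_xml(segments, page_height):
    """Serialize the labeled layout; the schema here is an assumption."""
    root = ET.Element("page", height=str(page_height))
    for seg in segments:
        ET.SubElement(
            root,
            "segment",
            label=label_segment(seg, page_height),
            x=str(seg.x),
            y=str(seg.y),
            width=str(seg.width),
            height=str(seg.height),
        )
    return ET.tostring(root, encoding="unicode")


def text_regions(xml_string):
    """Return only the segments an OCR module would need to process."""
    root = ET.fromstring(xml_string)
    return [s for s in root.findall("segment") if s.get("label") != "image"]


if __name__ == "__main__":
    page = [
        Segment(100, 10, 400, 30),
        Segment(100, 100, 400, 60),
        Segment(50, 300, 500, 500),
        Segment(60, 820, 200, 100, is_text=False),
        Segment(100, 960, 400, 20),
    ]
    print(layout_to_xml(page, page_height=1000))
```

Filtering with `text_regions` reflects the computational saving the abstract mentions: image segments are skipped, so the OCR module only visits regions that are expected to contain text.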