• DocumentCode
    2038118
  • Title
    Document layout analysis and reading order determination for a reading robot
  • Author
    Pan, Yucun ; Zhao, Qunfei ; Kamata, Seiichiro
  • Author_Institution
    Sch. of Electron., Inf. & Electr. Eng., Shanghai Jiao Tong Univ., Shanghai, China
  • fYear
    2010
  • fDate
    21-24 Nov. 2010
  • Firstpage
    1607
  • Lastpage
    1612
  • Abstract
    This paper proposes an efficient approach to document layout analysis and reading order determination for a reading robot. First, the input document images are preprocessed to remove noise, connect lines and domains, and reduce computation time. Second, a bottom-up, parameter-independent, two-step layout analysis algorithm based on morphology outlines the geometry of the maximum homogeneous regions and classifies them into text, tables, and pictures. Finally, the reading order is determined by a top-down recursive hierarchy algorithm derived from XY-cut, using a set of rules that depend on layout information. Important parameters are derived from statistical information about the given images so that the method adapts to different types of documents. The proposed algorithm is applied to a large number of document images, and the experimental results show that it enables the reading robot to read paper documents in different languages, even those with complex layout structures.
  • Keywords
    document image processing; optical character recognition; robot vision; XY-cut; computation time reduction; document images; document layout analysis; layout information; reading order determination; reading robot; statistic information; top-down recursive hierarchy algorithm; two-step layout analysis algorithm; adaptive; hierarchy; layout analysis; morphology based
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    TENCON 2010 - 2010 IEEE Region 10 Conference
  • Conference_Location
    Fukuoka
  • ISSN
    pending
  • Print_ISBN
    978-1-4244-6889-8
  • Type
    conf
  • DOI
    10.1109/TENCON.2010.5686038
  • Filename
    5686038