• DocumentCode
    3489570
  • Title

    Table of Contents Recognition and Extraction for Heterogeneous Book Documents

  • Author

    Zhaohui Wu ; Mitra, Pinaki ; Giles, C. Lee

  • Author_Institution
    Comput. Sci. & Eng., Pennsylvania State Univ., University Park, PA, USA
  • fYear
    2013
  • fDate
    25-28 Aug. 2013
  • Firstpage
    1205
  • Lastpage
    1209
  • Abstract
    Existing work on book table of contents (TOC) recognition has focused almost entirely on small, application-dependent, domain-specific datasets. However, the TOCs of books from different domains differ significantly in visual layout and style, which makes TOC recognition a challenging problem for a large-scale collection of heterogeneous books. We observe that TOCs can be grouped into three basic styles, namely "flat", "ordered", and "divided", and that this grouping gives insight into how to parse TOCs effectively. We therefore propose a new TOC recognition approach that adaptively selects the most appropriate TOC parsing rules based on a classification of the TOC into one of these three styles. Evaluation on a large collection of over 25,000 book documents from various domains demonstrates the effectiveness and efficiency of the approach.
  • Keywords
    document image processing; feature extraction; grammars; image classification; image recognition; table lookup; TOC parsing rule; application-dependent datasets; book TOC recognition approach; domain-specific datasets; heterogeneous book documents; table of content extraction; table of content recognition; visual layout; visual style; Feature extraction; Joining processes; Measurement; Portable document format; Runtime; Sections; Visualization;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Document Analysis and Recognition (ICDAR), 2013 12th International Conference on
  • Conference_Location
    Washington, DC
  • ISSN
    1520-5363
  • Type

    conf

  • DOI
    10.1109/ICDAR.2013.244
  • Filename
    6628805
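
The abstract describes a two-step idea: classify a TOC as "flat", "ordered", or "divided", then apply parsing rules chosen for that style. The following is a minimal Python sketch of that dispatch pattern only. The regular expressions, the majority threshold, and the names `classify_toc_style` and `parse_toc` are all hypothetical illustrations; the record does not state the paper's actual features or rules, and this sketch should not be read as the authors' method.

```python
import re

# Assumed heuristics for illustration only (not taken from the paper).
# "Ordered" entries start with a number such as "1", "1.", or "2.3".
NUMBER = re.compile(r"^\s*(\d+(?:\.\d+)*)[.)]?\s+\S")
# "Divided" TOCs contain divider lines such as "Part I" or "Chapter 2".
DIVIDER = re.compile(r"^\s*(part|chapter|section|book|unit)\b", re.IGNORECASE)


def classify_toc_style(lines):
    """Guess the TOC style from raw text lines using the assumed heuristics."""
    entries = [ln for ln in lines if ln.strip()]
    if not entries:
        return "flat"
    numbered = sum(bool(NUMBER.match(ln)) for ln in entries)
    dividers = sum(bool(DIVIDER.match(ln)) for ln in entries)
    # Assumed threshold: at least half of the entries are numbered.
    if numbered * 2 >= len(entries):
        return "ordered"
    if dividers > 0:
        return "divided"
    return "flat"


def parse_toc(lines):
    """Return (style, [(level, text), ...]) using style-specific rules."""
    entries = [ln.strip() for ln in lines if ln.strip()]
    style = classify_toc_style(entries)
    parsed = []
    for text in entries:
        if style == "ordered":
            # Depth is the number of components in the leading number.
            m = NUMBER.match(text)
            level = m.group(1).count(".") + 1 if m else 1
        elif style == "divided":
            # Divider lines are top level; everything else nests below.
            level = 1 if DIVIDER.match(text) else 2
        else:
            # Flat TOCs have a single level.
            level = 1
        parsed.append((level, text))
    return style, parsed
```

Calling `parse_toc(["1 Introduction", "1.1 Motivation", "2 Methods"])` would classify the input as "ordered" under these assumed rules and assign nesting levels from the leading numbers. A real system working on PDF or scanned books would need layout features (indentation, font, page-number columns) that this text-only sketch ignores.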