• DocumentCode
    2012283
  • Title
    OCR-Free Table of Contents Detection in Urdu Books
  • Author
    Ul-Hasan, Adnan ; Bukhari, Syed Saqib ; Shafait, Faisal ; Breuel, Thomas M.
  • Author_Institution
    Dept. of Comput. Sci., Tech. Univ. of Kaiserslautern, Kaiserslautern, Germany
  • fYear
    2012
  • fDate
    27-29 March 2012
  • Firstpage
    404
  • Lastpage
    408
  • Abstract
    Table of Contents (ToC) is an integral part of multiple-page documents like books, magazines, etc. Most of the existing techniques use textual similarity for automatically detecting ToC pages. However, such techniques may not be applied for detection of ToC pages in situations where OCR technology is not available, which is indeed true for historical documents and many modern Nabataean (Arabic) and Indic scripts. It is, therefore, necessary to develop tools to navigate through such documents without the use of OCR. This paper reports a preliminary effort to address this challenge. The proposed algorithm has been applied to find Table of Contents (ToC) pages in Urdu books and an overall initial accuracy of 88% has been achieved.
  • Keywords
    document image processing; history; Indic scripts; OCR technology; OCR-free table of contents detection; Urdu books; historical documents; magazines; modern Nabataean scripts; multiple-page documents; textual similarity; Feature extraction; Image segmentation; Navigation; Optical character recognition software; Text analysis; Training; Vectors; Auto MLP; Book structure extraction; OCR-free ToC detection; Urdu document image analysis;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Document Analysis Systems (DAS), 2012 10th IAPR International Workshop on
  • Conference_Location
    Gold Coast, QLD
  • Print_ISBN
    978-1-4673-0868-7
  • Type
    conf
  • DOI
    10.1109/DAS.2012.59
  • Filename
    6195403