OCR-Free Table of Contents Detection in Urdu Books

Author

Ul-Hasan, Adnan ; Bukhari, Syed Saqib ; Shafait, Faisal ; Breuel, Thomas M.

Author_Institution

Dept. of Comput. Sci., Tech. Univ. of Kaiserslautern, Kaiserslautern, Germany

fYear

2012

fDate

27-29 March 2012

Firstpage

404

Lastpage

408

Abstract

A Table of Contents (ToC) is an integral part of multiple-page documents such as books and magazines. Most existing techniques use textual similarity to detect ToC pages automatically. However, such techniques cannot be applied where OCR technology is not available, as is the case for historical documents and many modern Nabataean (Arabic) and Indic scripts. It is therefore necessary to develop tools to navigate such documents without OCR. This paper reports a preliminary effort to address this challenge. The proposed algorithm has been applied to find ToC pages in Urdu books, achieving an overall initial accuracy of 88%.

Keywords

document image processing; history; Indic scripts; OCR technology; OCR-free table of contents detection; Urdu books; historical documents; magazines; modern Nabataean scripts; multiple-page documents; textual similarity; Feature extraction; Image segmentation; Navigation; Optical character recognition software; Text analysis; Training; Vectors; Auto MLP; Book structure extraction; OCR-free ToC detection; Urdu document image analysis

fLanguage

English

Publisher

IEEE

Conference_Titel

Document Analysis Systems (DAS), 2012 10th IAPR International Workshop on

Conference_Location

Gold Coast, QLD

Print_ISBN

978-1-4673-0868-7

Type

conf

DOI

10.1109/DAS.2012.59

Filename

6195403