DocumentCode
2012283
Title
OCR-Free Table of Contents Detection in Urdu Books
Author
Ul-Hasan, Adnan ; Bukhari, Syed Saqib ; Shafait, Faisal ; Breuel, Thomas M.
Author_Institution
Dept. of Comput. Sci., Tech. Univ. of Kaiserslautern, Kaiserslautern, Germany
fYear
2012
fDate
27-29 March 2012
Firstpage
404
Lastpage
408
Abstract
A Table of Contents (ToC) is an integral part of multi-page documents such as books and magazines. Most existing techniques detect ToC pages automatically using textual similarity. However, such techniques cannot be applied where OCR technology is not available, which is the case for historical documents and for many modern Nabataean (Arabic) and Indic scripts. It is therefore necessary to develop tools for navigating such documents without OCR. This paper reports a preliminary effort to address this challenge. The proposed algorithm has been applied to find ToC pages in Urdu books, achieving an overall initial accuracy of 88%.
Keywords
document image processing; history; Indic scripts; OCR technology; OCR-free table of contents detection; Urdu books; historical documents; magazines; modern Nabataean scripts; multiple-page documents; textual similarity; Feature extraction; Image segmentation; Navigation; Optical character recognition software; Text analysis; Training; Vectors; Auto MLP; Book structure extraction; OCR-free ToC detection; Urdu document image analysis;
fLanguage
English
Publisher
ieee
Conference_Titel
Document Analysis Systems (DAS), 2012 10th IAPR International Workshop on
Conference_Location
Gold Coast, QLD
Print_ISBN
978-1-4673-0868-7
Type
conf
DOI
10.1109/DAS.2012.59
Filename
6195403