• DocumentCode
    2011222
  • Title

    An Efficient Coarse-to-Fine Indexing Technique for Fast Text Retrieval in Historical Documents

  • Author

    Roy, Partha Pratim ; Rayar, Frédéric ; Ramel, Jean-Yves

  • Author_Institution
    Lab. d'Inf., Univ. François Rabelais, Tours, France
  • fYear
    2012
  • fDate
    27-29 March 2012
  • Firstpage
    150
  • Lastpage
    154
  • Abstract
    In this paper, we present a fast text retrieval system to index and browse degraded historical documents. The indexing and retrieval strategy follows a two-level, coarse-to-fine approach to increase the speed of the retrieval process. During the indexing step, the text parts of the images are encoded into sequences of primitives obtained from two different codebooks: a coarse one corresponding to connected components and a fine one corresponding to glyph primitives. A glyph consists of a single character or a part of a character, depending on the shape complexity. During the querying step, the coarse and fine signatures are generated from the query image using both codebooks. Then, a bi-level approximate string matching algorithm is applied to find similar words, using the coarse approach first and then the fine approach if necessary, by exploiting predetermined hypothetical locations. An experimental evaluation on datasets of real-life document images, gathered from historical books in different scripts, demonstrates the speed improvement and good accuracy in the presence of degradation.
  • Keywords
    approximation theory; document image processing; image coding; image matching; image retrieval; image sequences; indexing; string matching; text analysis; bilevel approximate string matching algorithm; coarse-to-fine indexing technique; codebooks; degraded historical document browsing; glyph primitives; image encoding; image sequences; query image; querying step; shape complexity; text retrieval system; word similarity; Approximation algorithms; Degradation; Image segmentation; Indexing; Reservoirs; Shape; Approximate String Matching; Historical Documents; Word Spotting;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Document Analysis Systems (DAS), 2012 10th IAPR International Workshop on
  • Conference_Location
    Gold Coast, QLD
  • Print_ISBN
    978-1-4673-0868-7
  • Type

    conf

  • DOI
    10.1109/DAS.2012.17
  • Filename
    6195353
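
The bi-level matching described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the threshold parameters (`coarse_accept`, `coarse_reject`, `fine_max`), the integer codebook labels, and the location strings are all hypothetical. The sketch also compares whole word signatures with plain edit distance, which is a simplification of the approximate string matching the paper applies.

```python
from typing import List, Sequence, Tuple

# Hypothetical index entry: (location label, coarse signature, fine signature).
# Signatures are sequences of integer codebook labels (an assumption).
Entry = Tuple[str, List[int], List[int]]


def edit_distance(a: Sequence[int], b: Sequence[int]) -> int:
    """Levenshtein distance between two label sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def search(query_coarse: Sequence[int],
           query_fine: Sequence[int],
           index: List[Entry],
           coarse_accept: int = 0,
           coarse_reject: int = 2,
           fine_max: int = 1) -> List[str]:
    """Coarse pass first; the fine pass runs only on ambiguous candidates."""
    hits = []
    for location, coarse, fine in index:
        d = edit_distance(query_coarse, coarse)
        if d <= coarse_accept:
            # The coarse signature alone is close enough to accept.
            hits.append(location)
        elif d <= coarse_reject:
            # Ambiguous at the coarse level: verify with the fine signature.
            if edit_distance(query_fine, fine) <= fine_max:
                hits.append(location)
        # Otherwise the candidate is rejected without any fine comparison.
    return hits


# Toy index with made-up labels and locations.
index = [
    ("p1-w3", [1, 2, 3], [10, 11, 12, 13]),
    ("p2-w7", [1, 2, 4], [10, 11, 12, 14]),
    ("p4-w1", [1, 5, 4], [10, 20, 21, 22]),
    ("p5-w2", [7, 8, 9], [10, 11, 12, 13]),
]
hits = search([1, 2, 3], [10, 11, 12, 13], index)
print(hits)
```

In this toy run, the first entry is accepted at the coarse level, the second and third are passed to the fine comparison, and the last is discarded after the coarse pass alone. The speed gain in such a scheme comes from skipping the fine comparison for most candidates.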