• DocumentCode
    2010801
  • Title
    An Efficient Framework for Searching Text in Noisy Document Images
  • Author
    Yalniz, Ismet Zeki; Manmatha, R.
  • Author_Institution
    Dept. of Comput. Sci., Univ. of Massachusetts, Amherst, MA, USA
  • fYear
    2012
  • fDate
    27-29 March 2012
  • Firstpage
    48
  • Lastpage
    52
  • Abstract
    An efficient word spotting framework is proposed for searching text in scanned books. The method allows one to search for words when optical character recognition (OCR) fails due to noise, or for languages for which no OCR is available. Given a query word image, the aim is to retrieve matching words in the book, ranked by similarity. In the offline stage, SIFT descriptors are extracted at the corner points of each word image. These features are quantized into visual terms (visterms) using a hierarchical K-Means algorithm and indexed using an inverted file. In the query resolution stage, candidate matches are efficiently identified using the inverted index. These word images are then forwarded to the next stage, where the configuration of visterms on the image plane is tested. Configuration matching is performed efficiently by projecting the visterms onto the horizontal axis and searching for the Longest Common Subsequence (LCS) between the sequences of visterms. The framework is tested on one English and two Telugu books. It is shown that the method resolves a typical user query in under 10 milliseconds while providing very high retrieval accuracy (Mean Average Precision 0.93). The search accuracy for the English book is comparable to that of searching text in the high-accuracy output of a commercial OCR engine.
  • Keywords
    document image processing; image matching; image retrieval; natural language processing; optical character recognition; pattern clustering; text analysis; transforms; word processing; English book; SIFT descriptors; Telugu books; commercial OCR engine; feature quantization; hierarchical k-means algorithm; image plane; indexing; inverted file; longest common subsequence; matching word retrieval accuracy; mean average precision; noisy document image; optical character recognition; query resolution stage; query word image; scanned books; text searching; user query; visterm configuration matching; visual terms; word searching; word spotting framework; Accuracy; Detectors; Feature extraction; Image resolution; Noise; Optical character recognition software; Vocabulary; document image search; image retrieval; word spotting;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Document Analysis Systems (DAS), 2012 10th IAPR International Workshop on
  • Conference_Location
    Gold Coast, QLD
  • Print_ISBN
    978-1-4673-0868-7
  • Type
    conf
  • DOI
    10.1109/DAS.2012.18
  • Filename
    6195333
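
The two-stage query resolution described in the abstract (candidate lookup in an inverted file, then LCS matching of visterm sequences ordered along the horizontal axis) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes each word image has already been reduced to a left-to-right list of integer visterm ids, so SIFT extraction and hierarchical K-Means quantization are out of scope. The function names, the `min_shared` threshold, and the normalization of the LCS length by the longer sequence are all hypothetical choices made for illustration.

```python
from collections import defaultdict


def build_inverted_index(words):
    """Map each visterm id to the set of word-image ids that contain it."""
    index = defaultdict(set)
    for word_id, visterms in words.items():
        for v in visterms:
            index[v].add(word_id)
    return index


def lcs_length(a, b):
    """Length of the longest common subsequence, O(len(a) * len(b)) time."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def search(query, words, index, min_shared=1):
    """Return (word_id, score) pairs, best match first.

    Stage 1 collects candidates that share at least `min_shared` distinct
    visterms with the query. Stage 2 re-ranks them by LCS length divided by
    the longer sequence length (an assumed normalization, not from the paper).
    """
    votes = defaultdict(int)
    for v in set(query):
        for word_id in index.get(v, ()):
            votes[word_id] += 1
    candidates = [w for w, n in votes.items() if n >= min_shared]

    scored = []
    for w in candidates:
        seq = words[w]
        score = lcs_length(query, seq) / max(len(query), len(seq))
        scored.append((w, score))
    # Sort by descending score; break ties by word id for determinism.
    scored.sort(key=lambda t: (-t[1], t[0]))
    return scored


if __name__ == "__main__":
    # Toy data: visterm ids listed in left-to-right order for each word image.
    words = {
        "w1": [3, 7, 7, 2, 9],
        "w2": [3, 7, 2],
        "w3": [9, 2, 7, 3],
        "w4": [5, 6],
    }
    index = build_inverted_index(words)
    for word_id, score in search([3, 7, 2, 9], words, index):
        print(word_id, round(score, 2))
```

In this toy data, `w3` contains the same visterms as the query but in reverse order, so it survives the inverted-index stage and is then ranked last by the LCS stage. `w4` shares no visterm with the query and is never compared at all.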