• DocumentCode
    2142239
  • Title

    HMM-Based Alignment of Inaccurate Transcriptions for Historical Documents

  • Author

    Fischer, Andreas ; Indermühle, Emanuel ; Frinken, Volkmar ; Bunke, Horst

  • Author_Institution
    Inst. of Comput. Sci. & Appl. Math., Univ. of Bern, Bern, Switzerland
  • fYear
    2011
  • fDate
    18-21 Sept. 2011
  • Firstpage
    53
  • Lastpage
    57
  • Abstract
    For historical documents, available transcriptions are typically inaccurate when compared with the scanned document images. Not only are the positions of the words and sentences unknown, but the transcription may also not exactly match the text in the image. An error-tolerant alignment is needed to make the document images amenable to browsing and searching in digital libraries. In this paper, we propose a novel multi-pass alignment method based on Hidden Markov Models (HMM) that combines text line recognition, string alignment, and keyword spotting to cope with word substitutions, deletions, and insertions in the transcription. In a segmentation-free approach, transcriptions of complete pages are aligned with sequences of text line images. On the Parzival data set, results are reported for several degrees of artificial distortion. Both the accuracy and the efficiency of the proposed system are promising for real-world applications.
  • Keywords
    document image processing; hidden Markov models; text analysis; HMM-based alignment; Parzival data set; browsing; digital library; error-tolerant alignment; hidden Markov model; historical documents; image transcription; inaccurate transcription; keyword spotting; multipass alignment method; scanned document image; searching; string alignment; text line recognition; Accuracy; Feature extraction; Handwriting recognition; Image segmentation; Text recognition
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Document Analysis and Recognition (ICDAR), 2011 International Conference on
  • Conference_Location
    Beijing
  • ISSN
    1520-5363
  • Print_ISBN
    978-1-4577-1350-7
  • Electronic_ISBN
    1520-5363
  • Type

    conf

  • DOI
    10.1109/ICDAR.2011.20
  • Filename
    6065275