• DocumentCode
    2718975
  • Title
    Autonomous cleaning of corrupted scanned documents — A generative modeling approach
  • Author
    Dai, Zhenwen ; Lücke, Jörg
  • Author_Institution
    Dept. of Phys., Goethe-Univ. Frankfurt, Frankfurt am Main, Germany
  • fYear
    2012
  • fDate
    16-21 June 2012
  • Firstpage
    3338
  • Lastpage
    3345
  • Abstract
    We study the task of cleaning scanned text documents that are strongly corrupted by dirt such as manual line strokes, spilled ink, etc. We aim at autonomously removing dirt from a single letter-size page based only on the information the page contains. Our approach, therefore, has to learn character representations without supervision and requires a mechanism to distinguish learned representations from irregular patterns. To learn character representations, we use a probabilistic generative model parameterizing pattern features, feature variances, the features' planar arrangements, and pattern frequencies. The latent variables of the model describe pattern class, pattern position, and the presence or absence of individual pattern features. The model parameters are optimized using a novel variational EM approximation. After learning, the parameters represent, independently of their absolute position, planar feature arrangements and their variances. A quality measure defined based on the learned representation then allows for an autonomous discrimination between regular character patterns and the irregular patterns making up the dirt. The irregular patterns can thus be removed to clean the document. For a full Latin alphabet we found that a single page does not contain sufficiently many character examples. However, even if heavily corrupted by dirt, we show that a page containing a lower number of character types can efficiently and autonomously be cleaned solely based on the structural regularity of the characters it contains. In different examples using characters from different alphabets, we demonstrate generality of the approach and discuss its implications for future developments.
  • Keywords
    approximation theory; document image processing; expectation-maximisation algorithm; feature extraction; learning (artificial intelligence); probability; text analysis; Latin alphabet; autonomous cleaning; autonomous discrimination; character representation learning; character structural regularity; corrupted scanned text documents; feature planar arrangement; feature variances; irregular character patterns; manual line strokes; pattern class; pattern features; pattern frequency; pattern position; probabilistic generative modelling approach; spilled ink; variational EM approximation; Approximation methods; Computational modeling; Data models; Histograms; Probabilistic logic; Robustness; Vectors;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on
  • Conference_Location
    Providence, RI
  • ISSN
    1063-6919
  • Print_ISBN
    978-1-4673-1226-4
  • Electronic_ISBN
    1063-6919
  • Type
    conf
  • DOI
    10.1109/CVPR.2012.6248072
  • Filename
    6248072