• DocumentCode
    1637791
  • Title
    Automated Ground Truth Data Generation for Newspaper Document Images
  • Author
    Strecker, Thomas ; van Beusekom, J. ; Albayrak, Sahin ; Breuel, Thomas M.
  • Author_Institution
    DAI Labor Tech. Univ. Berlin, Berlin, Germany
  • fYear
    2009
  • Firstpage
    1275
  • Lastpage
    1279
  • Abstract
    In document image understanding, public datasets with ground truth are an important part of scientific work. They are not only helpful for developing new methods, but also provide a way of comparing performance. Generating these datasets, however, is time-consuming and cost-intensive, requiring a lot of manual effort. In this paper we both propose a way to semi-automatically generate ground-truthed datasets for newspapers and provide a comprehensive dataset. The focus of this paper is layout analysis ground truth. The proposed two-step approach consists of a module that automatically creates layouts and an image matching module that maps the ground truth information from the synthetic layout to the scanned version. In the first step, layouts are generated automatically from a news corpus. The output consists of a digital newspaper (PDF file) and an XML file containing geometric and logical layout information. In the second step, the PDF files are printed and scanned, and each scan is aligned with the synthetic image obtained by rendering the PDF. Finally, the geometric and logical layout ground truth is mapped onto the scanned image.
  • Keywords
    XML; data analysis; document image processing; geometry; image matching; publishing; rendering (computer graphics); XML file; automated ground truth data generation; cost-intensive work; newspaper document image; public dataset; time consuming; Artificial intelligence; Focusing; Image analysis; Image recognition; Image segmentation; Pattern analysis; Pattern recognition; Text analysis; automatic ground-truth; automatic layout; comparison; dataset; layout analysis
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Document Analysis and Recognition, 2009. ICDAR '09. 10th International Conference on
  • Conference_Location
    Barcelona
  • ISSN
    1520-5363
  • Print_ISBN
    978-1-4244-4500-4
  • Electronic_ISBN
    1520-5363
  • Type
    conf
  • DOI
    10.1109/ICDAR.2009.214
  • Filename
    5277685
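
The final step described in the abstract, mapping layout ground truth from the synthetic image onto the scanned image, can be illustrated with a minimal sketch. It assumes the alignment step has already produced a 3x3 projective transform (homography) from synthetic to scanned pixel coordinates; this record does not state which transformation model the paper uses, so the model, the function names `apply_homography` and `map_box`, and the sample matrix are all hypothetical.

```python
# Sketch only: assumes the alignment is a 3x3 homography H (an assumption,
# not something stated in the abstract). All names are illustrative.

def apply_homography(H, x, y):
    """Map one point (x, y) through the 3x3 matrix H (row-major nested lists)."""
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    u = (H[0][0] * x + H[0][1] * y + H[0][2]) / w
    v = (H[1][0] * x + H[1][1] * y + H[1][2]) / w
    return u, v


def map_box(H, box):
    """Map an axis-aligned ground-truth box (x0, y0, x1, y1) from the
    synthetic image into the scanned image.

    All four corners are transformed, and the axis-aligned bounding box
    of the resulting quadrilateral is returned.
    """
    x0, y0, x1, y1 = box
    corners = [(x0, y0), (x1, y0), (x1, y1), (x0, y1)]
    mapped = [apply_homography(H, x, y) for x, y in corners]
    xs = [p[0] for p in mapped]
    ys = [p[1] for p in mapped]
    return min(xs), min(ys), max(xs), max(ys)


# Hypothetical transform: the scan is twice the resolution of the
# rendered image and shifted by (10, 5) pixels.
H_example = [
    [2.0, 0.0, 10.0],
    [0.0, 2.0, 5.0],
    [0.0, 0.0, 1.0],
]

scanned_box = map_box(H_example, (0, 0, 100, 50))
```

Returning the bounding box of the transformed corners keeps the ground truth in the same rectangle format as the input, at the cost of slightly enlarging regions when the scan is rotated or skewed; keeping the four mapped corners as a polygon would avoid that.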