• DocumentCode
    2510907
  • Title

    Document Segmentation Using Pixel-Accurate Ground Truth

  • Author

    An, Chang ; Yin, Dawei ; Baird, Henry S.

  • Author_Institution
    Dept. of Comput. Sci. & Eng., Lehigh Univ., Bethlehem, PA, USA
  • fYear
    2010
  • fDate
    23-26 Aug. 2010
  • Firstpage
    245
  • Lastpage
    248
  • Abstract
    We compare methodologies for trainable document image content extraction using three ground-truth policies: loose, tight, and pixel-accurate. The goal is pixel-accurate segmentation of document images. Which ground-truth policy is best has been debated. "Loose" truth is obtained by sweeping rectangles to enclose entire text blocks, etc., and can be an efficient manual task. "Tight" truth requires more care, and more time, to enclose individual text lines. Pixel-accurate truth, in which only foreground pixels are labeled, can be obtained by applying the PARC PixLabeler tool; in our experience this tool was as quick to use as loose truthing. We compared the accuracy of all three truthing policies and report that tight truth supports higher accuracy than loose truth, and that pixel-accurate truth yields the highest accuracy of the three. We also experimented with morphological expansion of pixel-accurate truth, expanding sets of foreground pixels morphologically, and report that expanded pixel-accurate truth supports higher accuracy than unexpanded pixel-accurate truth.
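    The morphological expansion of pixel-accurate truth mentioned in the abstract can be illustrated with a minimal sketch. The abstract gives no structuring element, iteration count, or label encoding, so everything here is an assumption made for illustration: a 3x3 square structuring element, an integer label map with 0 as background, and the hypothetical function names `dilate` and `expand_labels`. It is not the authors' implementation.

```python
import numpy as np


def dilate(mask: np.ndarray, iterations: int = 1) -> np.ndarray:
    """Binary dilation with a 3x3 square structuring element (assumed shape)."""
    out = mask.astype(bool)
    h, w = out.shape
    for _ in range(iterations):
        # Pad with False so pixels outside the image never count as foreground.
        padded = np.pad(out, 1, mode="constant", constant_values=False)
        grown = np.zeros_like(out)
        # OR together the nine shifted copies of the mask.
        for dy in range(3):
            for dx in range(3):
                grown |= padded[dy:dy + h, dx:dx + w]
        out = grown
    return out


def expand_labels(labels: np.ndarray, background: int = 0,
                  iterations: int = 1) -> np.ndarray:
    """Grow each class's labeled pixels into neighboring background pixels.

    `labels` is an integer map: `background` marks unlabeled pixels and any
    other value is a content class (an assumed encoding).
    """
    expanded = labels.copy()
    for cls in np.unique(labels):
        if cls == background:
            continue
        grown = dilate(labels == cls, iterations)
        # Write only into pixels that are still background, so an
        # already-labeled pixel is never overwritten by another class.
        expanded[grown & (expanded == background)] = cls
    return expanded


if __name__ == "__main__":
    truth = np.zeros((5, 5), dtype=int)
    truth[2, 2] = 1  # a single labeled foreground pixel
    print(expand_labels(truth))
```

    With one iteration, each labeled pixel claims its eight neighbors as well. Where two classes compete for the same background pixel, this sketch gives it to the lower class value, which is an arbitrary tie-breaking choice.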
  • Keywords
    content-based retrieval; document image processing; image resolution; image segmentation; PARC PixLabeler tool; document image content extraction; document segmentation; pixel-accurate ground truth; Accuracy; Error analysis; Feature extraction; Image segmentation; Pixel; Text analysis; Training; document content extraction; iterated classification; layout analysis; pixel-accurate
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Pattern Recognition (ICPR), 2010 20th International Conference on
  • Conference_Location
    Istanbul
  • ISSN
    1051-4651
  • Print_ISBN
    978-1-4244-7542-1
  • Type

    conf

  • DOI
    10.1109/ICPR.2010.69
  • Filename
    5597584