• DocumentCode
    2802934
  • Title

    Augmenting Historical Manuscripts with Automatic Hyperlinks

  • Author

    Wang, Xiaoyue ; Keogh, Eamonn

  • Author_Institution
    Dept. of Comput. Sci. & Eng., Univ. of California Riverside, Riverside, CA, USA
  • fYear
    2009
  • fDate
    14-16 Dec. 2009
  • Firstpage
    571
  • Lastpage
    576
  • Abstract
    Hyperlinks are so useful for searching and browsing modern digital collections that researchers have long wondered whether it is possible to retroactively add hyperlinks to digitized historical documents. There has already been significant research into this endeavor for historical text; however, in this work we consider the problem of adding hyperlinks among graphic elements. While such a system would not have the ubiquitous utility of text-based hyperlinks, as we will show, there are several domains where it can significantly augment textual information. While OCR of historical text is known to be a difficult problem, the actual words themselves are inherently discrete. Thus, two words are either identical or not. This means that off-the-shelf machine learning algorithms, including semi-supervised learning, can be easily used. However, as we shall demonstrate, semi-supervised learning does not work well with images, because we cannot expect binary matching decisions. Rather, we must deal with degrees of matching. In this work we make the novel observation that this "degree of matching" biases algorithms to make overly confident predictions about simple shapes. We show a simple technique for correcting this bias, and demonstrate through extensive experiments that our method significantly improves accuracy on diverse historical image collections.
  • Keywords
    learning (artificial intelligence); text analysis; automatic hyperlinks; binary matching decisions; degree of matching biased algorithms; digital collection browsing; digitized historical documents; diverse historical image collections; historical manuscript augmentation; off-the-shelf machine learning algorithms; semisupervised learning; text-based hyperlinks; textual information augmentation; ubiquitous utility; Books; Computer science; Euclidean distance; Graphics; Humans; Machine learning algorithms; Optical character recognition software; Semisupervised learning; Shape measurement; USA Councils; Historical Manuscripts; Hyperlinks; Semi-Supervised Learning;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Multimedia, 2009. ISM '09. 11th IEEE International Symposium on
  • Conference_Location
    San Diego, CA
  • Print_ISBN
    978-1-4244-5231-6
  • Electronic_ISBN
    978-0-7695-3890-7
  • Type

    conf

  • DOI
    10.1109/ISM.2009.34
  • Filename
    5362526