Title :
Augmenting Historical Manuscripts with Automatic Hyperlinks
Author :
Wang, Xiaoyue ; Keogh, Eamonn
Author_Institution :
Dept. of Comput. Sci. & Eng., Univ. of California Riverside, Riverside, CA, USA
Abstract :
Hyperlinks are so useful for searching and browsing modern digital collections that researchers have long wondered whether it is possible to retroactively add hyperlinks to digitized historical documents. There has already been significant research into this endeavor for historical text; however, in this work we consider the problem of adding hyperlinks among graphic elements. While such a system would not have the ubiquitous utility of text-based hyperlinks, as we will show, there are several domains where it can significantly augment textual information. While OCR of historical text is known to be a difficult problem, the actual words themselves are inherently discrete. Thus, two words are either identical or not. This means that off-the-shelf machine learning algorithms, including semi-supervised learning, can be easily used. However, as we shall demonstrate, semi-supervised learning does not work well with images, because we cannot expect binary matching decisions. Rather, we must deal with degrees of matching. In this work we make the novel observation that this "degree of matching" biases algorithms toward overly confident predictions about simple shapes. We show a simple technique for correcting this bias, and demonstrate through extensive experiments that our method significantly improves accuracy on diverse historical image collections.
Keywords :
learning (artificial intelligence); text analysis; automatic hyperlinks; binary matching decisions; degree of matching biased algorithms; digital collection browsing; digitized historical documents; diverse historical image collections; historical manuscript augmentation; off-the-shelf machine learning algorithms; semisupervised learning; text-based hyperlinks; textual information augmentation; ubiquitous utility; Books; Computer science; Euclidean distance; Graphics; Humans; Machine learning algorithms; Optical character recognition software; Semisupervised learning; Shape measurement; USA Councils; Historical Manuscripts; Hyperlinks; Semi-Supervised Learning;
Conference_Title :
Multimedia, 2009. ISM '09. 11th IEEE International Symposium on
Conference_Location :
San Diego, CA
Print_ISBN :
978-1-4244-5231-6
Electronic_ISBN :
978-0-7695-3890-7
DOI :
10.1109/ISM.2009.34