DocumentCode :
1919011
Title :
Poster: Digitization and Search: A Non-Traditional Use of HPC
Author :
Diesendruck, Liana ; Marini, Luigi ; Kooper, Rob ; Kejriwal, Mayank ; McHenry, Kenton
fYear :
2012
fDate :
10-16 Nov. 2012
Firstpage :
1462
Lastpage :
1462
Abstract :
We describe our efforts to provide a form of automated search of handwritten content for digitized document archives. To carry out the search, we use a computer vision technique called word spotting. A form of content-based image retrieval, it avoids the still-difficult task of directly recognizing text: the user searches with a query image containing handwritten text, and the system ranks a database of images by how similar their content looks to the query. Making this search capability available on an archive requires three computationally expensive pre-processing steps. We augment this automated portion of the process with a passive crowdsourcing element that mines queries from the system's users in order to improve the results of future queries. We benchmark the proposed framework on 1930s Census data, a collection of roughly 3.6 million forms and 7 billion individual units of information.
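The ranking idea behind word spotting can be illustrated with a minimal sketch. This is not the authors' pipeline, which the abstract does not detail; it assumes word images are already segmented and binarized, and it swaps in one classic choice of feature and distance (column ink profiles compared with dynamic time warping). All function names here are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

# Assumption: a word image is a grid of 0/1 pixels, where 1 marks ink.
Image = Sequence[Sequence[int]]


def column_profile(image: Image) -> List[float]:
    """Fraction of ink pixels in each column of a binary word image."""
    height = len(image)
    width = len(image[0])
    return [sum(row[x] for row in image) / height for x in range(width)]


def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Dynamic time warping cost between two 1-D profiles."""
    inf = float("inf")
    prev = [0.0] + [inf] * len(b)
    for i in range(1, len(a) + 1):
        curr = [inf] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three allowed warping moves.
            curr[j] = cost + min(prev[j], curr[j - 1], prev[j - 1])
        prev = curr
    return prev[len(b)]


def rank_by_similarity(
    query: Image, database: Dict[str, Image]
) -> List[Tuple[str, float]]:
    """Return (name, distance) pairs, most similar image first."""
    q = column_profile(query)
    scored = [
        (name, dtw_distance(q, column_profile(img)))
        for name, img in database.items()
    ]
    return sorted(scored, key=lambda pair: pair[1])


if __name__ == "__main__":
    query = [[1, 0, 1], [1, 0, 1]]
    database = {
        "same": [[1, 0, 1], [1, 0, 1]],
        "blank": [[0, 0, 0], [0, 0, 0]],
    }
    for name, dist in rank_by_similarity(query, database):
        print(name, dist)
```

Because no text is recognized, the query never needs to be transcribed; only image-to-image distances are computed. At archive scale these profiles would presumably be precomputed and indexed, which is where the expensive pre-processing steps mentioned above come in.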
Keywords :
Big Data; Digitization; Indexing Text;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
High Performance Computing, Networking, Storage and Analysis (SCC), 2012 SC Companion
Conference_Location :
Salt Lake City, UT
Print_ISBN :
978-1-4673-6218-4
Type :
conf
DOI :
10.1109/SC.Companion.2012.260
Filename :
6496043