DocumentCode :
1843215
Title :
Empirical evaluation of active sampling for CRF-based analysis of pages
Author :
Ohta, Manabu ; Inoue, Ryohei ; Takasu, Atsuhiro
Author_Institution :
Okayama Univ., Okayama, Japan
fYear :
2010
fDate :
4-6 Aug. 2010
Firstpage :
13
Lastpage :
18
Abstract :
We propose an automatic method of extracting bibliographies for academic articles scanned with OCR markup. The method uses conditional random fields (CRF) for labeling serially OCR-ed text lines on an article's title page as appropriate names for bibliographic elements. Although we achieved excellent extraction accuracies for some Japanese academic journals, we needed a substantial amount of training data that had to be obtained through costly manual extraction of bibliographies from printed documents. Therefore, this paper reports an empirical evaluation of active sampling applied to the CRF-based extraction of bibliographies to reduce the amount of training data. We applied active sampling techniques to three academic journals published in Japan. The experiments revealed that the sampling strategy using the proposed criteria for selecting samples could reduce the amount of training data to less than half or even a third of those for two academic journals. This paper also reports the effect of pseudo-training data that were added to training.
Keywords :
academic libraries; bibliographies; digital libraries; document handling; probability; CRF-based analysis; Japanese academic journals; active sampling; bibliographic elements; bibliographies extraction; conditional random fields; empirical evaluation; Accuracy; Data mining; Hidden Markov models; Layout; Optical character recognition software; Training; Training data; Active sampling; Bibliography extraction; CRF; Digital library; OCR
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Information Reuse and Integration (IRI), 2010 IEEE International Conference on
Conference_Location :
Las Vegas, NV
Print_ISBN :
978-1-4244-8097-5
Type :
conf
DOI :
10.1109/IRI.2010.5558973
Filename :
5558973