Title :
Bibliographic attribute extraction from erroneous references based on a statistical model
Author :
Takasu, Atsuhiro
Author_Institution :
Nat. Inst. of Informatics, Tokyo, Japan
Abstract :
We propose a method, based on an extended hidden Markov model, for extracting bibliographic attributes from reference strings captured using optical character recognition (OCR). Bibliographic attribute extraction can be used in two ways. One is reference parsing, in which attribute values are extracted from OCR-processed references for bibliographic matching. The other is reference alignment, in which attribute values are aligned to the bibliographic record to enrich the vocabulary of the bibliographic database. We first propose a statistical model for attribute extraction that represents both the syntactic structure of references and OCR error patterns. We then perform experiments using bibliographic references obtained from scanned images of papers in journals and transactions, and show that useful attribute values are extracted from OCR-processed references. We also show that the proposed model reduces the cost of preparing training data, a critical problem in rule-based systems.
Keywords :
bibliographic systems; citation analysis; digital libraries; error analysis; grammars; hidden Markov models; information retrieval; optical character recognition; string matching; very large databases; vocabulary; OCR; OCR error pattern; OCR-processed reference; bibliographic attribute extraction; bibliographic database; bibliographic matching; bibliographic reference string; extended hidden Markov model; reference alignment; reference parsing; reference syntactical structure; statistical model; training data set; Character recognition; Costs; Data mining; Hidden Markov models; Image databases; Optical character recognition software; Optical recording; Training data; Transaction databases; Vocabulary
Conference_Title :
Digital Libraries, 2003. Proceedings. 2003 Joint Conference on
Print_ISBN :
0-7695-1939-3
DOI :
10.1109/JCDL.2003.1204843