Title :
Approximate matching for OCR-processed bibliographic data
Author :
Takasu, Atsuhiro ; Katayama, Norio ; Yamaoka, Masaki ; Iwaki, Osamu ; Oyama, Keizo ; Adachi, Jun
Author_Institution :
Res. & Dev. Dept., Nat. Center for Sci. Inf. Syst., Tokyo, Japan
Abstract :
This paper presents a method for matching bibliographic entries in the reference lists of academic papers, obtained as document images, against records in bibliographic databases. The main concern of the paper is handling the erroneous bibliographic data produced by a document understanding methodology. The presented method can find a candidate record set in the referral databases despite string errors by means of approximate matching, which is performed as exact matching of k substrings of length m chosen from the bibliographic strings in the references and in the databases. For OCR accuracy α, theoretical analysis shows that the accuracy of the presented method is 1-(1-α^m)^k, under the assumption that OCR errors occur randomly and independently in the string. The method is applied to the references of 187 Japanese articles and achieves an accuracy of 94.05%.
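The idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how the k substrings are chosen or how the database is indexed, so the evenly spaced offsets in `pick_substrings` and the linear scan in `candidate_records` are assumptions made only for illustration, and all function names are hypothetical. `expected_accuracy` evaluates the stated formula, read as the probability that at least one of k length-m substrings is free of OCR errors.

```python
def expected_accuracy(alpha: float, m: int, k: int) -> float:
    """Evaluate 1 - (1 - alpha^m)^k from the abstract.

    alpha^m is the chance that one length-m substring has no OCR error
    (errors assumed random and independent); the result is the chance
    that at least one of k such substrings is error-free.
    """
    return 1.0 - (1.0 - alpha ** m) ** k


def pick_substrings(text: str, m: int, k: int) -> list[str]:
    """Pick up to k substrings of length m at evenly spaced offsets.

    The spacing policy is an assumption; the abstract only says that
    k substrings of length m are "chosen" from the string.
    """
    if len(text) < m:
        return []
    last_start = len(text) - m
    if k == 1 or last_start == 0:
        return [text[:m]]
    # Integer arithmetic keeps the offsets deterministic; the set drops
    # duplicate offsets when the string is short.
    starts = sorted({i * last_start // (k - 1) for i in range(k)})
    return [text[s:s + m] for s in starts]


def candidate_records(ocr_text: str, records: list[str], m: int, k: int) -> list[str]:
    """Return records that contain at least one chosen substring exactly.

    A linear scan stands in for whatever index a real system would use.
    """
    keys = pick_substrings(ocr_text, m, k)
    return [r for r in records if any(key in r for key in keys)]


if __name__ == "__main__":
    records = [
        "Approximate matching for OCR-processed bibliographic data",
        "Image databases and text analysis",
    ]
    # "matchlng" simulates a single OCR character error.
    print(candidate_records("Approximate matchlng for OCR", records, m=6, k=4))
    print(expected_accuracy(0.9, 5, 3))
```

In this sketch, a larger m makes each exact match more selective but more likely to contain an OCR error, while a larger k raises the chance that at least one substring survives intact, which is the trade-off the formula expresses.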
Keywords :
bibliographic systems; optical character recognition; visual databases; Japanese articles; OCR-processed bibliographic data; academic papers; approximate matching; bibliographies; document images; referral databases; Character recognition; Data communication; Data mining; Image analysis; Image databases; Information systems; Information technology; Laboratories; Optical character recognition software; Text analysis;
Conference_Title :
Pattern Recognition, 1996., Proceedings of the 13th International Conference on
Conference_Location :
Vienna
Print_ISBN :
0-8186-7282-X
DOI :
10.1109/ICPR.1996.546933