DocumentCode
2149594
Title
Mathematical Formula Identification in PDF Documents
Author
Lin, Xiaoyan ; Gao, Liangcai ; Tang, Zhi ; Lin, Xiaofan ; Hu, Xuan
Author_Institution
Inst. of Comput. Sci. & Technol., Peking Univ., Beijing, China
fYear
2011
fDate
18-21 Sept. 2011
Firstpage
1419
Lastpage
1423
Abstract
Recognizing mathematical expressions in PDF documents is a new and important field in document analysis. It is quite different from extracting mathematical expressions from image-based documents. In this paper, we propose a novel method that combines rule-based and learning-based approaches to detect both isolated and embedded mathematical expressions in PDF documents. Moreover, various features of formulas, including geometric layout, character, and context content, are used to adapt to a wide range of formula types. Experimental results show satisfactory performance of the proposed method. Furthermore, the method has been successfully incorporated into a commercial software package for large-scale Chinese e-Book production.
Keywords
document image processing; electronic publishing; knowledge based systems; learning (artificial intelligence); software packages; PDF documents; commercial software package; context content; document analysis; geometric layout; image based document; large-scale Chinese e-book production; learning based method; mathematical expression; mathematical formula identification; rule based method; Character recognition; Context; Feature extraction; Layout; Portable document format; Support vector machines; Text analysis; PDF document; formula extraction; mathematical expression recognition;
fLanguage
English
Publisher
IEEE
Conference_Titel
Document Analysis and Recognition (ICDAR), 2011 International Conference on
Conference_Location
Beijing
ISSN
1520-5363
Print_ISBN
978-1-4577-1350-7
Electronic_ISBN
1520-5363
Type
conf
DOI
10.1109/ICDAR.2011.285
Filename
6065544