• DocumentCode
    2149594
  • Title
    Mathematical Formula Identification in PDF Documents
  • Author
    Lin, Xiaoyan ; Gao, Liangcai ; Tang, Zhi ; Lin, Xiaofan ; Hu, Xuan
  • Author_Institution
    Inst. of Comput. Sci. & Technol., Peking Univ., Beijing, China
  • fYear
    2011
  • fDate
    18-21 Sept. 2011
  • Firstpage
    1419
  • Lastpage
    1423
  • Abstract
    Recognizing mathematical expressions in PDF documents is a new and important field in document analysis. It is quite different from extracting mathematical expressions in image-based documents. In this paper, we propose a novel method by combining rule-based and learning-based methods to detect both isolated and embedded mathematical expressions in PDF documents. Moreover, various features of formulas, including geometric layout, character and context content, are used to adapt to a wide range of formula types. Experimental results show satisfactory performance of the proposed method. Furthermore, the method has been successfully incorporated into a commercial software package for large-scale Chinese e-Book production.
  • Keywords
    document image processing; electronic publishing; knowledge based systems; learning (artificial intelligence); software packages; PDF documents; commercial software package; context content; document analysis; geometric layout; image based document; large-scale Chinese e-book production; learning based method; mathematical expression; mathematical formula identification; rule based method; Character recognition; Context; Feature extraction; Layout; Portable document format; Support vector machines; Text analysis; PDF document; formula extraction; mathematical expression recognition;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Document Analysis and Recognition (ICDAR), 2011 International Conference on
  • Conference_Location
    Beijing
  • ISSN
    1520-5363
  • Print_ISBN
    978-1-4577-1350-7
  • Electronic_ISBN
    1520-5363
  • Type
    conf
  • DOI
    10.1109/ICDAR.2011.285
  • Filename
    6065544