• DocumentCode
    2418754
  • Title
    Metadata Extraction from Chinese Research Papers Based on Conditional Random Fields
  • Author
    Yu, Jiangde ; Fan, Xiaozhong
  • Author_Institution
    Beijing Inst. of Technol., Beijing
  • Volume
    1
  • fYear
    2007
  • fDate
    24-27 Aug. 2007
  • Firstpage
    497
  • Lastpage
    501
  • Abstract
    As more and more research papers appear on the Internet, it becomes increasingly important to accurately extract metadata from the headers and citations of research papers. In this paper, a method based on conditional random fields (CRFs) is proposed for automatic extraction of metadata from Chinese research papers. The key elements of this algorithm are parameter estimation and feature selection. We employ the L-BFGS algorithm for parameter estimation. We analyze three classes of features and perform feature induction. During processing, the method uses the format information of list separators and special labels to segment the text, and then applies CRFs to extract metadata from the papers. We compare the performance of CRF-based metadata extraction on English and Chinese datasets, and also compare the performance of two models, CRFs and the hidden Markov model (HMM), on Chinese datasets. Experimental results show that CRFs perform better than HMM.
  • Keywords
    feature extraction; hidden Markov models; meta data; text analysis; Chinese datasets; Chinese research papers; Internet; conditional random fields; feature induction; feature selection; hidden Markov model; metadata extraction; parameter estimation; Citation analysis; Computer science; Data mining; Educational institutions; Hidden Markov models; Internet; Parameter estimation; Particle separators; Performance analysis; World Wide Web
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Fuzzy Systems and Knowledge Discovery, 2007. FSKD 2007. Fourth International Conference on
  • Conference_Location
    Haikou
  • Print_ISBN
    978-0-7695-2874-8
  • Type
    conf
  • DOI
    10.1109/FSKD.2007.394
  • Filename
    4405975