• DocumentCode
    2850753
  • Title

    Hidden Markov Models and Text Classifiers for Information Extraction on Semi-Structured Texts

  • Author

    Barros, Flavia A. ; Silva, Eduardo F A ; Prudencio, Ricardo B. C. ; Filho, Valmir M. ; Nascimento, André C A

  • Author_Institution
    Center of Informatics, Federal University of Pernambuco, Recife
  • fYear
    2008
  • fDate
    10-12 Sept. 2008
  • Firstpage
    417
  • Lastpage
    422
  • Abstract
    Information extraction (IE) aims to extract from textual documents only the fragments that correspond to data fields required by the user. In this paper, we present new experiments evaluating a hybrid machine learning approach for IE that combines text classifiers and hidden Markov models (HMMs). In this approach, a text classifier generates an initial output, which is then refined by an HMM that takes into account dependencies in the order of the data to be extracted. The approach was evaluated on the task of extracting information from bibliographic references. Experiments performed on a corpus of 6000 references showed an improvement in performance over the benchmark IE approaches adopted in previous work.
  • Keywords
    bibliographic systems; hidden Markov models; information retrieval; learning (artificial intelligence); pattern classification; text analysis; bibliographic references; hidden Markov models; hybrid machine learning approach; information extraction; semistructured texts; text classifiers; textual documents extraction; Data mining; Hidden Markov models; Hybrid intelligent systems; Informatics; Information retrieval; Machine learning; Proposals; Text categorization; Web search; Web sites; Hidden Markov Models; Information Extraction; Text Classifiers;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Hybrid Intelligent Systems, 2008. HIS '08. Eighth International Conference on
  • Conference_Location
    Barcelona
  • Print_ISBN
    978-0-7695-3326-1
  • Electronic_ISBN
    978-0-7695-3326-1
  • Type

    conf

  • DOI
    10.1109/HIS.2008.63
  • Filename
    4626665
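
The refinement step described in the abstract (a classifier labels each token, then an HMM corrects the labels using field order) might be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the three field names, every probability value, and the choice to treat the classifier's labels as HMM observations decoded with the Viterbi algorithm are all hypothetical.

```python
import math

# Hypothetical field labels; the paper's actual label set is not given in this record.
STATES = ["author", "title", "year"]

# Assumed probabilities, invented for illustration only.
START = {"author": 0.8, "title": 0.1, "year": 0.1}
TRANS = {
    "author": {"author": 0.5, "title": 0.45, "year": 0.05},
    "title": {"author": 0.05, "title": 0.7, "year": 0.25},
    "year": {"author": 0.05, "title": 0.05, "year": 0.9},
}


def emission(state, observed):
    """Assumed P(classifier label | true field): the classifier is right 80% of the time."""
    return 0.8 if state == observed else 0.1


def refine(initial_labels):
    """Viterbi decoding: most likely true field sequence given the classifier's labels."""
    if not initial_labels:
        return []
    # score[s]: best log-probability of any path ending in state s at the current token
    score = {
        s: math.log(START[s]) + math.log(emission(s, initial_labels[0]))
        for s in STATES
    }
    backpointers = []
    for observed in initial_labels[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            best_prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = (
                score[best_prev]
                + math.log(TRANS[best_prev][s])
                + math.log(emission(s, observed))
            )
            pointers[s] = best_prev
        score = new_score
        backpointers.append(pointers)
    # Trace the best path backwards from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Token-level labels from a hypothetical classifier; the fourth one is a likely error.
initial = ["author", "author", "title", "author", "title", "title", "year"]
refined = refine(initial)
print(refined)
```

With these assumed numbers, the isolated "author" label in the middle of the title is relabeled as "title", because a title-to-author-to-title jump is far less probable than a single classifier mistake.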