• DocumentCode
    2167468
  • Title

    Information extraction from semi-structured and un-structured documents using probabilistic context free grammar inference

  • Author

    Thakur, Ramesh ; Jain, Suresh ; Chaudhari, Narendra S. ; Singhai, Rahul

  • Author_Institution
    Int. Inst. of Prof. Studies, Devi Ahilya Viswavidyalaya, Indore, India
  • fYear
    2012
  • fDate
    13-15 March 2012
  • Firstpage
    273
  • Lastpage
    276
  • Abstract
    A large number of research papers are available in unstructured (text) format. Knowledge discovery in unstructured documents has been recognized as a promising task. These documents are typically formatted for human viewing, and the formatting varies widely from document to document. Frequent changes in formatting make it difficult to construct a global schema, so discovering interesting rules from such documents is a complex and tedious process. Recently, conditional random fields (CRFs) and hand-coded wrappers have been used to label the text (such as Title, Author Name(s), Affiliation, Email, Contact number, etc. in research papers). In this paper we propose a novel hybrid approach to infer grammar rules using alignment similarity and probabilistic context-free grammar. It helps in extracting the desired information from the document.
  • Keywords
    context-free grammars; data mining; inference mechanisms; information retrieval; probability; text analysis; CRF; alignment similarity; conditional random fields; document formatting; grammar rules; hand-coded wrappers; hybrid approach; information extraction; knowledge discovery; probabilistic context free grammar inference; research papers; semistructured documents; text labeling; unstructured documents; unstructured text format; Abstracts; Context; Data mining; Feature extraction; Grammar; Inference algorithms; Probabilistic logic; Alignment profile; Information extraction; Knowledge discovery; Learning systems; grammar inference; sequence mining;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Information Retrieval & Knowledge Management (CAMP), 2012 International Conference on
  • Conference_Location
    Kuala Lumpur
  • Print_ISBN
    978-1-4673-1091-8
  • Type

    conf

  • DOI
    10.1109/InfRKM.2012.6204988
  • Filename
    6204988