• DocumentCode
    588761
  • Title

    Information Extraction from Web Documents Based on Unranked Tree Automaton Inference

  • Author

    Huang Zhaohua ; Yang Fan

  • Author_Institution
    Sch. of Software, East China Jiao Tong Univ., Nanchang, China
  • fYear
    2012
  • fDate
    2-4 Nov. 2012
  • Firstpage
    195
  • Lastpage
    198
  • Abstract
    Information extraction (IE) aims at extracting specific information from a collection of documents. A lot of previous work on IE from semi-structured documents (in XML or HTML) uses learning techniques based on strings. Some recent work converts the document to a ranked tree and uses tree automaton induction. This paper introduces an algorithm that uses unranked trees to induce an automaton. Experiments show that this gives the best results obtained so far for IE from semi-structured documents based on learning.
  • Keywords
    Internet; XML; automata theory; document handling; inference mechanisms; information retrieval; learning (artificial intelligence); trees (mathematics); HTML; IE; Web documents; XML; document collection; information extraction; learning techniques; ranked tree; semi structured documents; tree automaton induction; unranked tree automaton inference; Multimedia communication; Security; (k, l)-contextual tree language; automaton; grammatical inference; information extraction
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Multimedia Information Networking and Security (MINES), 2012 Fourth International Conference on
  • Conference_Location
    Nanjing
  • Print_ISBN
    978-1-4673-3093-0
  • Type

    conf

  • DOI
    10.1109/MINES.2012.128
  • Filename
    6405661