DocumentCode
588761
Title
Information Extraction from Web Documents Based on Unranked Tree Automaton Inference
Author
Huang Zhaohua ; Yang Fan
Author_Institution
Sch. of Software, East China Jiao Tong Univ., Nanchang, China
fYear
2012
fDate
2-4 Nov. 2012
Firstpage
195
Lastpage
198
Abstract
Information extraction (IE) aims at extracting specific information from a collection of documents. Much previous work on IE from semi-structured documents (in XML or HTML) uses learning techniques based on strings. Some recent work converts the document to a ranked tree and uses tree automaton induction. This paper introduces an algorithm that induces an automaton directly from unranked trees. Experiments show that this gives the best results obtained so far for learning-based IE from semi-structured documents.
Keywords
Internet; XML; automata theory; document handling; inference mechanisms; information retrieval; learning (artificial intelligence); trees (mathematics); HTML; IE; Web documents; document collection; information extraction; learning techniques; ranked tree; semi-structured documents; tree automaton induction; unranked tree automaton inference; Multimedia communication; Security; (k,l)-contextual tree language; automaton; grammatical inference
fLanguage
English
Publisher
IEEE
Conference_Titel
Multimedia Information Networking and Security (MINES), 2012 Fourth International Conference on
Conference_Location
Nanjing
Print_ISBN
978-1-4673-3093-0
Type
conf
DOI
10.1109/MINES.2012.128
Filename
6405661