Title :
Web Information Extraction Based on Hierarchical Model
Author :
Liu, Yaqing ; Chen, Rong ; Yang, Hong
Author_Institution :
Sch. of Inf. Sci. & Technol., Dalian Maritime Univ., Dalian, China
Abstract :
A hierarchical extraction model based on the hidden Markov model is proposed after analyzing existing algorithms used in the field of Web information extraction. We first annotate atom information items and compound information items in HTML documents and then use a bottom-up clustering method to build a DOM+ tree. Finally, we use the annotations of the atom information items and compound information items, together with the compound information items' paths in the DOM+ tree, to build the hierarchical extraction model. Experiments show that the hierarchical extraction model can yield better performance.
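The abstract does not give the model's states, parameters, or decoding procedure, so the following is only a minimal sketch of how a flat hidden Markov model can label a token sequence with information-item types using standard Viterbi decoding. The state names (`title`, `price`, `other`), the probabilities, and the `emission` and `viterbi` functions are all hypothetical and chosen for illustration; the sketch does not reproduce the paper's hierarchical model or its DOM+ tree.

```python
import math

# Hypothetical item types and toy probabilities (assumptions, not from the paper).
STATES = ["title", "price", "other"]
START = {"title": 0.5, "price": 0.1, "other": 0.4}
TRANS = {
    "title": {"title": 0.6, "price": 0.3, "other": 0.1},
    "price": {"title": 0.1, "price": 0.5, "other": 0.4},
    "other": {"title": 0.2, "price": 0.2, "other": 0.6},
}


def emission(state, token):
    """Toy emission likelihood based on crude surface features of the token."""
    is_number = token.replace(".", "", 1).isdigit()
    if state == "price":
        return 0.8 if is_number or token == "$" else 0.05
    if state == "title":
        return 0.6 if token[:1].isupper() else 0.1
    return 0.3


def viterbi(tokens):
    """Return the most likely state sequence for tokens (log-space Viterbi)."""
    if not tokens:
        return []
    # Scores for the first token: start probability times emission.
    scores = [{s: math.log(START[s]) + math.log(emission(s, tokens[0]))
               for s in STATES}]
    back = []  # back[i][s] = best predecessor of state s at position i + 1
    for tok in tokens[1:]:
        prev = scores[-1]
        cur, ptr = {}, {}
        for s in STATES:
            best = max(STATES, key=lambda p: prev[p] + math.log(TRANS[p][s]))
            cur[s] = (prev[best] + math.log(TRANS[best][s])
                      + math.log(emission(s, tok)))
            ptr[s] = best
        scores.append(cur)
        back.append(ptr)
    # Trace back from the best final state.
    path = [max(STATES, key=lambda s: scores[-1][s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return path


if __name__ == "__main__":
    print(viterbi(["Red", "Bicycle", "$", "120"]))
```

With these toy parameters, the tokens `Red Bicycle $ 120` decode to `title, title, price, price`. A hierarchical variant would presumably add a further level for compound items and their tree paths, but the abstract does not describe how.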
Keywords :
hidden Markov models; hypermedia markup languages; knowledge acquisition; DOM+ tree; HTML documents; Web information extraction; atom information items; bottom-up clustering method; compound information items; hidden Markov model; hierarchical extraction model; Data mining; Database languages; HTML; Hidden Markov models; Induction generators; Information science; Mathematical model; Search engines; Web pages; World Wide Web
Conference_Title :
Computational Intelligence and Software Engineering, 2009. CiSE 2009. International Conference on
Conference_Location :
Wuhan
Print_ISBN :
978-1-4244-4507-3
Electronic_ISBN :
978-1-4244-4507-3
DOI :
10.1109/CISE.2009.5365870