DocumentCode :
2489781
Title :
Web Information Extraction Using Generalized Hidden Markov Model
Author :
Zhong, Ping ; Chen, Jinlin ; Cook, Terry
Author_Institution :
Dept. of Comput. Sci., City Univ. of New York, NY
fYear :
2006
fDate :
13-14 Nov. 2006
Firstpage :
1
Lastpage :
8
Abstract :
The hidden Markov model (HMM) is an important approach to information extraction (IE). When applied to Web IE, however, HMM-based approaches suffer from several problems because they do not consider Web-specific features. In this paper we present a generalized hidden Markov model (GHMM) that extends traditional HMMs by making use of Web-specific information for Web IE. In our approach we use Web content blocks, rather than terms, as the basic extraction unit. In addition, instead of using the traditional sequential state transition order, we determine the state transition order of the GHMM from the layout structure of the corresponding Web page. Furthermore, we use multiple emission features instead of a single emission feature. In this way the GHMM can better accommodate Web IE. Experiments show promising results compared with traditional HMM-based Web IE.
Keywords :
Internet; hidden Markov models; information retrieval; Web content; Web information extraction; Web page; Web-specific features; generalized hidden Markov model; layout analysis; sequential state transition order; Cities and towns; Computer science; Data mining; Electronic mail; Hidden Markov models; Humans; Information analysis; Internet; Web pages; Web sites; Hidden Markov Model; Information extraction; Layout Analysis; Web;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Hot Topics in Web Systems and Technologies, 2006. HOTWEB '06. 1st IEEE Workshop on
Conference_Location :
Boston, MA
Print_ISBN :
1-4244-0596-3
Electronic_ISBN :
1-4244-0596-3
Type :
conf
DOI :
10.1109/HOTWEB.2006.355271
Filename :
4178388