DocumentCode :
2223417
Title :
Learning knowledge bases for information extraction from multiple text based Web sites
Author :
Gao, Xiaoying ; Zhang, Mengjie
Author_Institution :
Sch. of Math. & Comput. Sci., Victoria Univ. of Wellington, New Zealand
fYear :
2003
fDate :
13-16 Oct. 2003
Firstpage :
119
Lastpage :
125
Abstract :
We describe a learning approach to automatically building knowledge bases for information extraction from multiple text-based Web pages. A frame-based representation is introduced to represent domain knowledge as knowledge unit frames, and a frame learning algorithm is developed to learn these frames automatically from training examples. Some training examples can be obtained by automatically parsing a number of tabular Web pages in the same domain, which greatly reduces time-consuming manual work. The approach was evaluated on ten Web sites of real estate and car advertisements; nearly all the information was successfully extracted, with very few false alarms. These results suggest that the knowledge unit frame representation and the frame learning algorithm both work well, that a domain-specific knowledge base can be learned from training examples, and that such a knowledge base can be used for information extraction from flexible, text-based, semi-structured Web pages on multiple Web sites.
Keywords :
Web sites; frame based representation; information retrieval; knowledge based systems; text analysis; automatic parsing; car advertisement; domain knowledge; domain specific knowledge base; frame learning algorithm; frame-based representation; information extraction; knowledge unit frame; learning knowledge bases; multiple text-based Web sites; real estate advertisement; tabular Web page; text based semi-structured Web page; Buildings; Data mining; Intelligent agent; Testing; Web pages;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Intelligent Agent Technology, 2003. IAT 2003. IEEE/WIC International Conference on
Print_ISBN :
0-7695-1931-8
Type :
conf
DOI :
10.1109/IAT.2003.1241057
Filename :
1241057