DocumentCode :
2248202
Title :
An information extraction system for heterogeneous Web source
Author :
Zhou, Ting ; Sun, Cheng-jie ; Lin, Lei ; Liu, Bing-quan
Author_Institution :
MOE-MS Key Lab. of Natural Language Process. & Speech, Harbin Inst. of Technol., Harbin, China
Volume :
6
fYear :
2010
fDate :
11-14 July 2010
Firstpage :
3287
Lastpage :
3292
Abstract :
Information extraction is the task of identifying information in texts and converting it into a predefined format. In this paper, we build an information integration system that focuses on information about computer science teachers in Chinese universities. The goal of the system is to automatically extract useful information from heterogeneous sources and reorganize it into a structured format. The system includes four main modules: a Web page retrieval module, a Web page structure classification module, an information extraction module, and an information updating module. We have successfully applied the system to 107 universities in China, which demonstrates the effectiveness of the proposed system.
Keywords :
Web design; data mining; information analysis; Chinese universities; Web page retrieval module; Web page structure classification module; computer science teachers; heterogeneous Web source; information extraction system; information updating module; Classification algorithms; Crawlers; Data mining; Educational institutions; Search engines; Support vector machines; Web pages; Information Extraction; Topical crawler; Web Mining; Web page structure classification;
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Machine Learning and Cybernetics (ICMLC), 2010 International Conference on
Conference_Location :
Qingdao
Print_ISBN :
978-1-4244-6526-2
Type :
conf
DOI :
10.1109/ICMLC.2010.5580698
Filename :
5580698