Title :
An information extraction system for heterogeneous Web source
Author :
Zhou, Ting ; Sun, Cheng-jie ; Lin, Lei ; Liu, Bing-quan
Author_Institution :
MOE-MS Key Lab. of Natural Language Process. & Speech, Harbin Inst. of Technol., Harbin, China
Abstract :
Information extraction is the task of identifying information in text and converting it into a predefined format. In this paper, we build an information integration system that focuses on information about computer science teachers in Chinese universities. The goal of the system is to automatically extract useful information from heterogeneous sources and reorganize it into a structured format. The system comprises four main modules: a Web page retrieval module, a Web page structure classification module, an information extraction module, and an information updating module. We have successfully applied the system to 107 universities in China, which demonstrates the effectiveness of the proposed system.
Keywords :
Web design; data mining; information analysis; Chinese universities; Web page retrieval module; Web page structure classification module; computer science teachers; heterogeneous Web source; information extraction system; information updating module; Classification algorithms; Crawlers; Educational institutions; Search engines; Support vector machines; Web pages; Information Extraction; Topical crawler; Web Mining; Web page structure classification
Conference_Title :
Machine Learning and Cybernetics (ICMLC), 2010 International Conference on
Conference_Location :
Qingdao
Print_ISBN :
978-1-4244-6526-2
DOI :
10.1109/ICMLC.2010.5580698