Regional Information Center for Science and Technology - An information extraction system for heterogeneous Web source

DocumentCode :

2248202

Title :

An information extraction system for heterogeneous Web source

Author :

Zhou, Ting ; Sun, Cheng-jie ; Lin, Lei ; Liu, Bing-quan

Author_Institution :

MOE-MS Key Lab. of Natural Language Process. & Speech, Harbin Inst. of Technol., Harbin, China

Volume :

fYear :

2010

fDate :

11-14 July 2010

Firstpage :

3287

Lastpage :

3292

Abstract :

Information Extraction is the task of identifying information in texts and converting it into a predefined format. In this paper, we build an information integration system that focuses on information about computer science teachers in Chinese universities. The goal of the system is to automatically extract useful information from heterogeneous sources and reorganize it into a structured format. The system includes four main modules: a web page retrieval module, a web page structure classification module, an information extraction module, and an information updating module. We have successfully applied the system to 107 universities in China, which demonstrates the effectiveness of the proposed system.

Keywords :

Web design; data mining; information analysis; Chinese universities; Web page retrieval module; Web page structure classification module; computer science teachers; heterogeneous Web source; information extraction system; information updating module; Classification algorithms; Crawlers; Data mining; Educational institutions; Search engines; Support vector machines; Web pages; Information Extraction; Topical crawler; Web Mining; Web page structure classification;

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Machine Learning and Cybernetics (ICMLC), 2010 International Conference on

Conference_Location :

Qingdao

Print_ISBN :

978-1-4244-6526-2

Type :

conf

DOI :

10.1109/ICMLC.2010.5580698

Filename :

5580698

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=2248202