Regional Information Center for Science and Technology - Extracting company information from the web

DocumentCode :

2581191

Title :

Extracting company information from the web

Author :

Lam, Man I. ; Gong, Zhiguo ; Guo, Jingzhi

Author_Institution :

Fac. of Sci. & Technol., Univ. of Macau, Macao, China

fYear :

2009

fDate :

11-14 Oct. 2009

Firstpage :

3640

Lastpage :

3645

Abstract :

As the World Wide Web becomes the most important information repository, an increasing amount of information is available. Currently, web search engines provide only document-oriented search. To make full use of information on the web, effective and efficient extraction algorithms are needed. In this paper, existing work is surveyed first. Then our current technique for web information extraction is discussed in detail. In our approach, rules and patterns are extracted from sample pages through a training process with human involvement. We use both keywords and regular expressions to represent rules and patterns in our system. The keywords work as anchors to locate the positions of potential information, and the regular expressions validate the values. In our system, all extracted information is represented in XML format.
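
The keyword-anchor and regular-expression idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' system: the rule table, field names, anchor keywords, regular expressions, window size, and sample text below are all hypothetical, chosen only to show a keyword locating a candidate position, a regular expression validating the value, and the result being serialized as XML.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical rules (not from the paper): each field has anchor keywords
# that locate a candidate position and a regex that validates the value.
RULES = {
    "phone": {
        "anchors": ["Tel", "Phone"],
        "pattern": re.compile(r"\+?\d[\d\s()-]{6,}\d"),
    },
    "email": {
        "anchors": ["Email", "E-mail"],
        "pattern": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    },
}


def extract(text, rules=RULES, window=60):
    """Return {field: value} for every field whose anchor and pattern match."""
    found = {}
    for field, rule in rules.items():
        for anchor in rule["anchors"]:
            # Step 1: the keyword acts as an anchor in the page text.
            hit = re.search(re.escape(anchor), text, flags=re.IGNORECASE)
            if hit is None:
                continue
            # Step 2: only the text shortly after the anchor is considered.
            candidate = text[hit.end():hit.end() + window]
            # Step 3: the regular expression validates the candidate value.
            value = rule["pattern"].search(candidate)
            if value:
                found[field] = value.group(0).strip()
                break
    return found


def to_xml(found, root_tag="company"):
    """Serialize the extracted fields as a flat XML record."""
    root = ET.Element(root_tag)
    for field, value in found.items():
        ET.SubElement(root, field).text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical page text, already stripped of HTML markup.
sample = "Acme Ltd. Tel: +853 1234 5678 | Email: info@acme.example"
record = extract(sample)
xml_text = to_xml(record)
print(xml_text)
```

In this sketch the rules are written by hand; the abstract states that rules and patterns are instead derived from sample pages through a training process with human involvement, which is not modeled here.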

Keywords :

Internet; human factors; information retrieval; Web search engines; World Wide Web; XML format; company information extraction; human involvements; information repository; keywords; training process; Cybernetics; Data mining; Database languages; HTML; Humans; Internet; Search engines; Web pages; Web sites; XML;

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Systems, Man and Cybernetics, 2009. SMC 2009. IEEE International Conference on

Conference_Location :

San Antonio, TX

ISSN :

1062-922X

Print_ISBN :

978-1-4244-2793-2

Electronic_ISBN :

1062-922X

Type :

conf

DOI :

10.1109/ICSMC.2009.5346863

Filename :

5346863

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=2581191