Regional Information Center for Science and Technology (RICEST) - Flexible approach for web information extraction based on HTMLParser

DocumentCode :

568132

Title :

Flexible approach for web information extraction based on HTMLParser

Author :

Shan, Lin ; Qun, Zhang

Author_Institution :

Sch. of Comput. Sci., Hubei Univ. of Technol., Wuhan, China

fYear :

2012

fDate :

14-17 July 2012

Firstpage :

683

Lastpage :

686

Abstract :

The Internet now presents users with a huge amount of information, so extracting information quickly and effectively from varied sources has become very important. Web information extraction is a key element not only of Web crawlers and search engines but also of many specialized services, such as competitive intelligence tools. This paper proposes a flexible, high-performance approach to Web information extraction. HTMLParser is a parsing library used mainly to transform HTML pages or extract information from them. It represents an HTML page using Node, Abstract Node, and Tag, and it extracts information in two main ways: filters and visitors. With HTMLParser, hyperlinks, email addresses, titles, and similar elements can be extracted conveniently. The paper also extends HTMLParser to extract custom tags from certain Web pages, broadening its range of application. Experimental results confirm the feasibility of the approach.
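The callback style of extraction mentioned in the abstract can be illustrated with a minimal sketch. It does not use the Java HTMLParser library the paper describes; it assumes Python's standard-library `html.parser.HTMLParser` class, which happens to share the name and which calls a handler method for each tag it meets, much as a visitor would. The class name `LinkTitleExtractor`, the helper `extract`, and the sample page are hypothetical and chosen only for illustration.

```python
from html.parser import HTMLParser  # Python stdlib class, not the Java HTMLParser library


class LinkTitleExtractor(HTMLParser):
    """Visitor-style extraction: the parser calls back on each tag it meets."""

    def __init__(self):
        super().__init__()
        self.links = []    # href values of ordinary hyperlinks
        self.emails = []   # addresses taken from mailto: links
        self.title = ""    # text found inside <title>
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; attrs is a list of (name, value) pairs.
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href is None:
                return
            if href.startswith("mailto:"):
                self.emails.append(href[len("mailto:"):])
            else:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Accumulate text only while inside the <title> element.
        if self._in_title:
            self.title += data


def extract(html):
    """Feed an HTML string to the extractor and return it with results filled in."""
    extractor = LinkTitleExtractor()
    extractor.feed(html)
    extractor.close()
    return extractor


# Hypothetical sample page used only to exercise the sketch.
SAMPLE = (
    "<html><head><title>Demo page</title></head><body>"
    '<a href="https://example.org/a">A</a>'
    '<a href="mailto:someone@example.org">Mail</a>'
    "</body></html>"
)

result = extract(SAMPLE)
print(result.title, result.links, result.emails)
```

Because the handler receives every tag name, the same pattern could in principle be pointed at a site-specific custom tag by testing for that name in `handle_starttag`; how the paper's own extension achieves this is not stated in the abstract.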

Keywords :

Internet; grammars; hypermedia markup languages; search engines; HTML Parser; Web crawler; Web information extraction; abstract node; parsing library; search engine; Crawlers; Data mining; HTML; Information filters; Matched filters; HTMLParser; custom tags; filter; information extraction; visitor

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Computer Science & Education (ICCSE), 2012 7th International Conference on

Conference_Location :

Melbourne, VIC

Print_ISBN :

978-1-4673-0241-8

Type :

conf

DOI :

10.1109/ICCSE.2012.6295166

Filename :

6295166

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=568132