DocumentCode :
568132
Title :
Flexible approach for web information extraction based on HTMLParser
Author :
Shan, Lin ; Qun, Zhang
Author_Institution :
Sch. of Comput. Sci., Hubei Univ. of Technol., Wuhan, China
fYear :
2012
fDate :
14-17 July 2012
Firstpage :
683
Lastpage :
686
Abstract :
Nowadays the Internet presents a huge amount of information to users. Extracting information quickly and effectively from various sources has become very important. Web information extraction is a key element not only of Web crawlers and search engines, but also of many specialized services such as competitive intelligence tools. This paper recommends a flexible and high-performance approach to Web information extraction. HTMLParser is a parsing library mainly used to transform or extract Web information from HTML. It uses Node, Abstract Node, and Tag to represent an HTML page. It extracts information mainly in two ways: filters and visitors. With HTMLParser, we can conveniently extract hyperlinks, email addresses, titles, etc. In this paper, we also extend HTMLParser to extract custom tags in certain Web pages, expanding its application area. Experimental results confirm the feasibility of the approach.
Keywords :
Internet; grammars; hypermedia markup languages; search engines; HTML Parser; Web crawler; Web information extraction; abstract node; parsing library; search engine; Crawlers; Data mining; HTML; Information filters; Matched filters; HTMLParser; custom tags; filter; information extraction; visitor
fLanguage :
English
Publisher :
IEEE
Conference_Title :
Computer Science & Education (ICCSE), 2012 7th International Conference on
Conference_Location :
Melbourne, VIC
Print_ISBN :
978-1-4673-0241-8
Type :
conf
DOI :
10.1109/ICCSE.2012.6295166
Filename :
6295166