DocumentCode
1806909
Title
An Approach of Extracting Web Information Based on HTMLParser
Author
Lin, Shan ; Hu, Yanzhong
Author_Institution
Sch. of Comput. Sci. & Technol., Hubei Inst. of Technol., Wuhan, China
fYear
2010
fDate
24-25 July 2010
Firstpage
284
Lastpage
287
Abstract
Many applications now need to analyze the detailed contents of web pages, so extracting web information quickly and effectively has become very important. Web information is primarily expressed in HTML. HTMLParser is an open-source project hosted on SourceForge.net that can parse HTML in either a linear or a nested fashion. This paper analyzes the principle of extracting web information with HTMLParser and gives an approach to implementing web information extraction using the classes and methods that HTMLParser provides. Finally, we demonstrate the detailed process of web information extraction through an example.
Keywords
Internet; data handling; program compilers; HTMLParser; SourceForge.net project; Web information extraction; linear parsing; nested parsing; Data mining; Filtering theory; HTML; Information filters; Transforms; Web pages; filter; visitor
fLanguage
English
Publisher
IEEE
Conference_Title
Information Technology and Computer Science (ITCS), 2010 Second International Conference on
Conference_Location
Kiev
Print_ISBN
978-1-4244-7293-2
Electronic_ISBN
978-1-4244-7294-9
Type
conf
DOI
10.1109/ITCS.2010.76
Filename
5557131