  • DocumentCode
    1806909
  • Title
    An Approach of Extracting Web Information Based on HTMLParser
  • Author
    Lin, Shan; Hu, Yanzhong
  • Author_Institution
    Sch. of Comput. Sci. & Technol., Hubei Inst. of Technol., Wuhan, China
  • fYear
    2010
  • fDate
    24-25 July 2010
  • Firstpage
    284
  • Lastpage
    287
  • Abstract
    Many applications now need to analyze the detailed content of web pages, so extracting web information quickly and effectively has become very important. Web information is expressed primarily in HTML. HTMLParser is an open-source project hosted on SourceForge.net that can parse HTML in either a linear or a nested fashion. This paper analyzes the principle of extracting web information with HTMLParser and gives an approach to implementing web information extraction using the classes and methods that HTMLParser provides. Finally, it demonstrates the detailed extraction process with an example.
  • Keywords
    Internet; data handling; program compilers; HTMLParser; SourceForge.net project; Web information extraction; linear parsing; nested parsing; Data mining; Filtering theory; HTML; Information filters; Transforms; Web pages; filter; visitor
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Title
    2010 Second International Conference on Information Technology and Computer Science (ITCS)
  • Conference_Location
    Kiev
  • Print_ISBN
    978-1-4244-7293-2
  • Electronic_ISBN
    978-1-4244-7294-9
  • Type
    conf
  • DOI
    10.1109/ITCS.2010.76
  • Filename
    5557131