• DocumentCode
    3309679
  • Title

    Design and implementation of a web news extraction system

  • Author

    Hua-lin Xia ; Yang-sen Zhang

  • Author_Institution
    Inst. of Intell. Inf. Process., Beijing Inf. Sci. & Technol. Univ., Beijing, China
  • Volume
    3
  • fYear
    2011
  • fDate
    26-28 July 2011
  • Firstpage
    1793
  • Lastpage
    1797
  • Abstract
    With the widespread use of the Internet and the development of information technology, a tremendous amount of news information is now available. Quickly obtaining useful information from this huge volume of news is a crucial problem at present. Based on an analysis of the structure of news portal pages, this paper combines regular expressions with HTML-Parser, introduces a general method for automatically extracting news information, and implements an efficient general-purpose news information extraction system. The system not only correctly extracts headlines, release times, and text content, but also extracts news information relevant or similar to a given subject.
  • Keywords
    Internet; Web sites; grammars; hypermedia markup languages; information retrieval; information technology; portals; HTML; Web news extraction; information extraction system; information resource; news portal page; parser; regular expressions; Accuracy; Data mining; Encoding; Indexes; Information services; Pattern matching; Web pages; Content Page; Index Page; Information Extraction; Regular Expressions
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Fuzzy Systems and Knowledge Discovery (FSKD), 2011 Eighth International Conference on
  • Conference_Location
    Shanghai
  • Print_ISBN
    978-1-61284-180-9
  • Type

    conf

  • DOI
    10.1109/FSKD.2011.6019812
  • Filename
    6019812