• DocumentCode
    2789733
  • Title
    A Fast Template-Based Approach to Automatically Identify Primary Text Content of a Web Page
  • Author
    Nguyen, Dat Quoc ; Nguyen, Dai Quoc ; Pham, Son Bao ; Bui, The Duy
  • Author_Institution
    Human Machine Interaction Lab., Vietnam Nat. Univ., Hanoi, Vietnam
  • fYear
    2009
  • fDate
    13-17 Oct. 2009
  • Firstpage
    232
  • Lastpage
    236
  • Abstract
    Search engines have become an indispensable tool for browsing information on the Internet. The user, however, is often annoyed by redundant results from irrelevant Web pages. One reason is that search engines also index non-informative blocks of Web pages, such as advertisements and navigation links. In this paper, we propose a fast algorithm called FastContentExtractor that automatically detects the main content blocks in a Web page by improving the ContentExtractor algorithm. By automatically identifying and storing templates that represent the structure of content blocks in a website, the content blocks of a new Web page from that website can be extracted quickly. The hierarchical order of the output blocks is also maintained, which guarantees that the extracted content blocks appear in the same order as in the original page.
  • Keywords
    Internet; Web sites; information retrieval; search engines; ContentExtractor algorithm; FastContentExtractor; Web page; Websites; fast template-based approach; information browsing; primary text content; templates storing; Data mining; Educational institutions; Humans; Knowledge engineering; Laboratories; Navigation; Systems engineering and theory; Web pages; template detection; web mining
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Knowledge and Systems Engineering, 2009. KSE '09. International Conference on
  • Conference_Location
    Hanoi
  • Print_ISBN
    978-1-4244-5086-2
  • Electronic_ISBN
    978-0-7695-3846-4
  • Type
    conf
  • DOI
    10.1109/KSE.2009.39
  • Filename
    5361702