• DocumentCode
    2223604
  • Title

    Incremental Web Page Template Detection by Text Segments

  • Author

    Wang, Yu ; Fang, Bingxing ; Cheng, Xueqi ; Guo, Li ; Xu, Hongbo

  • Author_Institution
    Inst. of Comput. Technol., Chinese Acad. of Sci., Beijing
  • fYear
    2008
  • fDate
    14-15 July 2008
  • Firstpage
    174
  • Lastpage
    180
  • Abstract
    Template detection is important for many applications. Most template detection methods use content repetition as a hint to detect template blocks, so they require many Web pages as input. They therefore usually process Web pages in batches, and a newly crawled page cannot be processed until enough pages have been collected. This consumes a large amount of storage to cache Web pages and results in a long delay in data refreshing. In this paper, we present an incremental framework for template detection in which a page is processed as soon as it has been crawled. Under this framework, we don't need to cache any Web page. Experiments show that our framework consumes less than 7% of the storage used by traditional methods, and the delay in data refreshing induced by batch processing is completely eliminated.
  • Keywords
    Internet; text analysis; Web pages; incremental Web page template detection; text segments; Bars; Cache storage; Computers; Conferences; Degradation; Delay; Feeds; Navigation; Search engines
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Semantic Computing and Systems, 2008. WSCS '08. IEEE International Workshop on
  • Conference_Location
    Huangshan
  • Print_ISBN
    978-0-7695-3316-2
  • Electronic_ISBN
    978-0-7695-3316-2
  • Type

    conf

  • DOI
    10.1109/WSCS.2008.17
  • Filename
    4570835