• DocumentCode
    1605165
  • Title

    A system for the identification of multi-page Web documents

  • Author

    Sweet, J. ; Harrington, S. ; Jones, R. Price ; Savakis, A. ; Naveda, F. ; Roetling, P.

  • Author_Institution
    Xerox Corp., Webster, NY, USA
  • Volume
    2
  • fYear
    2004
  • Firstpage
    887
  • Abstract
    When a World Wide Web (WWW) document spans multiple Web pages, it is often inconvenient to print or download the entire document using available tools. A two-phase iterative approach has been developed for the automated identification of pages residing within the same document boundary, given a starting uniform resource locator (URL) as input. This system was applied to a test suite of 98 Web documents, and the results were compared to a ground truth document boundary in each case. Using a set intersection metric, an overall success rate of 73% was achieved. This is a significant improvement over existing tools, which are not fully automated and achieve a success rate of only 61% even with user assistance.
  • Keywords
    Web sites; document handling; URL; WWW; Web pages; World Wide Web; automated identification; multi-page Web documents; two-phase iterative approach; uniform resource locator; Citation analysis; Iterative algorithms; Iterative methods; Navigation; System testing; Text analysis; Uniform resource locators
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Electrical and Computer Engineering, 2004. Canadian Conference on
  • ISSN
    0840-7789
  • Print_ISBN
    0-7803-8253-6
  • Type

    conf

  • DOI
    10.1109/CCECE.2004.1345257
  • Filename
    1345257