  • DocumentCode
    1621079
  • Title
    Page Digest for large-scale Web services
  • Author
    Rocco, Daniel; Buttler, David; Liu, Ling
  • Author_Institution
    Coll. of Comput., Georgia Inst. of Technol., Atlanta, GA, USA
  • fYear
    2003
  • Firstpage
    381
  • Lastpage
    390
  • Abstract
    We introduce Page Digest, a mechanism for efficient storage and processing of Web documents. The Page Digest design encourages a clean separation of the structural elements of Web documents from their content. Its encoding transformation provides many of the advantages of traditional string digest schemes yet remains invertible without introducing significant additional cost or complexity. Using the Page Digest encoding can provide at least an order of magnitude speedup when traversing a Web document, compared with using a standard document object model implementation. Our experiments show that change detection using Page Digest operates in linear time, offering a 75% improvement in execution performance compared with existing systems. In addition, the Page Digest encoding can reduce the tag name redundancy found in Web documents, allowing a 30% to 50% reduction in document size.
  • Keywords
    Internet; abstracting; content management; document handling; information storage; HTML documents; Web document processing; Web document storage; change detection; content element; data management; document format; document layout; document object model implementation; document size reduction; encoding transformation; execution performance improvement; information collection; large scale Web service; linear time operation; magnitude speedup; page digest encoding; semantic information; string digest scheme; structural element separation; tag name redundancy; Costs; Educational institutions; Encoding; HTML; Knowledge management; Large-scale systems; Memory; Search engines; Web and internet services; Web services
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    E-Commerce, 2003. CEC 2003. IEEE International Conference on
  • Print_ISBN
    0-7695-1969-5
  • Type
    conf
  • DOI
    10.1109/COEC.2003.1210274
  • Filename
    1210274