DocumentCode
1621079
Title
Page Digest for large-scale Web services
Author
Rocco, Daniel ; Buttler, David ; Liu, Ling
Author_Institution
Coll. of Comput., Georgia Inst. of Technol., Atlanta, GA, USA
fYear
2003
Firstpage
381
Lastpage
390
Abstract
We introduce Page Digest, a mechanism for efficient storage and processing of Web documents. The Page Digest design encourages a clean separation of the structural elements of Web documents from their content. Its encoding transformation produces many of the advantages of traditional string digest schemes yet remains invertible without introducing significant additional cost or complexity. Using the Page Digest encoding can provide at least an order of magnitude speedup when traversing a Web document, compared with a standard document object model implementation. Our experiments show that change detection using Page Digest operates in linear time, offering a 75% improvement in execution performance compared with existing systems. In addition, the Page Digest encoding can reduce the tag name redundancy found in Web documents, allowing a 30% to 50% reduction in document size.
Keywords
Internet; abstracting; content management; document handling; information storage; HTML documents; Web document processing; Web document storage; change detection; content element; data management; document format; document layout; document object model implementation; document size reduction; encoding transformation; execution performance improvement; information collection; large scale Web service; linear time operation; magnitude speedup; page digest encoding; semantic information; string digest scheme; structural element separation; tag name redundancy; Costs; Educational institutions; Encoding; HTML; Knowledge management; Large-scale systems; Memory; Search engines; Web and internet services; Web services;
fLanguage
English
Publisher
IEEE
Conference_Titel
E-Commerce, 2003. CEC 2003. IEEE International Conference on
Print_ISBN
0-7695-1969-5
Type
conf
DOI
10.1109/COEC.2003.1210274
Filename
1210274