Title :
Accurate and Efficient HTML Differencing
Author :
Mikhaiel, Rimon ; Stroulia, Eleni
Author_Institution :
Dept. of Comput. Sci., Alberta Univ., Edmonton, Alta.
Abstract :
Recognizing the differences between subsequent versions of HTML documents is an important problem. It is useful for managers of multi-authored Web sites who need to review and approve the changes to their Web-site content. It is also necessary for users who want to easily recognize changes to the pages they visit regularly. Comparing HTML documents at the lexical level, as if they were regular text documents, is neither informative nor intuitive. Instead, their internal tree structure has to be taken into account. In this paper, we discuss VDiff, an algorithm we have developed for HTML differencing, based on the Zhang-Shasha tree-edit distance algorithm. Our algorithm reports which nodes in the two compared documents match, which have been deleted from the original document or inserted in the subsequent document, and which have been moved in the HTML structure. We have evaluated the accuracy and performance of our algorithm with a case study.
Keywords :
Web sites; content management; hypermedia markup languages; text analysis; tree data structures; HTML differencing; HTML document versions; HTML structure; VDiff algorithm; Web site content; Zhang-Shasha tree-edit distance algorithm; document comparison; document matching; multiauthored Web sites; page changes; text documents; tree structure; Algorithm design and analysis; Content management; Crawlers; Data mining; HTML; Runtime; Testing; Tree data structures; Visualization; Wrapping;
Conference_Title :
Software Technology and Engineering Practice, 2005. 13th IEEE International Workshop on
Conference_Location :
Budapest
Print_ISBN :
0-7695-2639-X
DOI :
10.1109/STEP.2005.7