DocumentCode
3195875
Title
Automatic repairing of Web wrappers by combining redundant views
Author
Chidlovskii, Boris
Author_Institution
Xerox Res. Centre Eur., Meylan, France
fYear
2002
fDate
2002
Firstpage
399
Lastpage
406
Abstract
We address the problem of automatic maintenance of Web wrappers, which data integration systems use to encapsulate access to Web information providers. Maintaining Web wrappers is critical because providers often change the page format and/or structure, making wrappers inoperable. The solution we propose extends the conventional wrapper architecture with a novel component for automatic maintenance and recovery. We treat automatic recovery as a special type of classification problem and use ensemble methods of machine learning to build alternative views of provider pages. We combine the extraction rules of conventional wrappers with content features of the extracted information to achieve accurate recovery from three types of format changes, namely content, context, and structural changes. We report recovery performance results for format changes at widely used Web providers.
Keywords
Web sites; data mining; learning (artificial intelligence); pattern classification; search engines; Web wrapper repairing; automatic recovery; content classification; context extraction rules; data integration systems; information extraction recovery; machine learning; page format; Application software; Electronic switching systems; Europe; HTML; Humans; Information resources; Maintenance; Transducers
fLanguage
English
Publisher
IEEE
Conference_Titel
Tools with Artificial Intelligence, 2002. (ICTAI 2002). Proceedings. 14th IEEE International Conference on
ISSN
1082-3409
Print_ISBN
0-7695-1849-4
Type
conf
DOI
10.1109/TAI.2002.1180831
Filename
1180831