Title :
ITPilot: a toolkit for industrial-strength Web data extraction
Author :
Pan, Alberto ; Raposo, Juan ; Álvarez, Manuel ; Montoto, Paula ; Losada, José ; Hidalgo, Justo
Author_Institution :
A Coruna Univ., Spain
Abstract :
In recent years, many research systems have been proposed to perform data extraction and automation tasks on Web sources. Since most of today's Web sources are "human-readable" but not "machine-readable", these systems must address a number of difficult challenges, such as dealing with complex navigation sequences, extracting data from HTML pages and reacting to source changes. Denodo Corporation has developed ITPilot, an industrial-strength solution that allows complex "wrappers" for Web sources to be graphically generated and automatically maintained. This paper presents the architecture and the basic ideas "behind the scenes" in ITPilot.
Keywords :
Web sites; hypermedia markup languages; information retrieval; HTML page; ITPilot; Web data extraction; Web sources; Web wrapper; industrial-strength solution; Automation; Books; Computer architecture; Computer languages; Data mining; HTML; Java; Navigation; Web services; World Wide Web
Conference_Title :
Web Intelligence, 2005. Proceedings. The 2005 IEEE/WIC/ACM International Conference on
Print_ISBN :
0-7695-2415-X