Title :
DeepWeb Navigation in Web Data Extraction
Author :
Baumgartner, Robert ; Ledermüller, G.
Author_Institution :
Lixto Software GmbH, Technische Univ. Wien, Vienna
Abstract :
In the literature, data extraction techniques for HTML and semi-structured data in general have been studied exhaustively, and a number of automatic and semi-automatic approaches have been proposed. However, in real-life scenarios data extraction capabilities are only half of the game. Password-protected sites, cookies, non-HTML data formats, JavaScript, session IDs, Web form iterations and dynamic changes on Web sites are the obstacles that make Web data extraction difficult in real-life application scenarios. We propose, based on current Lixto technology, a novel approach that introduces action-based recording and replaying of Web navigation sequences and its close integration with extraction technologies. On the one hand, the technical innovation is the embedding of the Mozilla browser into the Lixto visual wrapper, which brings support for a large number of Web standards and an open-source API that permits close interaction of Lixto with Mozilla. On the other hand, we develop a navigation language and explore its close interaction with Elog, the extraction language of Lixto. Current research status and sample screenshots are given. The paper closes with a description of two application domains where deep Web navigation capabilities play a crucial role, namely automotive B2B Web platforms and business intelligence scenarios.
Keywords :
Internet; application program interfaces; business data processing; competitive intelligence; information retrieval; online front-ends; JavaScript; Lixto technology; Lixto visual wrapper; Mozilla browser; Web data extraction; Web navigation; business intelligence; non-HTML data format; open-source API; password-protected sites; Automotive engineering; Data mining; HTML; Intelligent vehicles; Intrusion detection; Java; Navigation; Open source software; Technological innovation; Vehicle dynamics;
Conference_Title :
International Conference on Computational Intelligence for Modelling, Control and Automation, 2005, and International Conference on Intelligent Agents, Web Technologies and Internet Commerce
Conference_Location :
Vienna
Print_ISBN :
0-7695-2504-0
DOI :
10.1109/CIMCA.2005.1631550