DocumentCode
1945295
Title
Deep Web Navigation in Web Data Extraction
Author
Baumgartner, Robert ; Ledermüller, G.
Author_Institution
Lixto Software GmbH, Technische Univ. Wien, Vienna
Volume
2
fYear
2005
fDate
28-30 Nov. 2005
Firstpage
698
Lastpage
703
Abstract
In the literature, data extraction techniques for HTML and semi-structured data in general have been studied exhaustively, and a number of automatic and semi-automatic approaches have been proposed. However, in real-life scenarios data extraction capabilities are only half of the game. Password-protected sites, cookies, non-HTML data formats, JavaScript, session IDs, Web form iterations and dynamic changes on Web sites are the obstacles that make Web data extraction difficult in real-life application scenarios. We propose, based on current Lixto technology, a novel approach that introduces action-based recording and replaying of Web navigation sequences and integrates it closely with extraction technologies. On the one hand, the technical innovation is the embedding of the Mozilla browser into the Lixto visual wrapper, which brings support for a large number of Web standards and an open-source API that permits close interaction of Lixto with Mozilla. On the other hand, we develop a navigation language and explore its close interaction with Elog, the extraction language of Lixto. Current research status and sample screenshots are given. The paper closes with a description of two application domains where deep Web navigation capabilities play a crucial role, namely automotive B2B Web platforms and business intelligence scenarios.
Keywords
Internet; application program interfaces; business data processing; competitive intelligence; information retrieval; online front-ends; JavaScript; Lixto technology; Lixto visual wrapper; Mozilla browser; Web data extraction; Web navigation; business intelligence; non-HTML data format; open-source API; password-protected sites; Automotive engineering; Data mining; HTML; Intelligent vehicles; Intrusion detection; Java; Navigation; Open source software; Technological innovation; Vehicle dynamics;
fLanguage
English
Publisher
IEEE
Conference_Titel
Computational Intelligence for Modelling, Control and Automation, 2005 and International Conference on Intelligent Agents, Web Technologies and Internet Commerce, International Conference on
Conference_Location
Vienna
Print_ISBN
0-7695-2504-0
Type
conf
DOI
10.1109/CIMCA.2005.1631550
Filename
1631550