DocumentCode :
2349975
Title :
Crawling programs for wrapper-based applications
Author :
Bertoli, Claudio ; Crescenzi, Valter ; Merialdo, Paolo
Author_Institution :
Università degli Studi Roma Tre, Dipartimento di Informatica e Automazione, Via della Vasca Navale, 79 - 00146 Rome, Italy
fYear :
2008
fDate :
13-15 July 2008
Firstpage :
160
Lastpage :
165
Abstract :
Many large web sites provide pages containing highly valuable data. To extract data from these pages, several methods and techniques have been developed to generate web wrappers, that is, programs that convert the data embedded in HTML pages into a structured format. These techniques ease the burden of writing applications that reuse data from the web. However, the generation of wrappers is just one of the ingredients needed for the development of such applications. A necessary yet underestimated task is that of developing programs for driving a crawler towards the pages that contain the target data. We present a method and an associated tool to support this activity. Our method relies on a data model whose constructs allow a designer to define an intensional description of the organization of data in a web site. Based on the model, we introduce the concepts of (i) intensional navigation, which represents an abstract description of the navigation to be performed to reach pages of interest, and (ii) extensional navigation, which represents the actual set of navigation paths (i.e., sequences of links to be followed) that lead to the target pages. The method is supported by a tool that infers an intensional navigation, i.e., the crawling program, from one sample extensional navigation. The tool, which has been developed as a Firefox plug-in, supports the designer in the task of defining and verifying the sample navigation and the inferred crawling program.
Keywords :
Costs; Crawlers; Data mining; Data models; HTML; Information retrieval; Navigation; Web page design; Wrapping; Writing;
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Information Reuse and Integration, 2008. IRI 2008. IEEE International Conference on
Conference_Location :
Las Vegas, NV, USA
Print_ISBN :
978-1-4244-2659-1
Electronic_ISBN :
978-1-4244-2660-7
Type :
conf
DOI :
10.1109/IRI.2008.4583023
Filename :
4583023