Regional Information Center for Science and Technology - Gathering metadata from Web-based repositories of historical publications

DocumentCode :

2276576

Title :

Gathering metadata from Web-based repositories of historical publications

Author :

Sanz, Ismael ; Berlanga, Rafael ; Aramburu, Maria Jose

Author_Institution :

Dept. d'Inf., Univ. Jaume I, Castellon, Spain

fYear :

1998

fDate :

25-28 Aug 1998

Firstpage :

473

Lastpage :

478

Abstract :

Building digital libraries from Internet-accessible document repositories is a challenging task, due to the current mismatch between the desired DBMS-like capabilities of the former and the schemeless HTML files stored in web sites. In order to address this problem, we propose a distributed architecture for the extraction of metadata from WWW documents, especially suited for repositories of historical publications such as newspapers. In this paper we present an information extraction system based on semi-structured data analysis. Starting from several combinations of the HTML styles that abstract the visual characteristics of documents, the proposed system infers the logical structure and attributes of HTML texts. Additionally, by using context-free grammars, the system extracts the overall web structure of the repositories. The system output is a metadata object that contains a concise representation of the corresponding publication and its components.
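The idea of inferring logical structure from combinations of HTML styles can be illustrated with a minimal sketch. It is not the system described in the paper: the rule table `STYLE_RULES`, the role names, the class `StyleExtractor`, and the function `extract_metadata` are all hypothetical, and the sample markup is invented for illustration. The sketch only assumes Python's standard `html.parser` module.

```python
from html.parser import HTMLParser

# Hypothetical rules: a combination of enclosing style tags maps to a
# logical role. These are illustrative assumptions, not the paper's rules.
STYLE_RULES = {
    frozenset({"h1"}): "headline",
    frozenset({"b", "i"}): "byline",
    frozenset({"p"}): "body",
}


class StyleExtractor(HTMLParser):
    """Records each text fragment with the role implied by its enclosing tags."""

    def __init__(self):
        super().__init__()
        self.open_tags = []   # stack of currently open tags
        self.fragments = []   # (role, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Remove the most recent matching open tag; tolerate sloppy markup.
        for i in range(len(self.open_tags) - 1, -1, -1):
            if self.open_tags[i] == tag:
                del self.open_tags[i]
                break

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        role = STYLE_RULES.get(frozenset(self.open_tags), "unknown")
        self.fragments.append((role, text))


def extract_metadata(html):
    """Return a dict mapping each logical role to the list of its texts."""
    parser = StyleExtractor()
    parser.feed(html)
    parser.close()
    metadata = {}
    for role, text in parser.fragments:
        metadata.setdefault(role, []).append(text)
    return metadata


sample = (
    "<h1>Flood in Castellon</h1>"
    "<b><i>By a staff writer</i></b>"
    "<p>Heavy rain fell overnight.</p>"
)
print(extract_metadata(sample))
```

The resulting dictionary plays the role of a very small "metadata object": text grouped by inferred logical component. Extracting the link structure of a whole repository with context-free grammars, as the abstract mentions, is outside the scope of this sketch.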

Keywords :

Internet; information retrieval systems; knowledge acquisition; HTML files; WWW documents; Web-based repositories; context-free grammars; distributed architecture; document repositories; information extraction system; metadata; metadata object; Crawlers; Data mining; Electronic switching systems; HTML; Information retrieval; Information systems; Postal services; Software libraries; World Wide Web

fLanguage :

English

Publisher :

IEEE

Conference_Titel :

Database and Expert Systems Applications, 1998. Proceedings. Ninth International Workshop on

Conference_Location :

Vienna

Print_ISBN :

0-8186-8353-8

Type :

conf

DOI :

10.1109/DEXA.1998.707442

Filename :

707442

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=2276576