Regional Information Center for Science and Technology - Automated metadata and instance extraction from news Web sites

DocumentCode :

2182860

Title :

Automated metadata and instance extraction from news Web sites

Author :

Vadrevu, Srinivas ; Nagarajan, Saravanakumar ; Gelgi, Fatih ; Davulcu, Hasan

Author_Institution :

Dept. of Comput. Sci. & Eng., Arizona State Univ., Tempe, AZ, USA

fYear :

2005

fDate :

19-22 Sept. 2005

Firstpage :

Lastpage :

Abstract :

Over the past few years, the World Wide Web has established itself as a vital resource for news. With the continuous growth in the number of available news Web sites and the diversity in how they present content, there is an increasing need to organize news-related information on the Web and keep track of it. In this paper, we present automated techniques for extracting metadata and instance information by organizing and mining a set of news Web sites. We develop algorithms that detect and utilize HTML regularities in Web documents to turn them into hierarchical semantic structures encoded as XML. The tree-mining algorithms that we present identify key domain concepts and their taxonomical relationships. We also extract semi-structured concept instances, annotated with their labels whenever labels are available. We report an experimental evaluation for the news domain to demonstrate the efficacy of our algorithms.

Keywords :

Web sites; XML; knowledge acquisition; meta data; HTML regularity; Web documents; World Wide Web; automated metadata; hierarchical semantic structure; metadata instance information; news Web site; tree-mining algorithm; Computer science; Data mining; HTML; Mediation; Organizing; Partitioning algorithms; Taxonomy; Web pages

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Web Intelligence, 2005. Proceedings. The 2005 IEEE/WIC/ACM International Conference on

Print_ISBN :

0-7695-2415-X

Type :

conf

DOI :

10.1109/WI.2005.38

Filename :

1517813

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=2182860