Regional Information Center for Science and Technology - Schema inference and data extraction from templatized Web pages

DocumentCode :

702718

Title :

Schema inference and data extraction from templatized Web pages

Author :

Krishna, Shinde Santaji ; Dattatraya, Joshi Shashank

Author_Institution :

Dept. of Comput. Eng., Shri Jagdish Prasad Jhabarmal Tibrewala Univ., Vidyanagari, India

fYear :

2015

fDate :

8-10 Jan. 2015

Firstpage :

Lastpage :

Abstract :

The World Wide Web is a vast and rapidly growing source of information. A web data extraction system extracts data from web pages automatically. On many websites, most pages contain structured data, so an important step in Web information integration is to extract that information from the sites' Web documents. This paper presents an unsupervised approach to the page-level data extraction task that automatically detects the schema of web pages. Web pages are compared on the basis of visual clues to determine whether they follow a fixed or a variant template. Data regions are then extracted from the pages; if the pages belong to a fixed template, the schema is recognized by applying tree merging, tree alignment, and mining techniques. For heterogeneous template pages, a variant tree matching algorithm is used.
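
The fixed-template case mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: it assumes two pages that already share an identical tree shape, and the names `Node`, `merge`, `extract`, and the `#DATA` slot marker are hypothetical, chosen only for illustration. Text that is identical across pages is treated as template; text that differs is treated as data.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """A simplified DOM node (hypothetical structure, not from the paper)."""
    tag: str
    text: Optional[str] = None
    children: List["Node"] = field(default_factory=list)


def merge(a: Node, b: Node) -> Node:
    """Merge two same-shaped trees; text that differs becomes a '#DATA' slot."""
    if a.tag != b.tag or len(a.children) != len(b.children):
        raise ValueError("pages do not share a fixed template")
    text = a.text if a.text == b.text else "#DATA"
    merged_children = [merge(x, y) for x, y in zip(a.children, b.children)]
    return Node(a.tag, text, merged_children)


def extract(template: Node, page: Node) -> list:
    """Collect the page text found at every '#DATA' slot of the template."""
    values = [page.text] if template.text == "#DATA" else []
    for t, p in zip(template.children, page.children):
        values.extend(extract(t, p))
    return values


# Two toy pages generated from the same template.
page1 = Node("div", None, [Node("h1", "Price"), Node("span", "10")])
page2 = Node("div", None, [Node("h1", "Price"), Node("span", "25")])

template = merge(page1, page2)
data1 = extract(template, page1)
data2 = extract(template, page2)
```

Here the shared heading text stays in the template while the differing `span` text is marked as a data slot. Real pages rarely align this cleanly, which is why the abstract refers to tree alignment and a separate variant tree matching step for heterogeneous templates.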

Keywords :

Web sites; data mining; document handling; inference mechanisms; tree data structures; Web Information integration; Web data extraction system; Web documents; Websites; World Wide Web; fixed-variant template pages; heterogeneous template pages; mining techniques; schema inference; templatized Web pages; tree alignment; tree merging; variant tree matching algorithm; visual clues; Data mining; Merging; Noise; Peer-to-peer computing; Visualization; Web pages; Data Extraction; Multiple Tree Merging; Schema; Vision-based Page Segmentation; Web page;

fLanguage :

English

Publisher :

IEEE

Conference_Titel :

Pervasive Computing (ICPC), 2015 International Conference on

Conference_Location :

Pune

Type :

conf

DOI :

10.1109/PERVASIVE.2015.7087084

Filename :

7087084

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=702718