Regional Information Center for Science and Technology - Efficient schema extraction from large XML documents

DocumentCode :

2134980

Title :

Efficient schema extraction from large XML documents

Author :

Yin Zhang ; Hua Zhou ; Junhui Liu ; Zhihong Liang ; Peng Duan

Author_Institution :

Key Lab. of Software Eng. of Yunnan Province, Yunnan Univ., Kunming, China

fYear :

2012

fDate :

16-18 Oct. 2012

Firstpage :

1255

Lastpage :

1260

Abstract :

Although the presence of a schema enables many optimizations for operations on XML documents, several studies have shown that many XML documents in practice either do not refer to a schema or refer to a syntactically incorrect one. It is therefore of utmost importance to provide tools and techniques that can automatically generate XML Schema Definitions (XSDs) from sets of sample documents. While previous work in this area has mostly focused on methods based on regular expressions, we examine the many inadequacies of that approach. We provide a theoretically complete algorithm that always infers the correct XSDs when a sufficiently large corpus of XML documents is available. In addition, XTree minimizes the time and main memory needed to extract the schema. Our approach has several advantages over known techniques: XTree scales to very large documents (beyond 1 GB) in both time and memory consumption; it is able to extract a general, complete, correct, minimal, and readable schema for complex documents; and it detects whether elements appear as a sequence or a choice. Experiments confirm these features and properties.

Keywords :

XML; information retrieval; storage management; tree data structures; XSD; XTree; automatic XML schema definition generation; complete algorithm; eXtensible Markup Language; large XML document corpus; memory consumption; schema extraction; time consumption; Automatic schema extraction; Large XML documents; XML schema

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Biomedical Engineering and Informatics (BMEI), 2012 5th International Conference on

Conference_Location :

Chongqing

Print_ISBN :

978-1-4673-1183-0

Type :

conf

DOI :

10.1109/BMEI.2012.6513057

Filename :

6513057

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=2134980