Title :
Data File Layout Inference Using Content-Based Oracles
Author :
Phillips, Reid A.; Li, Wing-Ning; Thompson, Charlotte; Deneke, Wesley
Author_Institution :
Comput. Sci. & Comput. Eng. Dept., Univ. of Arkansas, Fayetteville, AR, USA
Abstract :
Data file layout inference refers to the problem of identifying the organizational characteristics of a structured text file, in which every record shares the same structural properties. These properties include character encoding, record length, field length (indicated by delimiting characters or fixed length), field position, and field semantic content. Within this paper, this information is referred to as the layout of a file. This structural layout information is required to extract, transform, and load files into workflows within various data warehouse and data mining applications. Although it is a common need, layout inference is a manual, labor-intensive process requiring human expertise whenever a file's layout is unavailable, miscommunicated, or changed. This paper proposes an automated methodology for solving the layout inference problem by discovering the metadata of a structured text file, and reports the results of a prototype system on real data files from a customer data integration and management application.
Keywords :
data mining; file organisation; inference mechanisms; character encoding; content-based Oracles; customer data integration; data file layout inference; data warehouse; extract-transform-load; field length; field position; field semantic content; organizational characteristic; record length; structural layout information; structured text file; Context; Encoding; Layout; Market research; Semantics; XML; combinatoric approach; content type; domain-specific software architecture; extract-transform-load (ETL); file layout inference; file processing; meta-data discovery; sampling
Conference_Title :
Computational Science and Engineering (CSE), 2013 IEEE 16th International Conference on
Conference_Location :
Sydney, NSW
DOI :
10.1109/CSE.2013.150