Title :
Using Shannon Entropy in ETL Processes
Author :
Balta, Marian ; Felea, Victor
Author_Institution :
Al. I. Cuza Univ., Iasi
Abstract :
ETL (extract, transform, and load) processes are responsible for extracting data from external sources, transforming it to satisfy integration and cleanliness requirements, and loading it into the data warehouse. In the data mining field, there is particular interest in using metrics to build efficient classification algorithms. One such approach uses metrics on partitions, based on the Shannon entropy, to study the degree of concentration of values. In this paper we show how this idea can be used to verify the consistency of data loaded into the data warehouse by ETL processes. We calculate the Shannon entropy and the Gini index on partitions induced by attribute sets, and we show that these values can signal a possible problem in the data extraction process. We also show that the choice of the set of attributes determining the partition can have a significant impact on the effectiveness of the method.
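The partition-based measures named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes the standard definitions, where a partition with block proportions p_i has Shannon entropy H = sum of p_i * log2(1/p_i) and Gini index G = 1 - sum of p_i^2. The function names, the sample rows, and the `tolerance` threshold are all hypothetical and chosen only for illustration.

```python
from collections import Counter
from math import log2


def partition_block_sizes(rows, attributes):
    """Group rows by their values on `attributes` and return the block sizes.

    `rows` is a list of dicts; `attributes` is the attribute set that
    induces the partition (rows with equal values fall in the same block).
    """
    keys = (tuple(row[a] for a in attributes) for row in rows)
    return list(Counter(keys).values())


def shannon_entropy(sizes):
    """Standard Shannon entropy (in bits) of a partition given its block sizes."""
    n = sum(sizes)
    return sum((s / n) * log2(n / s) for s in sizes if s > 0)


def gini_index(sizes):
    """Standard Gini index of a partition given its block sizes."""
    n = sum(sizes)
    return 1.0 - sum((s / n) ** 2 for s in sizes)


def deviates(reference, current, tolerance=0.1):
    """Hypothetical check: flag a load whose measure drifts from a reference.

    The threshold is an assumption; the abstract does not specify one.
    """
    return abs(current - reference) > tolerance


if __name__ == "__main__":
    # Hypothetical sample of loaded rows.
    loaded = [
        {"region": "north", "status": "ok"},
        {"region": "north", "status": "ok"},
        {"region": "south", "status": "ok"},
        {"region": "south", "status": "failed"},
    ]
    for attrs in (["region"], ["region", "status"]):
        sizes = partition_block_sizes(loaded, attrs)
        print(attrs, shannon_entropy(sizes), gini_index(sizes))
```

Running the measures over different attribute sets, as in the loop above, shows how the choice of attributes changes the partition and therefore the values being compared between loads.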
Keywords :
data analysis; entropy; ETL process; Gini index; Shannon entropy; classification algorithm; data consistency verification; data extraction; data mining; data warehouse; Classification algorithms; Computer science; Data analysis; Data mining; Data warehouses; Entropy; Load management; Partitioning algorithms; Scientific computing; Signal processing;
Conference_Title :
Symbolic and Numeric Algorithms for Scientific Computing, 2007. SYNASC. International Symposium on
Conference_Location :
Timisoara
Print_ISBN :
978-0-7695-3078-8
DOI :
10.1109/SYNASC.2007.41