Title :
On developing an effectual progressive sampling-based approach for association rule discovery
Author :
Umarani, V. ; Punithavalli, M.
Author_Institution :
Sri Ramakrishna Coll. of Arts & Sci. for Women, Coimbatore, India
Abstract :
Association rule discovery from large databases is one of the most challenging tasks in data mining. Frequent itemset mining, the first step in mining association rules, is a computation- and I/O-intensive process that requires repeated passes over the entire database. Sampling has often been suggested as an effective way to reduce the size of the dataset being processed, at some cost to accuracy. The data mining literature presents numerous sampling-based approaches to speed up Association Rule Mining (ARM). In our earlier research, we presented a progressive sampling-based approach for mining association rules from massive databases. In this article, we validate that approach under different empirical variations and analyze the results using synthetic datasets. The approach starts with an initial sample selection process based on the temporal characteristics and size of the database. Subsequently, the frequent itemsets and the negative border are mined from the initial sample using the Apriori algorithm. The patterns in the negative border are then sorted by support, and the midpoint itemset of the sorted negative border is scanned in different variations (sizes) of the database to check its frequency. If the support of the midpoint itemset is greater than the support threshold, the sample size is progressively increased. This process is repeated until an optimal sample size is reached, and association rules are then mined from that optimal sample. The empirical validation also yields the appropriate database size for conducting the midpoint itemset scan.
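The loop described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify how the initial sample is chosen from the temporal characteristics, how the sample grows, or how ties at the threshold are handled. The sketch therefore assumes that the sample is a prefix of the database in stored order, that it grows by a fixed hypothetical `step`, and that an itemset counts as frequent when its support is at least `min_sup`. All function and parameter names (`support`, `apriori_with_negative_border`, `progressive_sample_size`, `check_db`) are illustrative.

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    if not transactions:
        return 0.0
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def apriori_with_negative_border(transactions, min_sup):
    """Level-wise Apriori over a list of frozensets.

    Returns (frequent, border), each mapping a frozenset to its support.
    The border holds candidates that are infrequent in `transactions`
    although all of their proper subsets are frequent.
    """
    frequent, border = {}, {}
    items = sorted({i for t in transactions for i in t})
    candidates = [frozenset([i]) for i in items]
    k = 1
    while candidates:
        level = []
        for cand in candidates:
            sup = support(cand, transactions)
            if sup >= min_sup:
                frequent[cand] = sup
                level.append(cand)
            else:
                border[cand] = sup
        # Join frequent k-itemsets into (k+1)-candidates, then prune any
        # candidate that has an infrequent k-subset.
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        candidates = [
            c for c in joined
            if all(frozenset(sub) in frequent for sub in combinations(c, k - 1))
        ]
    return frequent, border


def progressive_sample_size(db, min_sup, initial_size, step, check_db=None):
    """Grow a prefix sample until the midpoint border itemset is infrequent.

    ASSUMPTIONS (not from the source): prefix sampling, fixed `step`,
    and `check_db` (default: the whole database) as the data in which
    the midpoint itemset is scanned.
    """
    assert step > 0, "step must be positive so the loop terminates"
    check_db = db if check_db is None else check_db
    size = min(initial_size, len(db))
    while True:
        _, border = apriori_with_negative_border(db[:size], min_sup)
        if not border or size >= len(db):
            return size
        # Sort the negative border by support; item names break ties.
        ranked = sorted(border, key=lambda c: (border[c], sorted(c)))
        midpoint = ranked[len(ranked) // 2]
        if support(midpoint, check_db) >= min_sup:
            size = min(size + step, len(db))  # sample missed a frequent itemset
        else:
            return size
```

Passing a smaller `check_db` corresponds to scanning the midpoint itemset in a reduced variation of the database, which is the parameter the empirical validation in the abstract varies.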
Keywords :
data mining; input-output programs; ARM; Apriori algorithm; I/O intensive process; association rule discovery; association rule mining; effectual progressive sampling; numerous sampling; Art; Association rules; Computer science; Costs; Educational institutions; Frequency; Itemsets; Sampling methods; Transaction databases; Apriori; Association Rule Mining (ARM); Frequent Patterns; Midpoint itemset; Negative border; Progressive sampling; Sampling; Temporal characteristics
Conference_Title :
Information Management and Engineering (ICIME), 2010 The 2nd IEEE International Conference on
Conference_Location :
Chengdu
Print_ISBN :
978-1-4244-5263-7
Electronic_ISBN :
978-1-4244-5265-1
DOI :
10.1109/ICIME.2010.5477643