Abstract :
When large data repositories are coupled with the geographic distribution of data, users, and systems, different technologies must be combined to implement high-performance distributed knowledge discovery systems. At the same time, the computational grid is emerging as a very promising infrastructure for high-performance distributed computing. Grid applications in fields such as astronomy, chemistry, engineering, climate studies, geology, oceanography, ecology, physics, biology, health sciences, and computer science often involve large amounts of computation and/or data. For these reasons, we think grids can offer effective support for the implementation and use of parallel and distributed data mining systems. This paper describes the development of a parallel and distributed Apriori algorithm in a grid environment. The Apriori algorithm, along with FP-growth (frequent pattern growth), is implemented on each node of the grid network; each node finds the local support counts and prunes all infrequent item sets. After completing local pruning, each grid node broadcasts messages containing all the remaining frequent patterns to the coordinator. We have compared the output of the conventional Apriori algorithm with that of the FP-tree approach in both homogeneous and heterogeneous environments. Large practical datasets taken from the UCI Machine Learning Repository (Adult, Mushroom, and Letter Recognition) are used to measure system performance. The detailed experimental procedure and result analysis are also discussed. In future work, the security issues among different local datasets and the high communication cost of data migration can be considered.
Keywords :
data mining; grid computing; parallel processing; pattern classification; UCI machine repository; computational grid; data migration; data repositories; distributed apriori algorithms; distributed computing; distributed data mining systems; distributed knowledge discovery systems; frequent pattern growth; geographic distribution; parallel data mining systems; security; Application software; Astrochemistry; Astronomy; Biology computing; Chemical technology; Data engineering; Marine technology; Space technology; Association rules; High performance
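
The local counting and pruning step summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `local_frequent_itemsets` and `coordinator_merge` are hypothetical, grid nodes are simulated as in-memory lists, only the Apriori variant is shown (FP-growth is omitted), and the coordinator's second counting pass is an assumption, since the abstract does not say how the coordinator combines the reported patterns.

```python
from itertools import combinations


def local_frequent_itemsets(transactions, min_support):
    """Level-wise Apriori on one node's data; returns {itemset: local count}."""
    transactions = [frozenset(t) for t in transactions]
    min_count = min_support * len(transactions)

    # Level 1: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_count}
    result = dict(frequent)

    k = 2
    while frequent:
        # Join frequent (k-1)-itemsets into size-k candidates and keep only
        # those whose (k-1)-subsets are all frequent (the Apriori property).
        candidates = set()
        for a, b in combinations(list(frequent), 2):
            union = a | b
            if len(union) == k and all(
                frozenset(s) in frequent for s in combinations(union, k - 1)
            ):
                candidates.add(union)
        # Count local support and prune locally infrequent candidates.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {s: c for s, c in counts.items() if c >= min_count}
        result.update(frequent)
        k += 1
    return result


def coordinator_merge(node_datasets, min_support):
    """Assumed coordinator step: union the local results, then recount globally."""
    # Each simulated node reports its locally frequent itemsets.
    candidates = set()
    for data in node_datasets:
        candidates |= set(local_frequent_itemsets(data, min_support))

    # Assumption: a second pass gathers exact global counts for the candidates.
    total = sum(len(d) for d in node_datasets)
    global_counts = {
        c: sum(1 for d in node_datasets for t in d if c <= frozenset(t))
        for c in candidates
    }
    return {s: n for s, n in global_counts.items() if n >= min_support * total}
```

Calling `coordinator_merge([node_a, node_b], 0.5)` with two small lists of transactions returns the item sets that meet the 50% support threshold over the combined data. Applying the same relative threshold on every node means that any globally frequent item set is locally frequent on at least one node, so the union of local results is a complete candidate set for the recount.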