DocumentCode :
3275728
Title :
On the Cost of Mining Very Large Open Source Repositories
Author :
Banerjee, Sean ; Cukic, Bojan
Author_Institution :
Robot. Inst., Carnegie Mellon Univ., Pittsburgh, PA, USA
fYear :
2015
fDate :
23 May 2015
Firstpage :
37
Lastpage :
43
Abstract :
Open source bug tracking systems provide a rich body of information that software engineering researchers actively use to design solutions to triaging, duplicate classification and developer assignment problems. Today, open repositories often contain in excess of 100,000 reports and, in the cases of RedHat and Mozilla, over a million. Obtaining and analyzing the contents of such datasets is both time- and resource-consuming. By summarizing the related work, we demonstrate that researchers have often focused on smaller subsets of the data and seldom embrace “big-dataism”. With the emergence of cloud-based computation systems such as Amazon EC2, one expects it to be easier to perform large-scale analyses. However, our detailed time and cost analysis indicates that significant challenges remain. Acquiring the open source data can be time-intensive, and the requests are prone to being misinterpreted as Denial of Service attacks. Generating similarity scores for all prior reports, for example, is a polynomial time problem. In this paper, we present the actual costs that we incurred when analyzing the complete repositories from Eclipse, Firefox and Open Office. In our approach, we relied on computing clusters to process the data in an attempt to reduce the cost of analyzing large datasets on the cloud. We present estimated costs for a researcher attempting to analyze complete datasets from Eclipse, Mozilla, Novell and RedHat using the best possible resources. In an ideal situation, with no bottlenecks, a researcher investing just over $40,000 and 2 weeks of nonstop computing time would be able to measure the similarity of problem reports within all four datasets.
Keywords :
Big Data; cloud computing; computational complexity; data mining; public domain software; software engineering; Amazon EC2; Big-Dataism; Eclipse; Firefox; Novell; Open Office; RedHat; cloud based computation systems; cost analysis; data processing; denial of service attacks; open source bug tracking systems; polynomial time problem; time analysis; very large open source repository mining; Accuracy; Computer crime; Graphics processing units; Random access memory; XML
fLanguage :
English
Publisher :
IEEE
Conference_Titel :
Big Data Software Engineering (BIGDSE), 2015 IEEE/ACM 1st International Workshop on
Conference_Location :
Florence
Type :
conf
DOI :
10.1109/BIGDSE.2015.16
Filename :
7166057