Regional Information Center for Science and Technology - On the Cost of Mining Very Large Open Source Repositories

DocumentCode :

3275728

Title :

On the Cost of Mining Very Large Open Source Repositories

Author :

Banerjee, Sean ; Cukic, Bojan

Author_Institution :

Robot. Inst., Carnegie Mellon Univ., Pittsburgh, PA, USA

fYear :

2015

fDate :

23 May 2015

Firstpage :

Lastpage :

Abstract :

Open source bug tracking systems provide a rich information suite that is actively used by software engineering researchers to design solutions to triaging, duplicate classification and developer assignment problems. Today, open repositories often contain in excess of 100,000 reports, and in the cases of RedHat and Mozilla, over a million. Obtaining and analyzing the contents of such datasets is both time- and resource-consuming. By summarizing the related work we demonstrate that researchers often focused on smaller subsets of the data, and seldom embrace “big-dataism”. With the emergence of cloud-based computation systems such as Amazon EC2, one expects it to be easier to perform large-scale analyses. However, our detailed time and cost analysis indicates that significant challenges still remain. Acquiring the open source data can be time-intensive, and the acquisition is prone to being misinterpreted as a Denial of Service attack. Generating similarity scores for all prior reports, for example, is a polynomial time problem. In this paper, we present actual costs that we incurred when analyzing the complete repositories from Eclipse, Firefox and Open Office. In our approach, we relied on computing clusters to process the data in an attempt to reduce the cost of analyzing large datasets on the cloud. We present estimated costs for a researcher attempting to analyze complete datasets from Eclipse, Mozilla, Novell and RedHat using the best possible resources. In an ideal situation, with no bottlenecks, a researcher investing just over $40,000 and 2 weeks of nonstop computing time would be able to measure similarity of problem reports within all four datasets.

Keywords :

Big Data; cloud computing; computational complexity; data mining; public domain software; software engineering; Amazon EC2; Big-Dataism; Eclipse; Firefox; Novell; Open Office; RedHat; cloud based computation systems; cost analysis; data processing; denial of service attacks; open source bug tracking systems; polynomial time problem; software engineering; time analysis; very large open source repository mining; Accuracy; Computer crime; Data mining; Graphics processing units; Random access memory; XML;

fLanguage :

English

Publisher :

IEEE

Conference_Title :

Big Data Software Engineering (BIGDSE), 2015 IEEE/ACM 1st International Workshop on

Conference_Location :

Florence

Type :

conf

DOI :

10.1109/BIGDSE.2015.16

Filename :

7166057

Link To Document :

https://search.ricest.ac.ir/dl/search/defaultta.aspx?DTC=49&DC=3275728