• DocumentCode
    731533
  • Title
    Fuse: A Reproducible, Extendable, Internet-Scale Corpus of Spreadsheets
  • Author
    Barik, Titus ; Lubick, Kevin ; Smith, Justin ; Slankas, John ; Murphy-Hill, Emerson
  • Author_Institution
    ABB Corp. Res., Raleigh, NC, USA
  • fYear
    2015
  • fDate
    16-17 May 2015
  • Firstpage
    486
  • Lastpage
    489
  • Abstract
    Spreadsheets are perhaps the most ubiquitous form of end-user programming software. This paper describes a corpus, called Fuse, containing 2,127,284 URLs that return spreadsheets (and their HTTP server responses), and 249,376 unique spreadsheets, contained within a public web archive of over 26.83 billion pages. Obtained using nearly 60,000 hours of computation, the resulting corpus exhibits several useful properties over prior spreadsheet corpora, including reproducibility and extendability. Our corpus is unencumbered by any license agreements, available to all, and intended for wide usage by end-user software engineering researchers. In this paper, we detail the data and the spreadsheet extraction process, describe the data schema, and discuss the trade-offs between Fuse and other corpora.
  • Keywords
    Internet; software engineering; spreadsheet programs; Fuse; Internet-scale corpus; data schema; end-user programming software; end-user software engineering researchers; extendable-scale corpus; public Web archive; reproducible-scale corpus; spreadsheet corpora; spreadsheet extraction process; Data mining; Fuses; Metadata; Pipelines; Software; Software engineering; Uniform resource locators; MapReduce; corpus; dataset; end-user software engineering; spreadsheets
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Titel
    Mining Software Repositories (MSR), 2015 IEEE/ACM 12th Working Conference on
  • Conference_Location
    Florence
  • Type
    conf
  • DOI
    10.1109/MSR.2015.70
  • Filename
    7180124