• DocumentCode
    60193
  • Title
    Data Quality: Some Comments on the NASA Software Defect Datasets
  • Author
    Shepperd, Martin; Song, Qinbao; Sun, Zhongbin; Mair, C.
  • Author_Institution
    Dept. of IS & Comput., Brunel Univ., Uxbridge, UK
  • Volume
    39
  • Issue
    9
  • fYear
    2013
  • fDate
    Sept. 2013
  • Firstpage
    1208
  • Lastpage
    1215
  • Abstract
    Background: Self-evidently, empirical analyses rely upon the quality of their data. Likewise, replications rely upon accurate reporting and using the same rather than similar versions of datasets. In recent years, there has been much interest in using machine learners to classify software modules into defect-prone and not defect-prone categories. The publicly available NASA datasets have been extensively used as part of this research. Objective: This short note investigates the extent to which published analyses based on the NASA defect datasets are meaningful and comparable. Method: We analyze the five studies published in the IEEE Transactions on Software Engineering since 2007 that have utilized these datasets and compare the two versions of the datasets currently in use. Results: We find important differences between the two versions of the datasets, implausible values in one dataset, and generally insufficient detail documented on dataset preprocessing. Conclusions: It is recommended that researchers 1) indicate the provenance of the datasets they use, 2) report any preprocessing in sufficient detail to enable meaningful replication, and 3) invest effort in understanding the data prior to applying machine learners.
  • Keywords
    data analysis; learning (artificial intelligence); pattern classification; software reliability; IEEE Transactions on Software Engineering; NASA software defect dataset; National Aeronautics and Space Administration; data preprocessing; data quality; data replication; dataset provenance; defect-prone classification; machine learning; not-defect-prone classification; software module classification; Abstracts; Communities; Educational institutions; NASA; PROM; Software; Sun; Empirical software engineering; defect prediction
  • fLanguage
    English
  • Journal_Title
    Software Engineering, IEEE Transactions on
  • Publisher
    IEEE
  • ISSN
    0098-5589
  • Type
    jour
  • DOI
    10.1109/TSE.2013.11
  • Filename
    6464273