• DocumentCode
    652664
  • Title
    Can Automated Text Classification Improve Content Analysis of Software Project Data?

  • Author
    Noll, James ; Seichter, Dominik ; Beecham, Sarah

  • Author_Institution
    Irish Software Eng. Res. Centre, Univ. of Limerick, Limerick, Ireland
  • fYear
    2013
  • fDate
    10-11 Oct. 2013
  • Firstpage
    300
  • Lastpage
    303
  • Abstract
    Content analysis is a useful approach for analyzing unstructured software project data, but it is labor-intensive and slow. Can automated text classification (using supervised machine learning) be used to reduce the labor or improve the speed of content analysis? We conducted a case study involving data from a previous study that employed content analysis of an open source software project. We used a human-coded data set with 3256 samples to create training sets ranging in size from 100 to 3000 samples, and used them to train an "ensemble" text classifier to assign one of five different categories to a test set of samples. The results show that the automated classifier could be trained to recognize categories, but much less accurately than the human classifiers. In particular, both precision and recall for low-frequency categories were very low (less than 20%). Nevertheless, we hypothesize that automated classifiers could be used to filter a sample to identify common categories before human researchers examine the remainder for more difficult categories.
  • Keywords
    data analysis; learning (artificial intelligence); pattern classification; project management; public domain software; software management; text analysis; automated text classification; content analysis; ensemble text classifier; human-coded data set; open source software project; size training sets; supervised machine learning; unstructured software project data analysis; Accuracy; Encoding; Message systems; Software; Software engineering; Software measurement; Training; Content Analysis; Machine Learning; Open Source Software; Qualitative Research; Software Engineering; Text Classification;
  • fLanguage
    English
  • Publisher
    IEEE
  • Conference_Title
    Empirical Software Engineering and Measurement, 2013 ACM / IEEE International Symposium on
  • Conference_Location
    Baltimore, MD
  • ISSN
    1938-6451
  • Print_ISBN
    978-0-7695-5056-5
  • Type
    conf
  • DOI
    10.1109/ESEM.2013.52
  • Filename
    6681372