Title :
Toward Deep Learning Software Repositories
Author :
White, Martin ; Vendome, Christopher ; Linares-Vasquez, Mario ; Poshyvanyk, Denys
Author_Institution :
Dept. of Comput. Sci., Coll. of William & Mary, Williamsburg, VA, USA
Abstract :
Deep learning subsumes algorithms that automatically learn compositional representations. The ability of these models to generalize well has ushered in tremendous advances in many fields such as natural language processing (NLP). Recent research in the software engineering (SE) community has demonstrated the usefulness of applying NLP techniques to software corpora. Hence, we motivate deep learning for software language modeling, highlighting fundamental differences between state-of-the-practice software language models and connectionist models. Our deep learning models are applicable to source code files (since they only require lexically analyzed source code written in any programming language) and other types of artifacts. We show how a particular deep learning model can remember its state to effectively model sequential data, e.g., streaming software tokens, and the state is shown to be much more expressive than discrete tokens in a prefix. Then we instantiate deep learning models and show that deep learning induces high-quality models compared to n-grams and cache-based n-grams on a corpus of Java projects. We experiment with two of the models' hyperparameters, which govern their capacity and the amount of context they use to inform predictions, before building several committees of software language models to aid generalization. Then we apply the deep learning models to code suggestion and demonstrate their effectiveness at a real SE task compared to state-of-the-practice models. Finally, we propose avenues for future work, where deep learning can be brought to bear to support model-based testing, improve software lexicons, and conceptualize software artifacts. Thus, our work serves as the first step toward deep learning software repositories.
Keywords :
Java; learning (artificial intelligence); natural language processing; program testing; project management; source code (software); Java project corpus; NLP techniques; SE community; SE task; automatic compositional representation learning; code suggestion; connectionist models; deep-learning software repositories; high-quality models; hyperparameters; lexically analyzed source code files; natural language processing; programming language; sequential data model; software artifact conceptualization; software corpora; software engineering community; software language modeling; software lexicon improvement; software token streaming; Computational modeling; Computer architecture; Context; Context modeling; Machine learning; Software; Training; Software repositories; deep learning; machine learning; n-grams; neural networks; software language models;
Conference_Title :
Mining Software Repositories (MSR), 2015 IEEE/ACM 12th Working Conference on
Conference_Location :
Florence
DOI :
10.1109/MSR.2015.38