Title :
Building Statistical Language Models of Code
Author :
Schulam, Peter ; Rosenfeld, Roni ; Devanbu, Premkumar
Author_Institution :
Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA, USA
Abstract :
We present the Source Code Statistical Language Model data analysis pattern. Statistical language models have been an enabling tool for a wide array of important language technologies. Speech recognition, machine translation, and document summarization (to name a few) all rely on statistical language models to assign probability estimates to natural language utterances or sentences. In this data analysis pattern, we describe the process of building n-gram language models over software source files. We hope that by introducing the empirical software engineering community to best practices established over years of natural language research, statistical language models can become a tool that SE researchers use to explore new research directions.
Keywords :
data analysis; natural languages; software engineering; source coding; statistical analysis; building statistical language models; document summarization; empirical software engineering community; machine translation; n-gram language models; natural language sentences; natural language utterances; software source files; source code data analysis pattern; speech recognition; Buildings; Data models; Smoothing methods; Vocabulary
Conference_Title :
2013 1st International Workshop on Data Analysis Patterns in Software Engineering (DAPSE)
Conference_Location :
San Francisco, CA
DOI :
10.1109/DAPSE.2013.6603797
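Example :
The abstract describes building n-gram language models over software source files. The following is a minimal sketch of that idea, not the authors' procedure: it assumes a bigram (n = 2) model, a crude regex tokenizer, and add-one (Laplace) smoothing, all chosen here for brevity. The names `tokenize`, `train_bigram`, `bigram_prob`, and `cross_entropy` are hypothetical.

```python
import math
import re
from collections import Counter

# Assumption: a crude lexer that splits code into identifiers, integers,
# and single punctuation characters. A real study would use the
# language's own lexer.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def tokenize(source):
    """Split a source string into a list of token strings."""
    return TOKEN_RE.findall(source)


def train_bigram(files):
    """Count unigrams and bigrams over an iterable of source strings."""
    unigrams = Counter()
    bigrams = Counter()
    for src in files:
        # Pad each file with start and end markers.
        tokens = ["<s>"] + tokenize(src) + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def bigram_prob(prev, tok, unigrams, bigrams):
    """Add-one smoothed estimate of P(tok | prev).

    Assumption: the vocabulary is the set of token types seen in
    training (including the padding markers); tokens outside it are
    not modeled separately in this sketch.
    """
    vocab_size = len(unigrams)
    return (bigrams[(prev, tok)] + 1) / (unigrams[prev] + vocab_size)


def cross_entropy(source, unigrams, bigrams):
    """Average negative log2 probability per token (bits per token)."""
    tokens = ["<s>"] + tokenize(source) + ["</s>"]
    log_prob = sum(
        math.log2(bigram_prob(p, t, unigrams, bigrams))
        for p, t in zip(tokens, tokens[1:])
    )
    return -log_prob / (len(tokens) - 1)


# Toy corpus of two one-line "files".
unigrams, bigrams = train_bigram(["x = 1", "x = 2"])
```

With this toy corpus, a token sequence resembling the training data (such as `x = 1`) receives a lower cross-entropy than a reordered one (such as `1 = x`). Add-one smoothing is the simplest option and is used here only to keep the sketch short.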