Building Statistical Language Models of Code

Author

Schulam, Peter; Rosenfeld, Roni; Devanbu, Premkumar

Author_Institution

Language Technologies Institute, Carnegie Mellon University, Pittsburgh, PA, USA

fYear

2013

fDate

21 May 2013

Firstpage

1

Lastpage

3

Abstract

We present the Source Code Statistical Language Model data analysis pattern. Statistical language models have been an enabling tool for a wide array of important language technologies. Speech recognition, machine translation, and document summarization (to name a few) all rely on statistical language models to assign probability estimates to natural language utterances or sentences. In this data analysis pattern, we describe the process of building n-gram language models over software source files. We hope that by introducing the empirical software engineering community to best practices that have been established over the years in research for natural languages, statistical language models can become a tool that SE researchers are able to use to explore new research directions.

Keywords

data analysis; natural languages; software engineering; source coding; statistical analysis; building statistical language models; document summarization; empirical software engineering community; machine translation; n-gram language models; natural language sentences; natural language utterances; software source files; source code data analysis pattern; speech recognition; Buildings; Data models; Natural languages; Smoothing methods; Software engineering; Speech recognition; Vocabulary;

fLanguage

English

Publisher

IEEE

Conference_Titel

2013 1st International Workshop on Data Analysis Patterns in Software Engineering (DAPSE)

Conference_Location

San Francisco, CA

Type

conf

DOI

10.1109/DAPSE.2013.6603797

Filename

6603797