A dataset for evaluating identifier splitters

Author

Binkley, David ; Lawrie, Dawn ; Pollock, Lori ; Hill, Emily ; Vijay-Shanker, K.

Author_Institution

Loyola Univ. Maryland, Baltimore, MD, USA

fYear

2013

fDate

18-19 May 2013

Firstpage

401

Lastpage

404

Abstract

Software engineering and evolution techniques have recently started to exploit the natural language information in source code. A key step in doing so is splitting identifiers into their constituent words. While simple in concept, identifier splitting raises several challenging issues, leading to a range of splitting techniques. Consequently, the research community would benefit from a dataset (i.e., a gold set) that facilitates comparative studies of identifier splitting techniques. A gold set of 2,663 split identifiers was constructed from 8,522 individual human splitting judgements and can be obtained from www.cs.loyola.edu/~binkley/ludiso. This set's construction and observations aimed at its effective use are described.
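
To make the idea of comparing splitters against a gold set concrete, the following is a minimal sketch in Python. It is not taken from the paper: `split_identifier` is a naive camelCase/underscore baseline, the four `gold` entries are hypothetical examples rather than records from the actual dataset, and exact-match accuracy is assumed as the score only for illustration.

```python
import re


def split_identifier(identifier):
    """Naive baseline splitter (an assumption, not one of the paper's techniques).

    Splits on underscores and other non-word characters, then on
    camelCase boundaries, acronym runs, and digit runs.
    """
    words = []
    for part in re.split(r"[_\W]+", identifier):
        # Acronym run not followed by lowercase | optional capital + lowercase run | digits
        words.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+", part))
    return [w.lower() for w in words]


def accuracy(splitter, gold):
    """Fraction of identifiers whose predicted split exactly matches the gold split."""
    correct = sum(1 for ident, expected in gold.items() if splitter(ident) == expected)
    return correct / len(gold)


# Hypothetical gold entries for illustration only; not drawn from the dataset.
gold = {
    "getFileName": ["get", "file", "name"],
    "max_buffer_size": ["max", "buffer", "size"],
    "XMLParser": ["xml", "parser"],
    "usrname": ["usr", "name"],  # no case or underscore cue, so the baseline misses it
}

score = accuracy(split_identifier, gold)
```

Identifiers such as `usrname`, which carry no case or separator cue, are the kind of case where a convention-based baseline fails and where human splitting judgements are needed to define the expected answer.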

Keywords

computational linguistics; program interpreters; software engineering; source coding; constituent words; human splitting judgements; identifier splitter evaluation dataset; identifier splitting techniques; natural language information; software evolution techniques; source code; Data mining; Educational institutions; Gold; Java; Software; Speech recognition

fLanguage

English

Publisher

IEEE

Conference_Titel

Mining Software Repositories (MSR), 2013 10th IEEE Working Conference on

Conference_Location

San Francisco, CA

ISSN

2160-1852

Print_ISBN

978-1-4799-0345-0

Type

conf

DOI

10.1109/MSR.2013.6624055

Filename

6624055