Title :
Exploring Java software vocabulary: A search and mining perspective
Author :
Linstead, Erik ; Hughes, Lindsey ; Lopes, Cristina ; Baldi, Pierre
Author_Institution :
Sch. of Inf. & Comput. Sci., Univ. of California, Irvine, CA
Abstract :
We conduct a large-scale analysis of Java source code vocabulary for 12,151 open source projects from SourceForge and Apache, a corpus substantially larger than those considered previously. Simple statistical analysis demonstrates robust power-law behavior for word count distributions across multiple program entities. We then identify salient vocabulary trends for classes, interfaces, methods, and fields. Our results provide low-level insight into the vocabulary space governing Java software development, with direct application to program comprehension and software search. Supplementary material may be found at: http://sourcerer.ics.uci.edu/suite2009/suite.html.
Keywords :
Java; data mining; statistical analysis; Apache; Java software development; Java software vocabulary; Java source code vocabulary; SourceForge; large-scale analysis; mining perspective; multiple program entities; program comprehension; robust power-law behavior; search perspective; software search; word count distribution; Application software; Computer languages; Information retrieval; Internet; Large-scale systems; Natural languages; Software tools; Vocabulary
Conference_Title :
Search-Driven Development-Users, Infrastructure, Tools and Evaluation, 2009. SUITE '09. ICSE Workshop on
Conference_Location :
Vancouver, BC
Print_ISBN :
978-1-4244-3740-5
DOI :
10.1109/SUITE.2009.5070017