In today's software-centric world, ultra-large-scale software repositories such as SourceForge, GitHub, and Google Code, each hosting hundreds of thousands of projects, are the new Library of Alexandria. They contain an enormous corpus of software and of information about software and software projects. Scientists and engineers alike are interested in analyzing this wealth of information to test important research hypotheses. However, the current barrier to entry is prohibitive: deep expertise and sophisticated tools are needed to write programs that access version control systems, store and retrieve workable data subsets, and perform the required ultra-large-scale analysis. The goal of this project is to accelerate the pace of software engineering research and to increase its reusability and replicability, while properly curating the data and analyses.
This project is building a globally available CISE research infrastructure called Boa to support such research. The project designs a new programming language that hides the details of programmatically accessing version control systems, data storage and retrieval, data mining, and parallelization, so that scientists and engineers can focus on the program logic. The project also designs a data mining infrastructure for Boa and a BIGDATA repository containing 700,000+ open source projects, which together support experiments that analyze ultra-large-scale software repositories. The broader impacts of Boa stem from its potential to enable developers, designers, and researchers to build intuitive, multi-modal, user-centric scientific applications that support research on the individual, social, legal, policy, and technical aspects of open source software development. This advance will be achieved primarily by significantly lowering the barrier to entry, thereby enabling a larger and more ambitious line of data-intensive scientific discovery in this area.