Identify Algorithms in Code
Choosing an algorithm to use can depend on a variety of factors such as runtime, space, and problem requirements. Many algorithms already have tested implementations in open source code. Reusing or interchanging algorithms can help save development time and improve the performance of applications.
Existing code search techniques often rely heavily on natural language components of the code. Simple techniques, such as Grep, are sensitive to the naming choices and conventions in code. Grep in particular do not precisely find implementations, outputting single lines. Grep does not rank the result, and is subject to lots of noise.
We develop a technique to search for algorithms in code using existing pseudo code as a query. We leverage the structural, mathematical and natural language components of pseudo code to find its corresponding implementation in code. This approach defines a simple language to represent pseudo code with atoms that include different features of the algorithm. We then use these features to search code using a bounding box and extract the code snippet that contains the functionality of the pseudo code.
We collected 19 different repositories in both C and Java and searched for 27 different algorithms. Using our technique we found over 60 algorithm implementations in roughly 1.8 million lines of code. We also conduct a comparison of our tool against a search implementation using a popular enterprise search platform Apache Solr, and show our approach can find more algorithms with high rank.
Committee: Wei Le (major professor), Pavan Aduri, David Fernandez-Baca