Ph.D. Final Oral Exam: Hung Phan
Speaker: Hung Phan
Designing Artifact Representation and Automated Pipeline for Machine Learning based Software Engineering
Recent applications of Natural Language Processing (NLP) models have inspired numerous researchers to propose automated Software Engineering (SE) models for tasks including software estimation, natural language to code translation, and natural language to code search. A major challenge in achieving high accuracy on SE tasks with NLP models stems from the distinct characteristics of SE artifacts compared to typical NLP artifacts. This thesis analyzes two types of SE artifacts: software documentation and code snippets. In this work, I aim to optimize the performance of Large Language Models (LLMs) in SE through two research directions. In the first, I propose several representation techniques for SE artifacts so that the inputs to SE models accurately reflect the artifacts' characteristics. In the second, I enhance machine learning model pipelines by adapting them to my new artifact representations and by incorporating classical machine translation models to leverage their flexibility in code generation for automated SE tasks.
The practical outcomes of my research are three engines addressing two SE tasks. The first is HeteroSP, a Heterogeneous Graph Neural Network model for estimating story points from software issues, which surpasses DeepSE, the established software estimation approach based on a Recurrent Neural Network. The second is ASTTrans-CS, a code search model that uses Neural Machine Translation to map queries to the ASTTrans Representation and thereby enhance well-known Information Retrieval models such as GraphCodeBERT and UniXcoder, improving code search accuracy on the newly curated CAT benchmark of code queries and code snippets. To address the limitation of ASTTrans, which performs better on smaller datasets of fewer than 5,000 code snippets, I introduce the third engine, Oracle4CS. This approach integrates a more advanced code representation technique, ASTSum, with a pipeline that uses Statistical Machine Translation for code search on the standard CodeSearchNet dataset. Through evaluation, I demonstrate that classical Statistical Machine Translation can significantly enhance the code search process when used as an oracle to enrich input queries. My proposed artifact representations and pipelines show promising potential to enhance the performance of NLP models in SE tasks.
Committee: Ali Jannesari (major professor), Myra Cohen, Wei Le, Carl Chang, and Chris Quinn