Ph.D. Preliminary Oral Exam: Hung Phan

Speaker: Hung Phan
Friday, October 7, 2022 - 1:00pm

The Graph in Software Engineering and The Graph in Natural Language Processing: Combining, Applying, and Improving the Performance of Types of Graph Representation in Software Engineering Research

Graph Neural Networks (GNNs) and graph representations have many applications in Natural Language Processing (NLP). In Software Engineering (SE), GNN models have been used to learn from the tree and graph structure of source code to solve problems such as method name generation and variable misuse detection. We identify three limitations of applying GNNs and graph representations in SE research. First, while GNN models have been applied to source code as input, they have rarely been applied to natural language artifacts in SE, such as code comments, software issues, or other software documentation. Second, while the Natural Language Parse Tree (NLPT) with dependency edges is a well-known graph representation in NLP, it has not been considered when building popular natural language-programming language (NL-PL) models, unlike its source code counterparts in SE such as the Abstract Syntax Tree (AST), the Data Flow Graph, and the Control Flow Graph. Third, the most popular NLPT generator, Stanford CoreNLP, was trained on a general natural language corpus, which is not optimized to represent programming-related information.
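To make the NLPT idea concrete, the following is a minimal illustrative sketch (not from the talk): a dependency-edge graph for a short code comment, with the parse hand-annotated for the example rather than produced by Stanford CoreNLP. Each edge is a (head, relation, dependent) triple, the same shape of information an NLPT generator would emit, turned into an adjacency list as a GNN would consume it.

```python
# Hand-annotated dependency edges for one code-comment sentence.
# The relation labels follow Universal Dependencies style; the parse itself
# is an assumption written by hand for illustration, not parser output.
comment = "Returns the index of the first matching element"

nlpt_edges = [
    ("Returns", "obj", "index"),
    ("index", "det", "the"),
    ("index", "nmod", "element"),
    ("element", "case", "of"),
    ("element", "det", "the"),
    ("element", "amod", "matching"),
    ("matching", "advmod", "first"),
]

# Build an adjacency list: head word -> [(relation, dependent), ...],
# a typical input form for graph learning over the parse.
adjacency = {}
for head, rel, dep in nlpt_edges:
    adjacency.setdefault(head, []).append((rel, dep))

for head, deps in adjacency.items():
    print(head, "->", deps)
```

The point of the sketch is only that natural language artifacts in SE (comments, issues, documentation) carry graph structure of the same kind that ASTs provide for source code.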

In this proposal, we attempt to answer whether these limitations exist in SE and propose approaches to overcome them. In the first chapter, we study the applicability of original Text-Level Graph Neural Networks from NLP to Software Effort Estimation, given the titles and descriptions of software issues. We show that the original GNN models and graph representations from NLP achieve low accuracy and longer running times than traditional approaches for this problem. In the second chapter, we propose a Heterogeneous Graph Neural Network approach called HeteroSP that optimizes the graph representation and learning to reflect the characteristics of software issues. In our evaluation, HeteroSP achieved the lowest error rates in the most challenging configuration, outperforming two state-of-the-art approaches, Deep-SE and GPT2SP. In addition, our approach required only 600 seconds for training and testing on the whole dataset, while Deep-SE required at least 16 hours to complete the same process. In chapter 3, we propose PropMiner, a library that leverages properties of the NLPT and the AST to improve the original NL-PL model GraphCodeBERT on code search and code summarization. In the evaluation, we show that PropMiner helps GraphCodeBERT increase the MRR of code search by 1.5% and the BLEU score of code summarization by 1.97%. In the full implementation of this proposal, we attempt to design approaches that improve the original NLPT to embed source code information and optimize heterogeneous graph neural network models for other problems such as auto parallelization.
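The heterogeneous-graph idea can be sketched as follows. This is an assumption-laden toy, not the actual HeteroSP implementation: it builds a graph over one software issue with two node types (word and issue) and two edge types (word-in-issue membership and word-word co-occurrence within a sliding window), the general shape heterogeneous GNNs consume for text-level learning.

```python
# Toy heterogeneous graph over one software issue (illustrative only; the
# issue text, node types, and edge types are assumptions for the example).
issue = {
    "title": "crash on startup",
    "description": "app crashes when config missing",
}

tokens = (issue["title"] + " " + issue["description"]).split()
word_nodes = sorted(set(tokens))
issue_node = "issue#0"

# Edge type 1: each word node is linked to the issue node it appears in.
word_issue_edges = [(w, issue_node) for w in word_nodes]

# Edge type 2: undirected word-word co-occurrence within a window of 2.
window = 2
cooccur = set()
for i, w in enumerate(tokens):
    for j in range(i + 1, min(i + window + 1, len(tokens))):
        if w != tokens[j]:
            cooccur.add(tuple(sorted((w, tokens[j]))))

hetero_graph = {
    "nodes": {"word": word_nodes, "issue": [issue_node]},
    "edges": {
        "word-in-issue": word_issue_edges,
        "word-cooccur": sorted(cooccur),
    },
}
print(len(hetero_graph["nodes"]["word"]), "word nodes,",
      len(hetero_graph["edges"]["word-cooccur"]), "co-occurrence edges")
```

Typing the nodes and edges this way lets a heterogeneous GNN learn separate message-passing functions per edge type, which is the property the abstract credits for reflecting the characteristics of software issues.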

Committee: Ali Jannesari (major professor), Myra Cohen, Wei Le, Carl Chang, and Chris Quinn

Exam Location: Atanasoff 235