Refactoring Programs to Improve the Performance of Deep Learning for Vulnerability Detection
Software vulnerabilities allow attackers to take down important services and steal users’ private data. Many new vulnerabilities are reported each year, showing that vulnerabilities are prevalent in software programs. Therefore, it is critically important for developers to detect vulnerabilities before releasing their software. Recently, deep learning models have been successfully trained to detect vulnerabilities by learning to classify vulnerable and non-vulnerable code from open-source projects on GitHub. However, the existing datasets suffer from limited and imbalanced data, and both factors hurt the models’ performance.
We implemented a framework for automatically applying refactoring as a data augmentation technique to increase the diversity of program datasets and address data imbalance. Our refactoring framework can be tuned for different models and datasets. We evaluated our approach by using it to train state-of-the-art deep learning models. Our results show that naively refactoring programs does not significantly improve model performance. However, when tuned to produce diverse programs and to target imbalanced data, our method does improve performance. We found that some refactorings decrease model performance because they introduce tokens outside the model’s vocabulary. We also found that our method can be applied to the majority of programs in practice. Based on our results, we believe that refactoring is a useful data augmentation technique that will benefit further research and applications of deep learning for vulnerability detection.
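To illustrate the idea, here is a minimal sketch of one refactoring commonly used for augmentation: a semantics-preserving variable rename applied to a code snippet before training. All names here (`rename_variable`, `snippet`, the replacement identifiers) are illustrative assumptions, not the thesis framework itself.

```python
import re

def rename_variable(code: str, old: str, new: str) -> str:
    """Rename a variable via whole-word replacement.

    Semantics-preserving provided `new` does not already occur in `code`.
    """
    return re.sub(rf"\b{re.escape(old)}\b", new, code)

# Illustrative C-like snippet (not from the thesis dataset).
snippet = "int buf_len = strlen(src); memcpy(dst, src, buf_len);"

# Generate renamed variants to add diversity. Choosing common identifiers
# helps avoid introducing tokens outside the model's vocabulary, the failure
# mode the abstract mentions.
augmented = [rename_variable(snippet, "buf_len", n) for n in ("len", "size", "n")]
```

Applying such transformations selectively to the underrepresented (vulnerable) class is one way the imbalance targeting described above could be realized.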
Committee: Wei Le (major professor), Myra Cohen, and Hongyang Gao
Join on Zoom: Meeting ID: 992 5473 5260