M.S. Final Oral Exam: Seok Hwan Song
Speaker: Seok Hwan Song
Title: Quantitative Reasoning Ability of Large Language Models under Noisy Data
The quantitative reasoning capabilities of Large Language Models (LLMs) have improved enormously in recent years. These improvements have been demonstrated on tasks such as solving Math Word Problems (MWPs) and Chart Question Answering. However, real-world data often contains irrelevant information. Ideally, an LLM should remain accurate when irrelevant information (i.e., noise) is present in the input it uses to derive answers to a user’s question. To assess LLMs’ quantitative reasoning ability on noisy data, we propose a new dataset, Math Word Problems with Noises (MPN), which adds three types of noise to MWPs selected from four public datasets. We also propose a new solution, Noise Reduction Prompting (NRP), and its variant, NRP+, to handle noisy data. We evaluated LLMs’ quantitative reasoning ability on MPN and on two existing noisy datasets, GSM-IC and PlotQA. Some key findings are as follows. Both ChatGPT (gpt-3.5-turbo) and PaLM2 show a significant drop in absolute accuracy on MPN compared to the same MWPs without noise. State-of-the-art methods, namely Chain-of-Thought (CoT) Prompting, Least-to-Most (LTM) Prompting, Progressive-Hint Prompting, and Program-aided Language Models, suffer average accuracy drops of 14.4%, 12.2%, 17.2%, and 24.1%, respectively, whereas NRP and NRP+ limit the drop in average accuracy to only about 1.9% and 5.1%, respectively. NRP+ performs best on MPN without noise. Compared to GSM-IC, our noise types are more difficult. NRP+ outperforms existing prompting methods such as CoT and LTM by 2.7% and 11.2% on average, respectively.
Committee: Wallapak Tavanapong (major professor), Qi Li, and Julie Dickerson