PhD Preliminary Oral Exam: Scott Song
Systematic Evaluation of Input Risks in Large and Multimodal Language Model Reasoning
Large Language Models (LLMs) and Multimodal Language Models (MLMs) have demonstrated impressive reasoning capabilities across diverse tasks, yet their performance is sensitive to variations in input. We systematically examine the risks posed by different input conditions and their impact on reasoning performance. We first investigate the influence of irrelevant information in math word problems across different LLMs, showing that distracting elements can degrade LLM performance. We propose a new dataset, Math Word Problems with Irrelevant Noise (MPN), along with a prompting method, Noise Reduction Prompting (NRP), to mitigate these effects. Next, we explore how reasoning performance differs across question formats (short answer, multiple choice, and true/false), revealing significant inconsistencies in both reasoning and final-answer accuracy that challenge the reliability of existing benchmarks. Finally, we present ongoing work on multimodal reasoning, focusing on biases introduced by chart images, such as variations in Y-axis scaling and formatting, which can affect MLM performance on chart-to-table translation tasks. Collectively, these studies demonstrate that input characteristics such as noise, question format, and visual presentation profoundly influence model reasoning and evaluation. This research underscores the need for carefully controlled benchmarks, not only to enable robust and reproducible assessment of models, but also to guide the development of stronger models.
Committee: Wallapak Tavanapong (major professor), Ying Cai, Mengdi Huai, Qi Li, and Julie Dickerson