Abstract: Large language models (LLMs) have shown strong performance in solving math problems, and there is growing research on evaluating their robustness. Unlike previous studies that create problem variants by perturbing a single problem, this paper focuses on the interaction between problems. Specifically, we combine two original problems into one through a logical connection and measure LLM performance on both the combined and the original problems. We propose an automated combination pipeline with 98.2% accuracy and conduct extensive experiments on three datasets (one manual, two synthetic), covering 13 LLMs of various sizes, including open-source models (7B to 671B) and closed-source models such as o1-mini and o1-preview. Results show that simple format combinations can significantly reduce LLM performance, even when the underlying math remains unchanged. Additionally, combined problems with intermediate variables and values offer a better way to evaluate solution accuracy. Finally, we analyze the impact of factors such as difficulty and length on LLM performance, offering insights for future research.
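To make the combination idea concrete, the sketch below chains two word problems by treating the answer of the first as an intermediate variable of the second. This is a minimal illustration under assumptions, not the paper's actual pipeline; the example problems, the template wording, and the `combine_problems` helper are all hypothetical.

```python
# Minimal sketch (assumption, not the paper's pipeline): chain two word
# problems by feeding the answer of the first into the second as an
# intermediate variable X. Problem texts and wording are illustrative.

def combine_problems(p1_text: str, p1_answer_name: str, p2_template: str) -> str:
    """Build a combined problem whose second part depends on the first's answer.

    p1_text:        the first original problem, unchanged.
    p1_answer_name: a variable name standing for the first problem's answer.
    p2_template:    the second problem with a placeholder for that variable.
    """
    part2 = p2_template.format(X=p1_answer_name)
    return (
        f"Problem A: {p1_text}\n"
        f"Let {p1_answer_name} be the answer to Problem A.\n"
        f"Problem B: {part2}\n"
        f"Report both {p1_answer_name} and the final answer to Problem B."
    )


if __name__ == "__main__":
    p1 = "Tom buys 3 packs of 12 pencils. How many pencils does he have?"
    p2 = "A class shares {X} pencils equally among 6 students. How many does each student get?"
    print(combine_problems(p1, "X", p2))
```

Because the intermediate variable X must be stated explicitly, a grader can check both the intermediate value and the final answer, which is what the abstract refers to when it says combined problems offer a better way to evaluate solution accuracy.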
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: evaluation, benchmarking
Contribution Types: Model analysis & interpretability, Data resources, Data analysis
Languages Studied: English
Submission Number: 8499