Abstract: Large language models (LLMs) have shown strong performance in solving math problems, and there is growing research on evaluating their robustness. Unlike previous studies that create problem variants by perturbing a single problem, this paper focuses on the interaction between problems. Specifically, we combine two original problems into one through a logical connection and measure LLM performance on both the combined and the original problems. We propose an automated combination pipeline with 98.2% accuracy and conduct extensive experiments on three datasets (one manual, two synthetic), covering 13 LLMs of various sizes, including open-source models (7B to 671B) and closed-source models such as o1-mini and o1-preview. Results show that simple format combinations can significantly reduce LLM performance, even when the underlying math remains unchanged. Additionally, combined problems with intermediate variables and values offer a better way to evaluate solution accuracy. Finally, we analyze the impact of factors such as difficulty and length on LLM performance, offering insights for future research.
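To make the combination idea concrete, the sketch below chains two word problems by treating the answer of the first as an intermediate variable of the second. This is a minimal illustration under assumptions, not the paper's actual pipeline; the example problems, the template wording, and the `combine_problems` helper are all hypothetical.

```python
# Minimal sketch (assumption, not the paper's pipeline): chain two word
# problems by feeding the answer of the first into the second as an
# intermediate variable X. Problem texts and wording are illustrative.

def combine_problems(p1_text: str, p1_answer_name: str, p2_template: str) -> str:
    """Build a combined problem whose second part depends on the first's answer.

    p1_text:        the first original problem, unchanged.
    p1_answer_name: a variable name standing for the first problem's answer.
    p2_template:    the second problem with a placeholder for that variable.
    """
    part2 = p2_template.format(X=p1_answer_name)
    return (
        f"Problem A: {p1_text}\n"
        f"Let {p1_answer_name} be the answer to Problem A.\n"
        f"Problem B: {part2}\n"
        f"Report both {p1_answer_name} and the final answer to Problem B."
    )


if __name__ == "__main__":
    p1 = "Tom buys 3 packs of 12 pencils. How many pencils does he have?"
    p2 = "A class shares {X} pencils equally among 6 students. How many does each student get?"
    print(combine_problems(p1, "X", p2))
```

Because the intermediate variable X must be stated explicitly, a grader can check both the intermediate value and the final answer, which is what the abstract refers to when it says combined problems offer a better way to evaluate solution accuracy.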
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: evaluation, benchmarking
Contribution Types: Model analysis & interpretability, Data resources, Data analysis
Languages Studied: English
Submission Number: 8499