The ability to organically reason over and with both text and images is a pillar of human intelligence, yet the ability of Multimodal Large Language Models (MLLMs) to perform such multimodal reasoning remains under-explored. Existing benchmarks often emphasize text-dominant reasoning or rely on shallow visual cues, failing to adequately assess integrated visual and textual reasoning.
We introduce EMMA, an Enhanced MultiModal ReAsoning benchmark targeting organic multimodal reasoning across mathematics, physics, chemistry, and coding. EMMA tasks demand advanced cross-modal reasoning that cannot be addressed by reasoning independently in each modality, offering an enhanced test suite for MLLMs' reasoning capabilities. Our evaluation of state-of-the-art MLLMs on EMMA reveals significant limitations in handling complex multimodal and multi-step reasoning tasks, with even advanced techniques like Chain-of-Thought prompting and test-time compute scaling underperforming. These findings underscore the need for improved multimodal architectures and training paradigms to close the gap between human and model reasoning in multimodality.
Leaderboard columns: Name, Size, CoT, and scores (Overall, Math, Physics, Chemistry, Coding) on both EMMA and EMMA-Mini.
Overall results of different models on the EMMA leaderboard. The best-performing model in each category is in bold, and the second best is underlined.
EMMA is composed of 2,788 problems, of which 1,796 are newly constructed, across four domains: math, physics, chemistry, and coding. To provide fine-grained insights into how MLLMs might fail in multimodal reasoning, we assign skill labels to each problem in our benchmark. These labels are either created by domain experts or assigned by GPT-4o and subsequently verified by experts. As shown in the category figure, questions in EMMA assess a wide array of multimodal reasoning skills. For example, the pattern inference problem in math challenges models to identify and generalize visual patterns; the visual decomposition simulation problem in physics requires graphically decomposing forces to determine resultant effects; the reaction simulation problem in chemistry demands precise interpretation and simulation of electron movement; and the 3D visualization problem in coding requires connecting code to the 3D plot it produces.
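To make the labeling concrete, below is a minimal sketch of how one might inspect the domain and skill distribution once the benchmark data is downloaded. The file name and field names ("subject", "skill") are illustrative assumptions, not the official schema.

```python
# Sketch: count EMMA problems per domain and per multimodal-skill label.
# Assumes a hypothetical local export of the benchmark as JSON Lines;
# field names are placeholders and may differ from the released data.
import json
from collections import Counter

with open("emma_test.jsonl", "r", encoding="utf-8") as f:
    problems = [json.loads(line) for line in f]

by_domain = Counter(p["subject"] for p in problems)  # math / physics / chemistry / coding
by_skill = Counter(p["skill"] for p in problems)     # fine-grained multimodal skill labels

print("Problems per domain:", by_domain)
print("Most common skill labels:", by_skill.most_common(10))
```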
Using the labels assigned to each question based on the multimodal skills it assesses, we find that CoT prompting hurts performance on visual-reasoning-heavy tasks, while it benefits closed-source models on tasks where textual CoT is theoretically useful.
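For reference, the sketch below contrasts direct answering with Chain-of-Thought prompting on a single multimodal question, using an OpenAI-style chat API. The model name, image path, and prompt templates are illustrative assumptions and may differ from the exact evaluation setup used for EMMA.

```python
# Sketch: direct vs. Chain-of-Thought (CoT) prompting on one image question.
# Model name and prompts are placeholders, not the official EMMA protocol.
import base64

from openai import OpenAI

client = OpenAI()

DIRECT_PROMPT = "Answer with only the letter of the correct option."
COT_PROMPT = "Think step by step, then end your response with 'Answer: <letter>'."

def ask(question: str, image_path: str, instruction: str) -> str:
    """Send one multiple-choice question plus its image to the model."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": f"{instruction}\n\n{question}"},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

question = "Which option continues the visual pattern? (A) (B) (C) (D)"
# direct_answer = ask(question, "problem.png", DIRECT_PROMPT)
# cot_answer = ask(question, "problem.png", COT_PROMPT)
```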
Distribution of error types made by o1 on the math and coding portions of EMMA-mini. The majority of errors arise in visual reasoning.
Math Case 1
Physics Case 1
Physics Case 2
Chemistry Case 1
More examples are detailed in the paper.
@misc{hao2025mllmsreasonmultimodalityemma,
title={Can MLLMs Reason in Multimodality? EMMA: An Enhanced MultiModal ReAsoning Benchmark},
author={Yunzhuo Hao and Jiawei Gu and Huichen Will Wang and Linjie Li and Zhengyuan Yang and Lijuan Wang and Yu Cheng},
year={2025},
eprint={2501.05444},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2501.05444},
}