*Equal contribution
🔥 [2025-03-02]: Exciting updates on the Leaderboard! We've added claude-3-7-sonnet-20250219 and kimi-k1.5-Preview performances. Currently, claude-3-7-sonnet-20250219🥇, Gemini-2.0-Flash-Thinking-exp-0121🥈, and o1🥉 lead on EMMA-Mini.
🔥 [2025-01-28]: We've added Gemini-2.0-Flash-Thinking-0121 and QVQ-72B-Preview performances on the Leaderboard!
🚀 [2025-01-09]: We released EMMA, a benchmark for advanced multimodal reasoning. 🥳

The ability to organically reason over and with both text and images is a pillar of human intelligence, yet the ability of Multimodal Large Language Models (MLLMs) to perform such multimodal reasoning remains under-explored. Existing benchmarks often emphasize text-dominant reasoning or rely on shallow visual cues, failing to adequately assess integrated visual and textual reasoning.
We introduce EMMA (Enhanced MultiModal ReAsoning), a benchmark targeting organic multimodal reasoning across mathematics, physics, chemistry, and coding. EMMA tasks demand advanced cross-modal reasoning that cannot be addressed by reasoning independently in each modality, offering an enhanced test suite for MLLMs' reasoning capabilities.
Our evaluation of state-of-the-art MLLMs on EMMA reveals significant limitations in handling complex multimodal and multi-step reasoning tasks, with even advanced techniques like Chain-of-Thought prompting and test-time compute scaling underperforming. These findings underscore the need for improved multimodal architectures and training paradigms to close the gap between human and model reasoning in multimodality.
Leaderboard (interactive): each entry reports a model's Name, Size, and CoT prompting setting, with scores on EMMA and EMMA-Mini broken down into Overall, Math, Physics, Chemistry, and Coding.

Overall results of different models on the EMMA leaderboard. The best-performing model in each category is in bold, and the second best is underlined.
EMMA is composed of 2,788 problems, of which 1,796 are newly constructed, across four domains: math, physics, chemistry, and coding. To provide fine-grained insights into how MLLMs might fail in multimodal reasoning, we assign labels to each problem in our benchmark. These labels are either created by domain experts or assigned by GPT-4o and subsequently verified by experts. As shown in the category figure, questions in EMMA assess a wide array of multimodal reasoning skills. For example, the pattern inference problem in math challenges models to identify and generalize visual patterns; the visual decomposition simulation problem in physics requires graphically decomposing forces to determine resultant effects; the reaction simulation problem in chemistry demands precise interpretation and simulation of electron movement; and the 3D visualization problem in coding requires mentally executing code to predict the 3D plot it renders.
We also try various test-time compute scaling strategies. While they tend to boost model performance, they are far from enough to close the gap to human-level performance. The best model and scaling strategy configuration we try still trails humans by 27%.
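As one concrete illustration, the sketch below implements a simple test-time compute scaling strategy, majority voting over sampled responses. The single-letter answer format, the `extract_choice` regex, and the `sample_fn` interface are illustrative assumptions rather than the exact pipeline used in our experiments.

```python
# A minimal sketch of one test-time compute scaling strategy (majority voting):
# sample several candidate responses at non-zero temperature and return the
# most common extracted answer.
import re
from collections import Counter
from typing import Callable, Optional

def extract_choice(response: str) -> Optional[str]:
    # Assumption: multiple-choice answers are a single letter A-E; take the last one mentioned.
    matches = re.findall(r"\b([A-E])\b", response)
    return matches[-1] if matches else None

def majority_vote(sample_fn: Callable[[], str], n: int = 8) -> Optional[str]:
    # sample_fn() returns one sampled model response per call (temperature > 0).
    votes = Counter(
        choice for choice in (extract_choice(sample_fn()) for _ in range(n)) if choice
    )
    return votes.most_common(1)[0][0] if votes else None
```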
Using the labels assigned to each question based on the multimodal skills it assesses, we find that CoT prompting hurts performance on visual-reasoning-heavy tasks, while it benefits closed-source models on tasks where textual CoT is theoretically useful. A minimal sketch of such a CoT query is shown below.
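The snippet below sketches multimodal Chain-of-Thought prompting, assuming an OpenAI-compatible chat endpoint that accepts base64-encoded images; the model name, prompt wording, and decoding settings are illustrative assumptions, not the exact configuration used in our evaluation.

```python
# A minimal sketch of multimodal Chain-of-Thought prompting (illustrative setup,
# not the exact EMMA evaluation configuration).
import base64
from openai import OpenAI

client = OpenAI()

def ask_with_cot(question: str, image_path: str, model: str = "gpt-4o") -> str:
    # Encode the diagram/figure that accompanies the question.
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model=model,
        temperature=0.7,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"{question}\nThink step by step, then state your final answer."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```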
Distribution of error types made by o1 on the math and coding portions of EMMA-Mini. The majority of errors arise in visual reasoning.
@article{hao2025can,
title={Can MLLMs Reason in Multimodality? EMMA: An Enhanced MultiModal ReAsoning Benchmark},
author={Hao, Yunzhuo and Gu, Jiawei and Wang, Huichen Will and Li, Linjie and Yang, Zhengyuan and Wang, Lijuan and Cheng, Yu},
journal={arXiv preprint arXiv:2501.05444},
year={2025}
}