DixitWorld: Evaluating Multimodal Abductive Reasoning in Vision-Language Models with Multi-Agent Dixit Gameplay
Published in ACL 2026 (Main Conference)
Authors: Yunxiang Mo, Tianshi Zheng, Qing Zong, Jiayu Liu, Baixuan Xu, Yauwai Yim, Chunkit Chan, Jiaxin Bai, Yangqiu Song.
Venue: ACL 2026 — Main Conference. AC meta-review 9/10 (Strong Accept); recommended for oral presentation (decision pending).
Extended version of the DixitWorld workshop paper at EMNLP 2025 Workshop (Spotlight).
Links: [OpenReview]
What’s new in the ACL version
Beyond the workshop paper, this version contributes:
- A larger and finer-grained DixitBench: expanded from 168 to 252 questions (84 images × 3 difficulty tiers — Easy / Medium / Hard) with systematically controlled distractor difficulty.
- A frontier-scale ablation: a new 72B-parameter open-source VLM is added to confirm that the storyteller–listener asymmetry persists at the open-source frontier and is not an artifact of model capacity.
- Calibration and sensitivity analyses: additional studies (Appendices) showing the structural nature of the asymmetry, with r = 0.947 correlation between DixitBench and DixitArena.
- Sharper headline result: 78% of storyteller rounds yield zero points, while listener accuracy reaches 75.6% for the best model — a precise quantification of the generation–discrimination gap.
- Three-fold contribution framing: (i) DixitArena and DixitBench as complementary benchmarks under adversarial conditions, (ii) the identification of a critical storyteller–listener asymmetry, and (iii) extensive analyses showing the gap is structural, not artifactual.
Abstract
We introduce DixitWorld, an evaluation framework for assessing multimodal abductive reasoning in vision-language models (VLMs). DixitWorld has two components:
- DixitArena — a dynamic multi-agent setting in which models alternate between generating cryptic clues (storyteller) and selecting the target image from alternatives (listener).
- DixitBench — a static QA benchmark that isolates the listener task for efficient, controlled assessment.
We find that smaller open-source models often excel as creative storytellers, producing imaginative but less discriminative clues, while larger proprietary models show stronger overall performance — particularly as listeners. Performance on DixitBench strongly correlates with listener results in DixitArena, validating it as a reliable proxy for hypothesis selection. Our findings reveal a key tradeoff between generative creativity and discriminative understanding in multimodal abductive reasoning — a central challenge for developing more balanced and capable vision-language agents.
Recommended citation: Yunxiang Mo, Tianshi Zheng, Qing Zong, Jiayu Liu, Baixuan Xu, Yauwai Yim, Chunkit Chan, Jiaxin Bai, Yangqiu Song. (2026). "DixitWorld: Evaluating Multimodal Abductive Reasoning in Vision-Language Models with Multi-Agent Dixit Gameplay." Proceedings of ACL 2026 (Main Conference).