DixitWorld: Evaluating Multimodal Abductive Reasoning in Vision-Language Models with Multi-Agent Dixit Gameplay

Published in EMNLP 2025 Workshop (Spotlight), 2025

Authors: Yunxiang Mo, Tianshi Zheng, Qing Zong, Jiayu Liu, Baixuan Xu, Yauwai Yim, Chunkit Chan, Jiaxin Bai, Yangqiu Song.

An extended version was later accepted to ACL 2026 Main Conference.

Links: [arXiv]

Abstract

We introduce DixitWorld, an evaluation framework for assessing multimodal abductive reasoning in vision-language models (VLMs). DixitWorld has two components:

  • DixitArena — a dynamic multi-agent setting in which models alternate between generating cryptic clues (storyteller) and selecting the target image from alternatives (listener).
  • DixitBench — a static benchmark of 168 questions with Easy and Hard splits that isolates the listener task for controlled assessment.

We find that smaller open-source models often excel as creative storytellers — producing imaginative but less discriminative clues — while larger proprietary models perform more strongly overall. These results expose a fundamental tradeoff between generative creativity and discriminative understanding in multimodal reasoning.

Recommended citation: Yunxiang Mo, Tianshi Zheng, Qing Zong, Jiayu Liu, Baixuan Xu, Yauwai Yim, Chunkit Chan, Jiaxin Bai, Yangqiu Song. (2025). "DixitWorld: Evaluating Multimodal Abductive Reasoning in Vision-Language Models with Multi-Agent Dixit Gameplay." EMNLP 2025 Workshop (Spotlight).
Download Paper