ConvApparel: A Benchmark Dataset and Validation Framework for User Simulators in Conversational Recommenders

Ofer Meshi

Krisztian Balog

Sally Goldman

Avi Caciularu

Guy Tennenholtz

Jihwan Jeong

Amir Globerson

Craig Boutilier

Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (EACL-26), Rabat, Morocco (2026), pp. 5270-5304

Abstract

LLM-based user simulators are a scalable solution for improving conversational AI, but a critical realism gap undermines their effectiveness. To close this gap, we introduce a framework for building and validating high-fidelity simulators. We present a novel dataset of human-AI shopping conversations designed to capture a wide spectrum of user experiences. To measure fidelity, we propose a hybrid evaluation protocol that combines statistical alignment with a learned, discriminator-based Human-Likeness Score. Our most sophisticated simulator, trained via reinforcement learning with iterative critique, achieves a significant leap in realism. Critically, we demonstrate through counterfactual validation that our simulator—trained exclusively on optimal interactions—realistically adapts its behavior to suboptimal system responses, mirroring real user reactions and marking a key advance in creating reliable simulators for robust AI development.
