Abstract
Many modern retrieval problems are set-valued: given a broad intent, the system must return a collection of results that optimizes higher-order properties (e.g., diversity, coverage, complementarity, coherence) while staying grounded in a fixed database. These objectives are inherently non-decomposable over individual items, which creates a training bottleneck because property-aligned (query, content) supervision is scarce. Reinforcement learning (RL) can optimize set-level objectives via interaction, but deploying an RL-tuned LLM for fan-out retrieval is expensive at query time. Diffusion-based generative retrieval enables efficient single-pass fan-out in embedding space, but it requires objective-aligned training targets. We propose R4T (Retrieve-for-Train), which uses RL once as an objective transducer. R4T (i) trains a fan-out LLM with composite set-level rewards, (ii) synthesizes objective-consistent training pairs, and (iii) trains a lightweight diffusion retriever to model the conditional distribution of set-valued outputs. On Polyvore and a large-scale music playlist dataset, R4T improves retrieval quality over strong baselines while reducing query-time fan-out latency by an order of magnitude.
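To make the notion of a non-decomposable, set-level objective concrete, the following minimal sketch scores a retrieved set by combining mean query relevance with mean pairwise diversity. The function name `set_reward`, the weights `w_rel` and `w_div`, and the choice of cosine-based terms are assumptions made for illustration only; they are not the composite reward used by R4T.

```python
import numpy as np


def _unit(x):
    """L2-normalize vectors along the last axis."""
    x = np.asarray(x, dtype=float)
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def set_reward(query, items, w_rel=0.5, w_div=0.5):
    """Hypothetical composite reward for a retrieved set.

    query: (d,) embedding; items: (n, d) embeddings of the returned set.
    The relevance term averages per-item scores, but the diversity term
    depends on pairs of items, so the total is not a sum over items.
    """
    q = _unit(query)
    X = _unit(items)
    relevance = float(np.mean(X @ q))  # mean cosine similarity to the query

    n = X.shape[0]
    if n < 2:
        diversity = 0.0  # a single item has no pairwise diversity
    else:
        sims = X @ X.T
        upper = np.triu_indices(n, k=1)  # each unordered pair once
        diversity = float(np.mean(1.0 - sims[upper]))  # mean cosine distance

    return w_rel * relevance + w_div * diversity


# Two candidate items that are equally relevant to the query.
query = np.array([1.0, 1.0])
a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])

redundant = set_reward(query, np.stack([a, a]))  # the same item twice
diverse = set_reward(query, np.stack([a, b]))    # two complementary items
```

Here both sets contain items with identical per-item relevance, yet `diverse` scores higher than `redundant`. No assignment of independent per-item scores reproduces this ordering, which is the sense in which such objectives call for set-level optimization.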