To improve qualitative and quantitative evaluation in text-guided image editing, we introduce EditBench, a systematic benchmark. EditBench evaluates inpainting edits on natural and generated images, spanning objects, attributes, and scenes. It is curated to capture a wide variety of language, image types, and levels of difficulty.
Each EditBench example consists of:
- A masked input image
- An input text prompt
- A high-quality output image that serves as a reference for automatic metrics
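The three components above can be sketched as a simple record type. This is illustrative only: the field names and file paths are assumptions, not the released schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EditBenchExample:
    """One EditBench example (field names are illustrative, not the
    released schema): a masked input image, an edit prompt, and a
    high-quality reference output for automatic metrics."""
    masked_image_path: str     # input image with the edit region masked out
    prompt: str                # text prompt describing the desired edit
    reference_image_path: str  # ground-truth output used by automatic metrics


# Hypothetical example values, for illustration only.
example = EditBenchExample(
    masked_image_path="images/0001_masked.png",
    prompt="a golden retriever wearing a red scarf",
    reference_image_path="images/0001_reference.png",
)
```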
To provide insight into the relative strengths and weaknesses of different models, edit prompts are categorized along three axes: attributes (material, color, shape, size, count), objects (common, rare, text rendering), and scenes (indoor, outdoor, realistic, paintings).
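The three axes and their categories can be captured in a small lookup table; the category names come from the benchmark description above, while the dictionary layout and helper function are only a sketch.

```python
# The three categorization axes of EditBench edit prompts, as described
# in the benchmark; the data structure itself is illustrative.
EDITBENCH_AXES = {
    "attributes": ["material", "color", "shape", "size", "count"],
    "objects": ["common", "rare", "text rendering"],
    "scenes": ["indoor", "outdoor", "realistic", "paintings"],
}


def categories(axis: str) -> list[str]:
    """Return the categories for one axis, e.g. categories('objects')."""
    return EDITBENCH_AXES[axis]
```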
Related experimental findings:
Through extensive human evaluation on EditBench, we find that object masking during training leads to across-the-board improvements in text-image alignment, such that Imagen Editor (our own model) is preferred over DALL-E 2 and Stable Diffusion. As a cohort, these models are better at object rendering than text rendering, and handle material/color/size attributes better than count/shape attributes. See our website for more details about this research.
The data is packaged as a single downloadable archive, editbench.tar.gz, which contains the images, the annotations, and a README detailing the contents.
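After downloading, the archive can be unpacked with the standard library; the archive name is from the release, but the internal layout is not specified here, so the snippet simply extracts and lists whatever files are present (the bundled README is the authoritative description).

```python
import tarfile
from pathlib import Path


def extract_editbench(archive: str = "editbench.tar.gz",
                      dest: str = "editbench") -> list[Path]:
    """Extract the EditBench archive into `dest` and return the
    extracted file paths. The destination directory name is an
    assumption, not part of the release."""
    dest_dir = Path(dest)
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(dest_dir)
    return sorted(p for p in dest_dir.rglob("*") if p.is_file())


# Only attempt extraction if the archive has already been downloaded.
if Path("editbench.tar.gz").exists():
    for path in extract_editbench():
        print(path)
```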