Recent progress in text-to-image generation has greatly advanced visual fidelity and creativity, but it has also raised the bar on prompt complexity—particularly for encoding intricate spatial relationships. In such cases, achieving satisfactory results often requires multiple sampling attempts. To address this challenge, we introduce a novel method that strengthens the spatial understanding of current image generation models. We first construct the SpatialReward-Dataset with over 80k preference pairs. Building on this dataset, we train SpatialScore, a reward model designed to evaluate the accuracy of spatial relationships in text-to-image generation; it even surpasses leading proprietary models on spatial evaluation. We further demonstrate that this reward model effectively enables online reinforcement learning for complex spatial generation. Extensive experiments across multiple benchmarks show that our specialized reward model yields significant and consistent gains in spatial understanding for image generation.
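The abstract describes training a reward model on preference pairs (a spatially correct image vs. a perturbed one for the same prompt). A standard way to train such a model is a Bradley–Terry pairwise loss; the sketch below is a minimal, assumed illustration of that loss, not the paper's stated objective — the function name and toy scores are hypothetical.

```python
import math

def bradley_terry_loss(scores_preferred, scores_rejected):
    """Pairwise preference loss commonly used for reward models:
    minimize -log sigmoid(margin), where the margin is the score of the
    preferred (spatially correct) image minus the rejected (perturbed) one."""
    losses = [math.log(1.0 + math.exp(-(p - r)))          # -log sigmoid(p - r)
              for p, r in zip(scores_preferred, scores_rejected)]
    return sum(losses) / len(losses)

# Toy scores a reward model might assign to three preference pairs.
loss = bradley_terry_loss([1.2, 0.8, 2.0], [0.3, 1.0, 0.5])
```

Driving this loss down pushes the reward model to score the spatially correct image above its perturbed counterpart, which is exactly the ranking behavior the evaluation tables below measure.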
Overview of the SpatialReward-Dataset construction and training process.
GRPO training pipeline for enhancing spatial understanding.
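GRPO (Group Relative Policy Optimization) computes advantages by normalizing rewards within a group of samples drawn from the same prompt, avoiding a learned value critic. The snippet below is a minimal sketch of that group-normalization step, assuming the rewards come from the SpatialScore model; the function name and toy values are illustrative.

```python
from statistics import mean, pstdev

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages as used in GRPO: subtract the group mean
    and divide by the group std, so samples that beat their own group's
    average get positive advantage."""
    m = mean(rewards)
    s = pstdev(rewards)
    return [(r - m) / (s + eps) for r in rewards]

# Toy reward-model scores for 4 images sampled from one prompt.
adv = grpo_advantages([0.2, 0.9, 0.4, 0.7])
# Advantages sum to ~0 within the group; the best sample gets the largest one.
```

Because normalization is per-group, the policy gradient only depends on which images are spatially better *relative to siblings from the same prompt*, which keeps the reward scale from mattering.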
| Setting | Image Reward | PickScore | HPS v2.1 | VQA Score | Unified Reward | HPS v3 | Qwen2.5-VL 7B | Qwen2.5-VL 32B | Qwen2.5-VL 72B | GPT-5 | Gemini 2.5 Pro | SpatialScore (Ours) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 Pert. | 0.439 | 0.461 | 0.433 | 0.567 | 0.583 | 0.606 | 0.572 | 0.644 | 0.711 | 0.855 | 0.933 | 0.939 |
| 2–3 Pert. | 0.513 | 0.551 | 0.491 | 0.638 | 0.627 | 0.697 | 0.632 | 0.724 | 0.816 | 0.924 | 0.968 | 0.978 |
| Overall | 0.479 | 0.509 | 0.463 | 0.603 | 0.605 | 0.652 | 0.602 | 0.685 | 0.764 | 0.890 | 0.951 | 0.958 |
"1 Pert." and "2–3 Pert." denote subsets where one or two-to-three spatial perturbations were applied to the prompts. Our SpatialScore outperforms both open-source and proprietary models.
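The numbers above are naturally read as pairwise ranking accuracy: the fraction of preference pairs where a scorer ranks the spatially correct image above the perturbed one. The sketch below shows that metric under this assumption; the function and toy scores are hypothetical, not the paper's released evaluation code.

```python
def pairwise_accuracy(scores_correct, scores_perturbed):
    """Fraction of preference pairs where the scorer ranks the spatially
    correct image strictly above its perturbed counterpart."""
    wins = sum(c > p for c, p in zip(scores_correct, scores_perturbed))
    return wins / len(scores_correct)

# Toy example: 4 pairs, 3 of them ranked correctly.
acc = pairwise_accuracy([0.9, 0.8, 0.7, 0.2], [0.1, 0.9, 0.3, 0.1])
# → 0.75
```

Ties count as errors here (strict `>`), which matches the convention that a reward model giving identical scores to both images has failed to discriminate the spatial relation.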
| Method | Spatial Score | DPG-bench (Relation Spatial) | TIIF-short BR | TIIF-short AR | TIIF-short RR | TIIF-long BR | TIIF-long AR | TIIF-long RR | Unibench (S) Lay 2D | Unibench (S) Lay 3D | Unibench (L) Lay 2D | Unibench (L) Lay 3D |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Flux.1-dev | 2.18 | 0.871 | 0.769 | 0.608 | 0.584 | 0.758 | 0.677 | 0.645 | 0.766 | 0.667 | 0.819 | 0.742 |
| Flow-GRPO* | 3.01 | 0.742 | 0.851 | 0.652 | 0.621 | 0.577 | 0.510 | 0.482 | 0.726 | 0.635 | 0.445 | 0.405 |
| Ours | 7.81 | 0.932 | 0.875 | 0.700 | 0.647 | 0.845 | 0.715 | 0.675 | 0.875 | 0.773 | 0.891 | 0.801 |
* denotes training with GenEval as the reward model. BR, AR, and RR denote basic relation, attribute+relation, and relation+reasoning. (S)/(L) denote short/long prompts.