SpatialT2I: Enhancing Spatial Understanding in Image Generation via Reward Modeling

1Peking University | 2ByteDance Seed
* Equal Contribution † Corresponding Author

Abstract

Recent progress in text-to-image generation has greatly advanced visual fidelity and creativity, but it has also raised the bar on prompt complexity, particularly for encoding intricate spatial relationships. In such cases, achieving satisfactory results often requires multiple sampling attempts. To address this challenge, we introduce a novel method that strengthens the spatial understanding of current image generation models. We first construct the SpatialReward-Dataset, comprising over 80k preference pairs. Building on this dataset, we train SpatialScore, a reward model designed to evaluate the accuracy of spatial relationships in text-to-image generation; it surpasses even leading proprietary models on spatial evaluation. We further demonstrate that this reward model effectively enables online reinforcement learning for complex spatial generation. Extensive experiments across multiple benchmarks show that our specialized reward model yields significant and consistent gains in spatial understanding for image generation.

Methodology Pipeline

Figure: Overview of the SpatialReward-Dataset construction and training process.

Figure: GRPO training pipeline for enhancing spatial understanding.
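As a rough illustration of the GRPO-style training loop, the key step is scoring a group of images sampled for the same prompt with the reward model and normalizing the rewards within the group to obtain relative advantages. The sketch below is a minimal, hypothetical illustration of that normalization, not the authors' implementation; the function name and group size are our own choices.

```python
from statistics import mean, stdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize a group of per-sample rewards to zero mean and
    (approximately) unit standard deviation, as in GRPO-style updates.
    `eps` guards against division by zero when all rewards are equal."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: four images sampled for one prompt, each scored by the
# spatial reward model; the best-scoring image gets the largest advantage.
advantages = group_relative_advantages([0.2, 0.9, 0.5, 0.4])
```

Because the advantage is relative within the group, the policy is pushed toward samples that beat their siblings on spatial accuracy rather than toward an absolute reward threshold.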

Experimental Results

Table 1: Pairwise-accuracy comparisons on the reward evaluation benchmark

| Setting | ImageReward | PickScore | HPS v2.1 | VQAScore | UnifiedReward | HPS v3 | Qwen2.5-VL 7B | Qwen2.5-VL 32B | Qwen2.5-VL 72B | GPT-5 | Gemini 2.5 Pro | SpatialScore (Ours) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 Pert. | 0.439 | 0.461 | 0.433 | 0.567 | 0.583 | 0.606 | 0.572 | 0.644 | 0.711 | 0.855 | 0.933 | 0.939 |
| 2–3 Pert. | 0.513 | 0.551 | 0.491 | 0.638 | 0.627 | 0.697 | 0.632 | 0.724 | 0.816 | 0.924 | 0.968 | 0.978 |
| Overall | 0.479 | 0.509 | 0.463 | 0.603 | 0.605 | 0.652 | 0.602 | 0.685 | 0.764 | 0.890 | 0.951 | 0.958 |

"1 Pert." and "2–3 Pert." denote subsets with one or with two to three spatial perturbations applied to the prompts. Our SpatialScore outperforms both open-source and proprietary models.
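Pairwise accuracy here measures how often a reward model prefers the image that actually matches the spatial prompt over its perturbed counterpart. A minimal sketch of that metric, assuming each pair is already scored by the model under test (the function and data layout are illustrative, not taken from the paper):

```python
def pairwise_accuracy(scored_pairs):
    """scored_pairs: iterable of (score_chosen, score_rejected) tuples,
    where 'chosen' is the image consistent with the spatial prompt.
    A pair counts as correct only if the chosen image scores strictly
    higher; ties count as incorrect."""
    pairs = list(scored_pairs)
    correct = sum(1 for chosen, rejected in pairs if chosen > rejected)
    return correct / len(pairs)

# Example: four preference pairs scored by some reward model.
acc = pairwise_accuracy([(0.9, 0.1), (0.4, 0.6), (0.7, 0.2), (0.8, 0.8)])
```

Under this convention a random scorer sits near 0.5, which is why several baselines in Table 1 hover around chance on the hardest subset.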

Table 2: Comparisons on generation benchmarks

| Method | SpatialScore | DPG-bench (Spatial Relation) | TIIF-short BR | TIIF-short AR | TIIF-short RR | TIIF-long BR | TIIF-long AR | TIIF-long RR | Unibench (S) Lay2D | Unibench (S) Lay3D | Unibench (L) Lay2D | Unibench (L) Lay3D |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Flux.1-dev | 2.18 | 0.871 | 0.769 | 0.608 | 0.584 | 0.758 | 0.677 | 0.645 | 0.766 | 0.667 | 0.819 | 0.742 |
| Flow-GRPO* | 3.01 | 0.742 | 0.851 | 0.652 | 0.621 | 0.577 | 0.510 | 0.482 | 0.726 | 0.635 | 0.445 | 0.405 |
| Ours | 7.81 | 0.932 | 0.875 | 0.700 | 0.647 | 0.845 | 0.715 | 0.675 | 0.875 | 0.773 | 0.891 | 0.801 |

* denotes training with GenEval as the reward model. BR, AR, and RR denote basic relation, attribute + relation, and relation + reasoning, respectively. (S)/(L) denote short/long prompts.

@article{tang2025enhancing,
  title={Enhancing Spatial Understanding in Image Generation via Reward Modeling},
  author={Tang, Zhenyu and Feng, Chaoran and others},
  journal={arXiv preprint},
  year={2025}
}