Recent progress in text-to-image generation has greatly advanced visual fidelity and creativity, but it has also raised the bar on prompt complexity—particularly for encoding intricate spatial relationships. In such cases, achieving satisfactory results often requires multiple sampling attempts. To address this challenge, we introduce a novel method that strengthens the spatial understanding of current image generation models. We first construct the SpatialReward-Dataset with over 80k preference pairs. Building on this dataset, we train SpatialScore, a reward model designed to evaluate the accuracy of spatial relationships in text-to-image generation; it even surpasses leading proprietary models on spatial evaluation. We further demonstrate that this reward model effectively enables online reinforcement learning for complex spatial generation. Extensive experiments across multiple benchmarks show that our specialized reward model yields significant and consistent gains in spatial understanding for image generation.
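The abstract describes training a reward model on preference pairs (a spatially correct image vs. a perturbed one for the same prompt). A standard way to train such a model is a Bradley–Terry pairwise loss; the sketch below is a minimal, assumed illustration of that loss, not the paper's stated objective — the function name and toy scores are hypothetical.

```python
import math

def bradley_terry_loss(scores_preferred, scores_rejected):
    """Pairwise preference loss commonly used for reward models:
    minimize -log sigmoid(margin), where the margin is the score of the
    preferred (spatially correct) image minus the rejected (perturbed) one."""
    losses = [math.log(1.0 + math.exp(-(p - r)))          # -log sigmoid(p - r)
              for p, r in zip(scores_preferred, scores_rejected)]
    return sum(losses) / len(losses)

# Toy scores a reward model might assign to three preference pairs.
loss = bradley_terry_loss([1.2, 0.8, 2.0], [0.3, 1.0, 0.5])
```

Driving this loss down pushes the reward model to score the spatially correct image above its perturbed counterpart, which is exactly the ranking behavior the evaluation tables below measure.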
Overview of the SpatialReward-Dataset construction and training process.
GRPO training pipeline for enhancing spatial understanding.
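GRPO (Group Relative Policy Optimization) computes advantages by normalizing rewards within a group of samples drawn from the same prompt, avoiding a learned value critic. The snippet below is a minimal sketch of that group-normalization step, assuming the rewards come from the SpatialScore model; the function name and toy values are illustrative.

```python
from statistics import mean, pstdev

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages as used in GRPO: subtract the group mean
    and divide by the group std, so samples that beat their own group's
    average get positive advantage."""
    m = mean(rewards)
    s = pstdev(rewards)
    return [(r - m) / (s + eps) for r in rewards]

# Toy reward-model scores for 4 images sampled from one prompt.
adv = grpo_advantages([0.2, 0.9, 0.4, 0.7])
# Advantages sum to ~0 within the group; the best sample gets the largest one.
```

Because normalization is per-group, the policy gradient only depends on which images are spatially better *relative to siblings from the same prompt*, which keeps the reward scale from mattering.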
| Setting | Image Reward | PickScore | HPS v2.1 | VQA Score | Unified Reward | HPS v3 | Qwen2.5-VL 7B | Qwen2.5-VL 32B | Qwen2.5-VL 72B | GPT-5 | Gemini 2.5 Pro | SpatialScore (Ours) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 Pert. | 0.439 | 0.461 | 0.433 | 0.567 | 0.583 | 0.606 | 0.572 | 0.644 | 0.711 | 0.855 | 0.933 | 0.939 |
| 2–3 Pert. | 0.513 | 0.551 | 0.491 | 0.638 | 0.627 | 0.697 | 0.632 | 0.724 | 0.816 | 0.924 | 0.968 | 0.978 |
| Overall | 0.479 | 0.509 | 0.463 | 0.603 | 0.605 | 0.652 | 0.602 | 0.685 | 0.764 | 0.890 | 0.951 | 0.958 |
"1 Pert." and "2–3 Pert." denote subsets where one or two-to-three spatial perturbations were applied to the prompts. Our SpatialScore outperforms both open-source and proprietary models.
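The numbers above are naturally read as pairwise ranking accuracy: the fraction of preference pairs where a scorer ranks the spatially correct image above the perturbed one. The sketch below shows that metric under this assumption; the function and toy scores are hypothetical, not the paper's released evaluation code.

```python
def pairwise_accuracy(scores_correct, scores_perturbed):
    """Fraction of preference pairs where the scorer ranks the spatially
    correct image strictly above its perturbed counterpart."""
    wins = sum(c > p for c, p in zip(scores_correct, scores_perturbed))
    return wins / len(scores_correct)

# Toy example: 4 pairs, 3 of them ranked correctly.
acc = pairwise_accuracy([0.9, 0.8, 0.7, 0.2], [0.1, 0.9, 0.3, 0.1])
# → 0.75
```

Ties count as errors here (strict `>`), which matches the convention that a reward model giving identical scores to both images has failed to discriminate the spatial relation.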
| Method | Spatial Score | DPG-bench (Relation Spatial) | TIIF-short BR | TIIF-short AR | TIIF-short RR | TIIF-long BR | TIIF-long AR | TIIF-long RR | Unibench (S) Lay 2D | Unibench (S) Lay 3D | Unibench (L) Lay 2D | Unibench (L) Lay 3D |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Flux.1-dev | 2.18 | 0.871 | 0.769 | 0.608 | 0.584 | 0.758 | 0.677 | 0.645 | 0.766 | 0.667 | 0.819 | 0.742 |
| Flow-GRPO* | 3.01 | 0.742 | 0.851 | 0.652 | 0.621 | 0.577 | 0.510 | 0.482 | 0.726 | 0.635 | 0.445 | 0.405 |
| Ours | 7.81 | 0.932 | 0.875 | 0.700 | 0.647 | 0.845 | 0.715 | 0.675 | 0.875 | 0.773 | 0.891 | 0.801 |
* denotes training with GenEval as the reward model. BR, AR, and RR denote basic relation, attribute+relation, and relation+reasoning. (S)/(L) denote short/long prompts.