PhysisForcing

Physics Reinforced World Simulator for Robotic Manipulation

Peiwen Zhang^* Yufan Deng^* Shangkun Sun^* Juncheng Ma Duomin Wang^† Jonas Du Zilin Pan Ye Huang Hao Liang Songyan Huang Ruihua Zhang Enze Xie^† Ming-Yu Liu Daquan Zhou^‡

Peking University · NVIDIA

^* Equal Contribution ^† Co-Project Lead ^‡ Corresponding Author

arXiv Code Video

An Introductory Demo Video Showcasing Our Work

Abstract

Video generation models have emerged as a promising paradigm for embodied world simulation. However, both general-domain video generators and robot-specific fine-tuned models can still produce physically implausible manipulations — including discontinuous motion trajectories and inconsistent robot–object interactions — which limits their reliability as world simulators. Through extensive experiments, we find that such physical instability mainly arises from two factors: deformation of moving objects and implausible spatio-temporal correlations among interacting entities. Specifically, generated motion trajectories often exhibit severe object deformation, while the physical relations between objects — particularly during interactions — frequently violate real-world dynamics.

Building on this observation, we propose PhysisForcing, a scalable training framework that strengthens physical consistency by focusing supervision on physics-informative regions through joint optimization of pixel-level and semantic-level features. The framework consists of a pixel-level trajectory alignment loss, which supervises DiT features using reference point trajectories, and a semantic-level relational alignment loss, which aligns DiT features with inter-region relations from a frozen video understanding encoder. Extensive experiments on R-Bench, PAI-Bench, and EZS-Bench show that PhysisForcing consistently improves embodied video generation over strong baselines, lifting the Wan2.2-I2V-A14B and Cosmos3-Nano base models on R-Bench by +22.3% and +9.2% (+7.1% and +3.7% over vanilla finetuning), with the Cosmos3-Nano variant attaining the best overall score. Beyond generation, as a world model under the WorldArena action-planner protocol it raises the closed-loop success rate from 16.0% to 24.0% and further improves downstream policy success, indicating that physically aligned video models yield stronger representations for robotic manipulation.

Overview of PhysisForcing

Region-focused hierarchical physics alignment injects pixel-level motion consistency and semantic-level relational consistency into video generation training, producing manipulation videos that are both visually realistic and physically plausible.

Performance at a Glance

Applied as a training-time framework on standard diffusion video backbones, PhysisForcing improves physical plausibility over the corresponding base model on every benchmark we evaluate, from embodied video generation to downstream world-action modeling.

details ↓

R-Bench

Avg · over ft baseline

Domain · over ft baseline

Domain · over ft baseline

WorldArena Action Planner · IDM

Percentages are relative gains of PF-Wan14B over the finetuned Wan2.2-A14B baseline (video benchmarks) and of PF-Wan5B over the finetuned Wan2.2-5B baseline (WorldArena IDM).

Key Contributions

Hierarchical Formulation

We cast physical plausibility as a hierarchical problem and align both pixel-level point trajectories and semantic-level inter-region relations on the DiT feature.

Region-Focused Supervision

A depth-aware motion mask localizes physics-informative regions, concentrating supervision on manipulators, objects, and contacts rather than all pixels uniformly.

No Additional Inference Overhead

All auxiliary models are used only at training and discarded afterwards, so the method adds no additional inference cost while also strengthening downstream policy learning.

Method Overview

PhysisForcing first identifies physics-informative regions where robot–object interactions occur, then applies two complementary training signals on the DiT feature.

Pixel-level Physics Alignment

Using point tracking (CoTracker3), we supervise the per-point trajectories implied by the DiT feature against dense reference trajectories. A masked MSE over predicted and reference coordinates keeps local motion continuous and contact-compatible on the manipulator and the manipulated object.

Semantic-level Physics Alignment

We align the token-to-token similarity matrix of physics-informative tokens between the DiT side and a frozen self-supervised video understanding encoder. Transferring the encoder's relational structure encourages globally consistent interactions — e.g., a grasped object stays coupled with the gripper, and a pushed object moves away.

From Implausible to Plausible

Given the same input image and prompt, vanilla finetuning still leaves physical violations — unstable grasps, object drift, broken contact. Adding PhysisForcing restores physically plausible motion. We show this on two backbones below.
Hover an input frame to read its full prompt.

Qualitative Comparison

PhysisForcing against five strong video generators on identical inputs, followed by a gallery of additional generations.
Use the arrows or dots to browse cases.

Video Generation · Cross-Embodiment Generalization

Video Generation · Cross-Task Generalization

More Results from PhysisForcing

Quantitative Results — Video Generation

PhysisForcing applied to two backbones — Wan2.2-I2V-A14B and Cosmos 3-nano — on three embodied video benchmarks. Each table reports the finetuned (ft) baseline and the corresponding + PhysisForcing result against strong external baselines.

R-Bench Tasks / Embodiments / overall Avg.

Model	Tasks	Emb.	Avg.
Veo 3.1	49.9	64.3	56.3
Hailuo v2	52.0	62.0	56.5
Cosmos 3-super	51.2	66.7	58.1
Seedance 1.5 Pro	51.9	66.5	58.4
Wan 2.6	54.5	68.4	60.7
Abot-PhysWorld	48.9	57.9	52.9
Wan2.2-A14B	40.8	63.2	50.7
Wan2.2-A14B (ft)	52.5	64.7	57.9
PF-Wan14B	56.3	69.0	62.0
Cosmos 3-nano (ft)	55.4	69.1	61.5
PF-Cosmos	58.2	71.0	63.8

PAI-Bench (robot) Quality / Domain / overall Avg.

Model	Quality	Domain	Avg.
Wan 2.5	75.48	86.44	80.96
GigaWorld-0	75.91	85.83	80.87
Veo 3.1	77.40	83.50	80.45
WoW-Wan 14B	76.05	83.01	79.53
Sora v2 Pro	76.79	76.26	76.52
Abot-PhysWorld	76.76	93.06	84.91
Wan2.2-A14B	76.15	81.70	78.93
Wan2.2-A14B (ft)	75.38	84.42	79.90
PF-Wan14B	76.26	88.20	81.73
Cosmos 3-nano (ft)	76.52	91.54	84.03
PF-Cosmos	77.08	93.26	85.17

EZS-Bench Quality / Domain / overall Avg.

Model	Quality	Domain	Avg.
WoW-Wan 14B	76.09	79.51	77.80
GigaWorld-0	72.72	78.26	75.49
Cosmos-Predict 2.5	70.89	76.98	73.94
UnifoLM-WMA-0	73.55	52.32	62.94
Kling 2.6-Pro	78.05	80.72	79.39
Abot-PhysWorld	76.94	83.66	80.30
Wan2.2-A14B	76.89	77.42	77.16
Wan2.2-A14B (ft)	76.12	81.95	79.04
PF-Wan14B	76.58	84.49	80.54
Cosmos 3-nano (ft)	77.42	83.16	80.29
PF-Cosmos	76.95	85.20	81.08

Quantitative Results — Policy Learning

PhysisForcing as a video backbone for world-action modeling: closed-loop success on RoboTwin 2.0 (Fast-WAM) and on the WorldArena action-planner (IDM) protocol (Wan2.2-5B).

RoboTwin 2.0 (200 rollouts) Fast-WAM backbone · success rate

Task	Fast-WAM	+ PF	Δ
place_empty_cup	41.5%	63.0%	+21.5%
press_stapler	49.0%	60.0%	+11.0%
grab_roller	58.5%	63.0%	+4.5%
shake_bottle	97.5%	94.5%	−3.0%
adjust_bottle	93.0%	93.0%	0.0%
stack_bowls_two	69.5%	63.0%	−6.5%
Average	68.2%	72.8%	+4.6%

WorldArena · Action Planner (IDM) closed-loop success rate

Model	Task 1	Task 2	Avg.
Genie Envisioner	10.0%	20.0%	15.0%
TesserAct	1.0%	35.0%	18.0%
RoboMaster	8.0%	20.0%	14.0%
Vidar	2.0%	19.0%	10.5%
WoW	20.0%	21.0%	20.5%
Wan2.2-5B (base)	12.0%	20.0%	16.0%
PF-Wan5B	22.0%	26.0%	24.0%

Ethics Concerns

All videos featured in these demos are either generated by models or sourced from publicly available datasets, and are intended solely for the purpose of demonstrating the technical capabilities of our research. If you believe any content infringes upon rights or raises ethical concerns, please contact us and we will address the issue and remove the material promptly.

BibTeX

@article{physisforcing2026,
  title   = {PhysisForcing: Physics Reinforced World Simulator for Robotic Manipulation},
  author  = {Peiwen Zhang and Yufan Deng and Shangkun Sun and Juncheng Ma and Duomin Wang and Jonas Du and Zilin Pan and Ye Huang and Hao Liang and Songyan Huang and Ruihua Zhang and Enze Xie and Ming-Yu Liu and Daquan Zhou},
  journal = {arXiv preprint arXiv:2606.28128},
  year    = {2026}}