PhysisForcing

Physics Reinforced World Simulator for Robotic Manipulation

Peking University   ·   NVIDIA

* Equal Contribution    Co-Project Lead    Corresponding Author


An Introductory Demo Video Showcasing Our Work

Abstract

Video generation models have emerged as a promising paradigm for embodied world simulation. However, both general-domain video generators and robot-specific fine-tuned models can still produce physically implausible manipulations — including discontinuous motion trajectories and inconsistent robot–object interactions — which limits their reliability as world simulators. Through extensive experiments, we find that such physical instability mainly arises from two factors: deformation of moving objects and implausible spatio-temporal correlations among interacting entities. Specifically, generated motion trajectories often exhibit severe object deformation, while the physical relations between objects — particularly during interactions — frequently violate real-world dynamics.

Building on this observation, we propose PhysisForcing, a scalable training framework that strengthens physical consistency by focusing supervision on physics-informative regions through joint optimization of pixel-level and semantic-level features. The framework consists of a pixel-level trajectory alignment loss, which supervises DiT features using reference point trajectories, and a semantic-level relational alignment loss, which aligns DiT features with inter-region relations from a frozen video understanding encoder. Extensive experiments on R-Bench, PAI-Bench, and EZS-Bench show that PhysisForcing consistently improves embodied video generation over strong baselines. On R-Bench, it yields relative gains of +22.3% and +9.2% over the Wan2.2-I2V-A14B and Cosmos3-Nano base models, respectively (+7.1% and +3.7% over vanilla finetuning), with the Cosmos3-Nano variant attaining the best overall score. Beyond generation, when used as a world model under the WorldArena action-planner protocol, PhysisForcing raises the closed-loop success rate from 16.0% to 24.0% and further improves downstream policy success, indicating that physically aligned video models yield stronger representations for robotic manipulation.

Overview of PhysisForcing

Region-focused hierarchical physics alignment injects pixel-level motion consistency and semantic-level relational consistency into video generation training, producing manipulation videos that are both visually realistic and physically plausible.

Teaser Figure

Performance at a Glance

Applied as a training-time framework on standard diffusion video backbones, PhysisForcing improves physical plausibility over the corresponding base model on every benchmark we evaluate, from embodied video generation to downstream world-action modeling.

Percentages are relative gains of PF-Wan14B over the finetuned Wan2.2-A14B baseline (video benchmarks) and of PF-Wan5B over the finetuned Wan2.2-5B baseline (WorldArena IDM).
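As a small worked example of what "relative gain" means here, the sketch below recomputes the R-Bench percentages from the overall averages reported in the results tables on this page. The helper name `relative_gain` is ours and purely illustrative; only the input scores come from the page.

```python
def relative_gain(baseline: float, improved: float) -> float:
    """Return the improvement of `improved` over `baseline`, in percent of the baseline."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (improved - baseline) / baseline * 100.0


# R-Bench overall averages as reported in the R-Bench table on this page.
wan_base, wan_ft, pf_wan = 50.7, 57.9, 62.0
cosmos_ft, pf_cosmos = 61.5, 63.8

gain_wan_over_base = relative_gain(wan_base, pf_wan)
gain_wan_over_ft = relative_gain(wan_ft, pf_wan)
gain_cosmos_over_ft = relative_gain(cosmos_ft, pf_cosmos)

print(f"PF-Wan14B vs. Wan2.2-A14B:      {gain_wan_over_base:+.1f}%")
print(f"PF-Wan14B vs. Wan2.2-A14B (ft): {gain_wan_over_ft:+.1f}%")
print(f"PF-Cosmos vs. Cosmos 3-nano (ft): {gain_cosmos_over_ft:+.1f}%")
```

These values round to the +22.3%, +7.1%, and +3.7% figures quoted in the abstract, which suggests the headline numbers are relative gains rather than absolute point differences.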

Key Contributions

Hierarchical Formulation

We cast physical plausibility as a hierarchical problem and align both pixel-level point trajectories and semantic-level inter-region relations on DiT features.

Region-Focused Supervision

A depth-aware motion mask localizes physics-informative regions, concentrating supervision on manipulators, objects, and contacts rather than all pixels uniformly.

No Additional Inference Overhead

All auxiliary models are used only during training and discarded afterwards, so the method adds no inference cost. The resulting models also strengthen downstream policy learning.

Method Overview

PhysisForcing first identifies physics-informative regions where robot–object interactions occur, then applies two complementary training signals to DiT features.
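The page does not spell out how the depth-aware motion mask is computed, so the following is only a guess at one plausible form: a pixel counts as physics-informative when it moves noticeably and lies within the near workspace. The function name, the thresholds `motion_thresh` and `max_depth`, and the use of per-pixel motion magnitude and depth grids are all assumptions made for illustration.

```python
from typing import List

Grid = List[List[float]]


def depth_aware_motion_mask(
    motion: Grid,
    depth: Grid,
    motion_thresh: float = 1.0,  # hypothetical threshold on motion magnitude
    max_depth: float = 1.5,      # hypothetical cutoff separating workspace from background
) -> List[List[bool]]:
    """Flag pixels that move more than `motion_thresh` and are closer than `max_depth`."""
    if len(motion) != len(depth) or any(
        len(m_row) != len(d_row) for m_row, d_row in zip(motion, depth)
    ):
        raise ValueError("motion and depth grids must have the same shape")
    return [
        [m > motion_thresh and d < max_depth for m, d in zip(m_row, d_row)]
        for m_row, d_row in zip(motion, depth)
    ]


# Toy 2x3 frame: the top-right pixel moves a lot but is far away (background),
# so it is excluded even though its motion is large.
motion = [[0.1, 2.0, 3.0], [0.0, 2.5, 0.2]]
depth = [[0.8, 0.9, 4.0], [0.7, 1.0, 0.9]]
mask = depth_aware_motion_mask(motion, depth)
print(mask)
```

The point of combining the two cues in this sketch is that motion alone would also select moving background, while depth alone would select the static tabletop.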

Method Architecture

Pixel-level Physics Alignment

Using point tracking (CoTracker3), we supervise the per-point trajectories implied by the DiT feature against dense reference trajectories. A masked MSE over predicted and reference coordinates keeps local motion continuous and contact-compatible on the manipulator and the manipulated object.
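A minimal sketch of a masked MSE over trajectories, written in plain Python so it runs without dependencies. It assumes predicted and reference trajectories are given as per-frame lists of (x, y) points and that the mask holds a 0/1 weight per point; how the predicted coordinates are read out of the DiT feature, and the exact normalization used by the authors, are not specified on this page.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Trajectories = List[List[Point]]  # indexed as [frame][point]


def masked_trajectory_mse(
    pred: Trajectories, ref: Trajectories, mask: List[List[float]]
) -> float:
    """Mean squared Euclidean error over points whose mask weight is non-zero."""
    if not (len(pred) == len(ref) == len(mask)):
        raise ValueError("pred, ref and mask must cover the same number of frames")
    total = 0.0
    weight = 0.0
    for pred_t, ref_t, mask_t in zip(pred, ref, mask):
        if not (len(pred_t) == len(ref_t) == len(mask_t)):
            raise ValueError("each frame must contain the same number of points")
        for (px, py), (rx, ry), m in zip(pred_t, ref_t, mask_t):
            total += m * ((px - rx) ** 2 + (py - ry) ** 2)
            weight += m
    # With an empty mask there is nothing to supervise, so the loss is zero.
    return total / weight if weight > 0 else 0.0


# Two frames, two tracked points. The last point has a large error but is masked out.
pred = [[(0.0, 0.0), (1.0, 1.0)], [(1.0, 0.0), (5.0, 5.0)]]
ref = [[(0.0, 0.0), (1.0, 2.0)], [(2.0, 0.0), (0.0, 0.0)]]
mask = [[1.0, 1.0], [1.0, 0.0]]
loss = masked_trajectory_mse(pred, ref, mask)
print(loss)
```

In this toy case the three unmasked points contribute squared errors of 0, 1, and 1, so the loss is their mean; the masked point does not affect it.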

Semantic-level Physics Alignment

We align the token-to-token similarity matrix of physics-informative tokens between the DiT side and a frozen self-supervised video understanding encoder. Transferring the encoder's relational structure encourages globally consistent interactions — e.g., a grasped object stays coupled with the gripper, and a pushed object moves away.
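One way to picture the relational alignment is sketched below, under the assumption that the similarity is cosine similarity and the matching loss is a mean squared difference between the two matrices; the page names neither choice. The function names are illustrative only.

```python
import math
from typing import List

Vector = List[float]


def cosine_similarity_matrix(tokens: List[Vector]) -> List[List[float]]:
    """Return the N x N matrix of pairwise cosine similarities between tokens."""
    # Fall back to a norm of 1.0 for all-zero tokens to avoid division by zero.
    norms = [math.sqrt(sum(x * x for x in t)) or 1.0 for t in tokens]
    return [
        [sum(a * b for a, b in zip(ti, tj)) / (ni * nj) for tj, nj in zip(tokens, norms)]
        for ti, ni in zip(tokens, norms)
    ]


def relational_alignment_loss(
    dit_tokens: List[Vector], encoder_tokens: List[Vector]
) -> float:
    """Mean squared difference between the two token-to-token similarity matrices."""
    if len(dit_tokens) != len(encoder_tokens):
        raise ValueError("both sides must provide the same number of tokens")
    s_dit = cosine_similarity_matrix(dit_tokens)
    s_enc = cosine_similarity_matrix(encoder_tokens)
    n = len(dit_tokens)
    return sum(
        (s_dit[i][j] - s_enc[i][j]) ** 2 for i in range(n) for j in range(n)
    ) / (n * n)


# The two sides may use different feature widths (2 vs. 3 here): only the
# N x N relational structure is compared, never the raw features.
dit_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
same_structure = [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0], [1.0, 1.0, 0.0]]
different_structure = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

loss_same = relational_alignment_loss(dit_tokens, same_structure)
loss_diff = relational_alignment_loss(dit_tokens, different_structure)
print(loss_same, loss_diff)
```

Because the loss depends only on pairwise relations, it is near zero when the two token sets have the same angular structure and grows when the encoder groups the tokens differently.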

From Implausible to Plausible

Given the same input image and prompt, vanilla finetuning still leaves physical violations — unstable grasps, object drift, broken contact. Adding PhysisForcing restores physically plausible motion. We show this on two backbones below.

Qualitative Comparison

We compare PhysisForcing against five strong video generators on identical inputs, followed by a gallery of additional generations.

Video Generation · Cross-Embodiment Generalization

Video Generation · Cross-Task Generalization

More Results from PhysisForcing

Quantitative Results — Video Generation

We apply PhysisForcing to two backbones, Wan2.2-I2V-A14B and Cosmos 3-nano, and evaluate on three embodied video benchmarks. Each table reports the finetuned (ft) baseline and the corresponding PhysisForcing (PF) result alongside strong external baselines.

R-Bench: Tasks / Embodiments / overall Avg.

Model | Tasks | Emb. | Avg.
Veo 3.1 | 49.9 | 64.3 | 56.3
Hailuo v2 | 52.0 | 62.0 | 56.5
Cosmos 3-super | 51.2 | 66.7 | 58.1
Seedance 1.5 Pro | 51.9 | 66.5 | 58.4
Wan 2.6 | 54.5 | 68.4 | 60.7
Abot-PhysWorld | 48.9 | 57.9 | 52.9
Wan2.2-A14B | 40.8 | 63.2 | 50.7
Wan2.2-A14B (ft) | 52.5 | 64.7 | 57.9
PF-Wan14B | 56.3 | 69.0 | 62.0
Cosmos 3-nano (ft) | 55.4 | 69.1 | 61.5
PF-Cosmos | 58.2 | 71.0 | 63.8

PAI-Bench (robot): Quality / Domain / overall Avg.

Model | Quality | Domain | Avg.
Wan 2.5 | 75.48 | 86.44 | 80.96
GigaWorld-0 | 75.91 | 85.83 | 80.87
Veo 3.1 | 77.40 | 83.50 | 80.45
WoW-Wan 14B | 76.05 | 83.01 | 79.53
Sora v2 Pro | 76.79 | 76.26 | 76.52
Abot-PhysWorld | 76.76 | 93.06 | 84.91
Wan2.2-A14B | 76.15 | 81.70 | 78.93
Wan2.2-A14B (ft) | 75.38 | 84.42 | 79.90
PF-Wan14B | 76.26 | 88.20 | 81.73
Cosmos 3-nano (ft) | 76.52 | 91.54 | 84.03
PF-Cosmos | 77.08 | 93.26 | 85.17

EZS-Bench: Quality / Domain / overall Avg.

Model | Quality | Domain | Avg.
WoW-Wan 14B | 76.09 | 79.51 | 77.80
GigaWorld-0 | 72.72 | 78.26 | 75.49
Cosmos-Predict 2.5 | 70.89 | 76.98 | 73.94
UnifoLM-WMA-0 | 73.55 | 52.32 | 62.94
Kling 2.6-Pro | 78.05 | 80.72 | 79.39
Abot-PhysWorld | 76.94 | 83.66 | 80.30
Wan2.2-A14B | 76.89 | 77.42 | 77.16
Wan2.2-A14B (ft) | 76.12 | 81.95 | 79.04
PF-Wan14B | 76.58 | 84.49 | 80.54
Cosmos 3-nano (ft) | 77.42 | 83.16 | 80.29
PF-Cosmos | 76.95 | 85.20 | 81.08

Quantitative Results — Policy Learning

We use PhysisForcing-trained models as video backbones for world-action modeling and report closed-loop success on RoboTwin 2.0 (with the Fast-WAM backbone) and on the WorldArena action-planner (IDM) protocol (with Wan2.2-5B).

RoboTwin 2.0 (200 rollouts) · Fast-WAM backbone · success rate

Task | Fast-WAM | + PF | Δ
place_empty_cup | 41.5% | 63.0% | +21.5%
press_stapler | 49.0% | 60.0% | +11.0%
grab_roller | 58.5% | 63.0% | +4.5%
shake_bottle | 97.5% | 94.5% | −3.0%
adjust_bottle | 93.0% | 93.0% | 0.0%
stack_bowls_two | 69.5% | 63.0% | −6.5%
Average | 68.2% | 72.8% | +4.6%
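
The Average row of the RoboTwin 2.0 table above is consistent with an unweighted mean over the six tasks, and the Δ column with an absolute difference in percentage points (not a relative gain). The short check below recomputes both from the per-task rates; the variable names are ours, and equal task weighting is an assumption that happens to match the reported numbers.

```python
# Per-task success rates (%) copied from the RoboTwin 2.0 table: (Fast-WAM, + PF).
results = {
    "place_empty_cup": (41.5, 63.0),
    "press_stapler": (49.0, 60.0),
    "grab_roller": (58.5, 63.0),
    "shake_bottle": (97.5, 94.5),
    "adjust_bottle": (93.0, 93.0),
    "stack_bowls_two": (69.5, 63.0),
}

n_tasks = len(results)
avg_base = sum(base for base, _ in results.values()) / n_tasks
avg_pf = sum(pf for _, pf in results.values()) / n_tasks

# Delta is an absolute difference in percentage points.
deltas = {task: pf - base for task, (base, pf) in results.items()}
improved = sorted(task for task, delta in deltas.items() if delta > 0)

print(f"Fast-WAM average: {avg_base:.2f}%")
print(f"+ PF average:     {avg_pf:.2f}%")
print(f"Improved tasks:   {improved}")
```

Three of the six tasks improve, one is unchanged, and two regress, so the average gain is driven mainly by place_empty_cup and press_stapler.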

WorldArena · Action Planner (IDM) · closed-loop success rate

Model | Task 1 | Task 2 | Avg.
Genie Envisioner | 10.0% | 20.0% | 15.0%
TesserAct | 1.0% | 35.0% | 18.0%
RoboMaster | 8.0% | 20.0% | 14.0%
Vidar | 2.0% | 19.0% | 10.5%
WoW | 20.0% | 21.0% | 20.5%
Wan2.2-5B (base) | 12.0% | 20.0% | 16.0%
PF-Wan5B | 22.0% | 26.0% | 24.0%

Ethics Concerns

All videos featured in these demos are either generated by models or sourced from publicly available datasets, and are intended solely for the purpose of demonstrating the technical capabilities of our research. If you believe any content infringes upon rights or raises ethical concerns, please contact us and we will address the issue and remove the material promptly.

BibTeX

@article{physisforcing2026,
  title   = {PhysisForcing: Physics Reinforced World Simulator for Robotic Manipulation},
  author  = {Peiwen Zhang and Yufan Deng and Shangkun Sun and Juncheng Ma and Duomin Wang and Jonas Du and Zilin Pan and Ye Huang and Hao Liang and Songyan Huang and Ruihua Zhang and Enze Xie and Ming-Yu Liu and Daquan Zhou},
  journal = {arXiv preprint arXiv:2606.28128},
  year    = {2026}
}