Rethinking Video Generation Model for the Embodied World
GitHub · Research Paper · Project · Datasets · Benchmark · Leaderboard · Video
An Introductory Demo Video Showcasing Our Work
Abstract
Video generation models have significantly advanced embodied intelligence, unlocking new possibilities for generating diverse robot data that capture perception, reasoning, and action in the physical world. However, synthesizing high-quality videos that accurately reflect real-world robotic interactions remains challenging, and the lack of a standardized benchmark limits fair comparisons and progress. To address this gap, we introduce a comprehensive robotics benchmark, RBench, designed to evaluate robot-oriented video generation across five task domains and four distinct embodiments. It assesses both task-level correctness and visual fidelity through reproducible sub-metrics, including structural consistency, physical plausibility, and action completeness. Evaluation of 25 representative models highlights significant deficiencies in generating physically realistic robot behaviors. Furthermore, the benchmark achieves a Spearman correlation coefficient of 0.96 with human evaluations, validating its effectiveness. While RBench provides the necessary lens to identify these deficiencies, achieving physical realism requires moving beyond evaluation to address the critical shortage of high-quality training data. Driven by these insights, we introduce a refined four-stage data pipeline, resulting in RoVid-X, the largest open-source robotic dataset for video generation with 4 million annotated video clips, covering thousands of tasks and enriched with comprehensive physical property annotations. Collectively, this synergistic ecosystem of evaluation and data establishes a robust foundation for rigorous assessment and scalable training of video models, accelerating the evolution of embodied AI toward general intelligence.
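The abstract reports a Spearman correlation of 0.96 between RBench scores and human evaluations. As a minimal illustration of how such rank agreement is computed (the per-model values below are hypothetical placeholders, not the paper's actual data), a tie-free Spearman coefficient is just the Pearson correlation of the two rank vectors:

```python
def rankdata(xs):
    # Assign ranks 1..n by ascending value (assumes no ties, for simplicity).
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    for r, i in enumerate(order, start=1):
        ranks[i] = float(r)
    return ranks

def spearman(x, y):
    # Spearman rho = Pearson correlation computed on the rank vectors.
    rx, ry = rankdata(x), rankdata(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical benchmark scores and human ratings for five models,
# purely for illustration; the paper's evaluation data is not reproduced here.
bench = [0.607, 0.584, 0.460, 0.344, 0.123]
human = [0.91, 0.88, 0.70, 0.52, 0.20]
print(round(spearman(bench, human), 2))  # 1.0: identical rankings
```

A coefficient near 1.0 means the benchmark orders models almost exactly as human raters do, which is the sense in which the reported 0.96 validates RBench.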
Task-wise Model Comparison
Embodiment-wise Model Comparison
RBench Quantitative Results
Evaluations across task-oriented and embodiment-specific dimensions for 25 models from open-source, commercial, and robotics-specific families. The "Avg." column is the mean over nine indicators: five task dimensions (Manipulation, Spatial, Multi-entity, Long-horizon, Reasoning) and four embodiments (Single arm, Dual arm, Quadruped, Humanoid).
| Models | Rank | Avg. | Manipulation | Spatial | Multi-entity | Long-horizon | Reasoning | Single arm | Dual arm | Quadruped | Humanoid |
|---|---|---|---|---|---|---|---|---|---|---|---|
| *Open-source* | | | | | | | | | | | |
| Wan2.2_A14B | 8 | 0.507 | 0.381 | 0.454 | 0.373 | 0.501 | 0.330 | 0.608 | 0.582 | 0.690 | 0.648 |
| HunyuanVideo 1.5 | 10 | 0.460 | 0.442 | 0.316 | 0.312 | 0.438 | 0.364 | 0.513 | 0.526 | 0.634 | 0.595 |
| LongCat-Video | 11 | 0.437 | 0.372 | 0.310 | 0.220 | 0.384 | 0.186 | 0.586 | 0.576 | 0.681 | 0.621 |
| Wan2.1_14B | 14 | 0.399 | 0.344 | 0.268 | 0.282 | 0.335 | 0.205 | 0.464 | 0.497 | 0.595 | 0.599 |
| LTX-2 | 15 | 0.381 | 0.284 | 0.304 | 0.233 | 0.386 | 0.164 | 0.453 | 0.424 | 0.622 | 0.555 |
| Wan2.2_5B | 16 | 0.380 | 0.331 | 0.313 | 0.142 | 0.318 | 0.234 | 0.436 | 0.448 | 0.590 | 0.607 |
| SkyReels | 18 | 0.361 | 0.203 | 0.276 | 0.203 | 0.254 | 0.234 | 0.507 | 0.477 | 0.586 | 0.509 |
| LTX-Video | 19 | 0.344 | 0.302 | 0.176 | 0.210 | 0.280 | 0.241 | 0.440 | 0.456 | 0.526 | 0.464 |
| FramePack | 20 | 0.339 | 0.206 | 0.258 | 0.173 | 0.169 | 0.170 | 0.440 | 0.464 | 0.626 | 0.548 |
| HunyuanVideo | 21 | 0.303 | 0.177 | 0.180 | 0.108 | 0.147 | 0.035 | 0.454 | 0.480 | 0.625 | 0.524 |
| CogVideoX_5B | 23 | 0.256 | 0.116 | 0.112 | 0.098 | 0.212 | 0.079 | 0.338 | 0.385 | 0.465 | 0.496 |
| *Commercial* | | | | | | | | | | | |
| Wan 2.6 | 1 | 0.607 | 0.546 | 0.656 | 0.479 | 0.514 | 0.531 | 0.666 | 0.681 | 0.723 | 0.667 |
| Seedance 1.5 Pro | 2 | 0.584 | 0.577 | 0.495 | 0.484 | 0.570 | 0.470 | 0.648 | 0.641 | 0.680 | 0.692 |
| Wan 2.5 | 3 | 0.570 | 0.527 | 0.576 | 0.402 | 0.496 | 0.437 | 0.680 | 0.634 | 0.726 | 0.654 |
| Hailuo v2 | 4 | 0.565 | 0.560 | 0.637 | 0.386 | 0.545 | 0.474 | 0.594 | 0.611 | 0.640 | 0.635 |
| Veo 3 | 5 | 0.563 | 0.521 | 0.508 | 0.430 | 0.530 | 0.504 | 0.634 | 0.610 | 0.689 | 0.637 |
| Seedance 1.0 | 6 | 0.551 | 0.542 | 0.425 | 0.448 | 0.454 | 0.442 | 0.622 | 0.641 | 0.698 | 0.686 |
| Kling 2.6 Pro | 7 | 0.534 | 0.529 | 0.598 | 0.364 | 0.530 | 0.358 | 0.570 | 0.605 | 0.637 | 0.613 |
| Sora v2 Pro# | 17 | 0.362 | 0.208 | 0.268 | 0.186 | 0.255 | 0.115 | 0.476 | 0.513 | 0.664 | 0.561 |
| Sora v1 | 22 | 0.266 | 0.151 | 0.223 | 0.111 | 0.166 | 0.139 | 0.314 | 0.324 | 0.544 | 0.419 |
| *Robotics-specific* | | | | | | | | | | | |
| Cosmos 2.5 | 9 | 0.464 | 0.358 | 0.338 | 0.201 | 0.496 | 0.399 | 0.544 | 0.560 | 0.658 | 0.626 |
| DreamGen(gr1) | 12 | 0.420 | 0.312 | 0.372 | 0.297 | 0.334 | 0.215 | 0.564 | 0.532 | 0.579 | 0.575 |
| DreamGen(droid) | 13 | 0.405 | 0.358 | 0.348 | 0.214 | 0.316 | 0.339 | 0.499 | 0.476 | 0.542 | 0.556 |
| Vidar | 24 | 0.206 | 0.073 | 0.106 | 0.050 | 0.054 | 0.050 | 0.382 | 0.410 | 0.374 | 0.357 |
| UnifoLM-WMA-0 | 25 | 0.123 | 0.036 | 0.040 | 0.018 | 0.062 | 0.000 | 0.268 | 0.194 | 0.293 | 0.200 |
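The "Avg." column can be reproduced directly as the unweighted mean of a model's nine per-dimension scores. For example, using the values reported above for Wan2.2_A14B:

```python
# Per-dimension scores for Wan2.2_A14B, copied from the leaderboard table:
# five task dimensions followed by four embodiment dimensions.
scores = [0.381, 0.454, 0.373, 0.501, 0.330,   # Manipulation..Reasoning
          0.608, 0.582, 0.690, 0.648]          # Single arm..Humanoid
avg = sum(scores) / len(scores)
print(round(avg, 3))  # 0.507, matching the reported Avg.
```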
Automatic Evaluation for Various Robotic Scenarios
Overview of RBench Statistics
Comparison of Representative Robotic Video Datasets
| Dataset | Year | #Videos | #Skills | Resolution | Optical Flow | Diverse Robotic Forms | Diverse Captions |
|---|---|---|---|---|---|---|---|
| RoboTurk | 2018 | 2.1k | 2 | 480P | ✗ | ✗ | ✗ |
| RoboNet | 2019 | 162k | N/A | 240P | ✗ | ✗ | ✗ |
| BridgeData | 2021 | 7.2k | 4 | 480P | ✗ | ✗ | ✗ |
| RH20T | 2023 | 13k | 33 | 720P | ✗ | ✗ | ✗ |
| DROID | 2024 | 76k | 86 | 720P | ✗ | ✗ | ✗ |
| Open X-Embodiment | 2024 | 1.4M | 217 | 64P–720P | ✗ | ✓ | ✗ |
| RoboMIND | 2024 | 107k | 38 | 480P | ✗ | ✗ | ✗ |
| RoboCOIN | 2025 | 180k | 36 | 480P | ✗ | ✗ | ✗ |
| Galaxea | 2025 | 100k | 58 | 720P | ✗ | ✗ | ✗ |
| InternData-A1 | 2025 | 630k | 18 | 480P | ✗ | ✗ | ✗ |
| Fourier ActionNet | 2025 | 13k | 16 | 800P | ✗ | ✗ | ✗ |
| Humanoid Everyday | 2025 | 10.3k | 221 | 320P–720P | ✗ | ✗ | ✗ |
| Agibot World | 2025 | 1M | 87 | 480P | ✗ | ✗ | ✗ |
| RoVid-X (Ours) | 2026 | 4M | 1300+ | 720P | ✓ | ✓ | ✓ |
RoVid-X: Diverse Robotics Video Data and Caption Demonstration
Overview of RoVid-X Construction and Statistics
Ethics Concerns
All videos featured in these demos are either generated by models or sourced from publicly available datasets, and are intended solely to demonstrate the technical capabilities of our research. If you believe any content infringes upon rights or raises ethical concerns, please contact us at dengyufan10@stu.pku.edu.cn; we will address the issue and remove the material promptly.
BibTeX
@misc{deng2026rethinkingvideogenerationmodel,
title={Rethinking Video Generation Model for the Embodied World},
author={Yufan Deng and Zilin Pan and Hongyu Zhang and Xiaojie Li and Ruoqing Hu and Yufei Ding and Yiming Zou and Yan Zeng and Daquan Zhou},
year={2026},
eprint={2601.15282},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2601.15282},
}