An Introductory Demo Video Showcasing Our Work

Abstract

Video generation models have significantly advanced embodied intelligence, unlocking new possibilities for generating diverse robot data that capture perception, reasoning, and action in the physical world. However, synthesizing high-quality videos that accurately reflect real-world robotic interactions remains challenging, and the lack of a standardized benchmark limits fair comparisons and progress. To address this gap, we introduce a comprehensive robotics benchmark, RBench, designed to evaluate robot-oriented video generation across five task domains and four distinct embodiments. It assesses both task-level correctness and visual fidelity through reproducible sub-metrics, including structural consistency, physical plausibility, and action completeness. Evaluation of 25 representative models highlights significant deficiencies in generating physically realistic robot behaviors. Furthermore, the benchmark achieves a Spearman correlation coefficient of 0.96 with human evaluations, validating its effectiveness. While RBench provides the necessary lens to identify these deficiencies, achieving physical realism requires moving beyond evaluation to address the critical shortage of high-quality training data. Driven by these insights, we introduce a refined four-stage data pipeline, resulting in RoVid-X, the largest open-source robotic dataset for video generation with 4 million annotated video clips, covering thousands of tasks and enriched with comprehensive physical property annotations. Collectively, this synergistic ecosystem of evaluation and data establishes a robust foundation for rigorous assessment and scalable training of video models, accelerating the evolution of embodied AI toward general intelligence.
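
To make the human-alignment check concrete, here is a minimal sketch (not the authors' evaluation code) of how a Spearman correlation between benchmark scores and human ratings is computed. The benchmark scores reuse the "Avg." values of the top commercial models from the results table below; the human ratings are placeholders.

```python
# Minimal sketch: correlating benchmark scores with human ratings.
from scipy.stats import spearmanr

# Per-model benchmark "Avg." scores (from the results table below) and
# hypothetical mean human ratings on a 1-5 scale (placeholder values).
benchmark_scores = [0.607, 0.584, 0.570, 0.565, 0.563, 0.551, 0.534]
human_ratings = [4.6, 4.4, 4.5, 4.2, 4.3, 4.1, 3.9]

# Spearman's rho compares rank orderings, so the two scales need not match.
rho, p_value = spearmanr(benchmark_scores, human_ratings)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3g})")
```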

Overview of RBench and RoVid-X

Teaser Figure

Task-wise Model Comparison

Embodiment-wise Model Comparison

RBench Quantitative Results

Evaluations across task-oriented and embodiment-specific dimensions for 25 models from open-source, commercial, and robotics-specific families. The first five score columns (Manipulation through Reasoning) are task scores and the last four (Single arm through Humanoid) are embodiment scores; the "Avg." column is the mean of these nine indicators.

Model | Rank | Avg. | Manipulation | Spatial | Multi-entity | Long-horizon | Reasoning | Single arm | Dual arm | Quadruped | Humanoid
Open-source
Wan2.2_A14B | 8 | 0.507 | 0.381 | 0.454 | 0.373 | 0.501 | 0.330 | 0.608 | 0.582 | 0.690 | 0.648
HunyuanVideo 1.5 | 10 | 0.460 | 0.442 | 0.316 | 0.312 | 0.438 | 0.364 | 0.513 | 0.526 | 0.634 | 0.595
LongCat-Video | 11 | 0.437 | 0.372 | 0.310 | 0.220 | 0.384 | 0.186 | 0.586 | 0.576 | 0.681 | 0.621
Wan2.1_14B | 14 | 0.399 | 0.344 | 0.268 | 0.282 | 0.335 | 0.205 | 0.464 | 0.497 | 0.595 | 0.599
LTX-2 | 15 | 0.381 | 0.284 | 0.304 | 0.233 | 0.386 | 0.164 | 0.453 | 0.424 | 0.622 | 0.555
Wan2.2_5B | 16 | 0.380 | 0.331 | 0.313 | 0.142 | 0.318 | 0.234 | 0.436 | 0.448 | 0.590 | 0.607
SkyReels | 18 | 0.361 | 0.203 | 0.276 | 0.203 | 0.254 | 0.234 | 0.507 | 0.477 | 0.586 | 0.509
LTX-Video | 19 | 0.344 | 0.302 | 0.176 | 0.210 | 0.280 | 0.241 | 0.440 | 0.456 | 0.526 | 0.464
FramePack | 20 | 0.339 | 0.206 | 0.258 | 0.173 | 0.169 | 0.170 | 0.440 | 0.464 | 0.626 | 0.548
HunyuanVideo | 21 | 0.303 | 0.177 | 0.180 | 0.108 | 0.147 | 0.035 | 0.454 | 0.480 | 0.625 | 0.524
CogVideoX_5B | 23 | 0.256 | 0.116 | 0.112 | 0.098 | 0.212 | 0.079 | 0.338 | 0.385 | 0.465 | 0.496
Commercial
Wan 2.6 | 1 | 0.607 | 0.546 | 0.656 | 0.479 | 0.514 | 0.531 | 0.666 | 0.681 | 0.723 | 0.667
Seedance 1.5 Pro | 2 | 0.584 | 0.577 | 0.495 | 0.484 | 0.570 | 0.470 | 0.648 | 0.641 | 0.680 | 0.692
Wan 2.5 | 3 | 0.570 | 0.527 | 0.576 | 0.402 | 0.496 | 0.437 | 0.680 | 0.634 | 0.726 | 0.654
Hailuo v2 | 4 | 0.565 | 0.560 | 0.637 | 0.386 | 0.545 | 0.474 | 0.594 | 0.611 | 0.640 | 0.635
Veo 3 | 5 | 0.563 | 0.521 | 0.508 | 0.430 | 0.530 | 0.504 | 0.634 | 0.610 | 0.689 | 0.637
Seedance 1.0 | 6 | 0.551 | 0.542 | 0.425 | 0.448 | 0.454 | 0.442 | 0.622 | 0.641 | 0.698 | 0.686
Kling 2.6 Pro | 7 | 0.534 | 0.529 | 0.598 | 0.364 | 0.530 | 0.358 | 0.570 | 0.605 | 0.637 | 0.613
Sora v2 Pro# | 17 | 0.362 | 0.208 | 0.268 | 0.186 | 0.255 | 0.115 | 0.476 | 0.513 | 0.664 | 0.561
Sora v1 | 22 | 0.266 | 0.151 | 0.223 | 0.111 | 0.166 | 0.139 | 0.314 | 0.324 | 0.544 | 0.419
Robotics-specific
Cosmos 2.5 | 9 | 0.464 | 0.358 | 0.338 | 0.201 | 0.496 | 0.399 | 0.544 | 0.560 | 0.658 | 0.626
DreamGen(gr1) | 12 | 0.420 | 0.312 | 0.372 | 0.297 | 0.334 | 0.215 | 0.564 | 0.532 | 0.579 | 0.575
DreamGen(droid) | 13 | 0.405 | 0.358 | 0.348 | 0.214 | 0.316 | 0.339 | 0.499 | 0.476 | 0.542 | 0.556
Vidar | 24 | 0.206 | 0.073 | 0.106 | 0.050 | 0.054 | 0.050 | 0.382 | 0.410 | 0.374 | 0.357
UnifoLM-WMA-0 | 25 | 0.123 | 0.036 | 0.040 | 0.018 | 0.062 | 0.000 | 0.268 | 0.194 | 0.293 | 0.200
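
Since the "Avg." column is the unweighted mean of the nine per-dimension scores, the rankings can be reproduced directly from the table. A minimal sketch follows; the dict layout is illustrative, not RBench's actual output format.

```python
# Minimal sketch: deriving "Avg." and rank from per-dimension scores.
# Five task scores followed by four embodiment scores, as in the table.
scores = {
    "Wan2.2_A14B": [0.381, 0.454, 0.373, 0.501, 0.330,
                    0.608, 0.582, 0.690, 0.648],
    "HunyuanVideo 1.5": [0.442, 0.316, 0.312, 0.438, 0.364,
                         0.513, 0.526, 0.634, 0.595],
}

# Mean of the nine indicators per model, then sort descending for ranks.
averages = {model: sum(v) / len(v) for model, v in scores.items()}
ranked = sorted(averages.items(), key=lambda kv: kv[1], reverse=True)
for rank, (model, avg) in enumerate(ranked, start=1):
    print(f"{rank}. {model}: {avg:.3f}")  # Wan2.2_A14B -> 0.507
```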

Automatic Evaluation for Various Robotic Scenarios

Overview of RBench Statistics

RBench Statistics

Comparison of Representative Robotic Video Datasets

Dataset | Year | #Videos | #Skills | Resolution | Optical Flow | Diverse Robotic Forms | Diverse Captions
RoboTurk | 2018 | 2.1k | 2 | 480P
RoboNet | 2019 | 162k | N/A | 240P
BridgeData | 2021 | 7.2k | 4 | 480P
RH20T | 2023 | 13k | 33 | 720P
DROID | 2024 | 76k | 86 | 720P
Open X-Embodiment | 2024 | 1.4M | 217 | 64P–720P
RoboMIND | 2024 | 107k | 38 | 480P
RoboCOIN | 2025 | 180k | 36 | 480P
Galaxea | 2025 | 100k | 58 | 720P
InternData-A1 | 2025 | 630k | 18 | 480P
Fourier ActionNet | 2025 | 13k | 16 | 800P
Humanoid Everyday | 2025 | 10.3k | 221 | 320P–720P
Agibot World | 2025 | 1M | 87 | 480P
RoVid-X (Ours) | 2026 | 4M | 1300+ | 720P

RoVid-X: Diverse Robotic Video Data and Caption Demonstration

Diverse Data
Diverse Caption
Task caption: pick up the blue cup and place it into the bowl
Short caption: The robot uses its right arm to pick up the blue cup and places it into the bowl on the table.
Detailed caption: The video opens with a top-down view of a table covered in a red-and-white checkered tablecloth, featuring a pink bowl at its center and a blue mug to the right. Two robotic arms are positioned over the table—one on the left remains stationary, while the right arm hovers near the mug. The background includes a white backdrop, a person in a cartoon-patterned sweater, and a desk cluttered with computers and electronics. The right robotic arm’s gripper closes around the blue mug, lifting it smoothly. It moves the mug horizontally to align with the pink bowl’s center, then lowers it gently into the bowl. After ensuring the mug is securely placed, the gripper releases, and the right arm retracts to its starting position. The left arm stays motionless throughout. By the end, the blue mug rests inside the pink bowl, completing the task. The robot demonstrates precise object manipulation and accurate pick-and-place capabilities, with the background elements remaining consistent as the focus stays on the arm’s deliberate actions.
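
One plausible way to organize this three-tier caption scheme in an annotation record is sketched below; the field names and layout are hypothetical, not the released RoVid-X file format.

```python
# Hypothetical annotation record for one clip; the real schema may differ.
from dataclasses import dataclass

@dataclass
class ClipAnnotation:
    clip_id: str           # unique identifier of the video clip
    embodiment: str        # e.g. "single arm", "dual arm", "quadruped", "humanoid"
    task_caption: str      # terse task command
    short_caption: str     # one-sentence action summary
    detailed_caption: str  # full scene, motion, and outcome description

example = ClipAnnotation(
    clip_id="clip_000001",
    embodiment="dual arm",
    task_caption="pick up the blue cup and place it into the bowl",
    short_caption="The robot uses its right arm to pick up the blue cup "
                  "and places it into the bowl on the table.",
    detailed_caption="...",  # the full multi-sentence caption shown above
)
print(example.task_caption)
```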

Overview of RoVid-X Construction and Statistics

RoVid-X-4M Overview

Ethics Concerns

All videos featured in these demos are either generated by models or sourced from publicly available datasets, and are intended solely to demonstrate the technical capabilities of our research. If you believe any content infringes upon rights or raises ethical concerns, please contact us at dengyufan10@stu.pku.edu.cn, and we will promptly address the issue and remove the material.

BibTeX

@misc{deng2026rethinkingvideogenerationmodel,
      title={Rethinking Video Generation Model for the Embodied World}, 
      author={Yufan Deng and Zilin Pan and Hongyu Zhang and Xiaojie Li and Ruoqing Hu and Yufei Ding and Yiming Zou and Yan Zeng and Daquan Zhou},
      year={2026},
      eprint={2601.15282},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2601.15282}, 
}