RoboAlign-R1 is a unified framework for robot video world models that replaces weak low-level RL rewards with a distilled multimodal reward, and stabilizes long-horizon autoregressive generation with a training-free Sliding Window Re-encoding (SWR) strategy.
Existing world models are trained with reconstruction objectives such as MSE, LPIPS, or SSIM. These low-level proxies do not reflect what actually matters for decision making: instruction following, manipulation success, contact realism, and physics adherence. Stronger multimodal judges exist, but they are too expensive to query online.
Token-based autoregressive world models condition each step on all previously predicted tokens. Small per-step errors compound, causing textures to drift, contacts to lose plausibility, and object identity to degrade as the rollout grows longer.
A tokenize–predict–decode world model is post-trained with a distilled multimodal reward via GRPO, then decoded with Sliding Window Re-encoding.
We collect candidate videos from four robot data sources, combining two construction routes: (i) rule-based degradations of ground-truth episodes and (ii) generated rollouts from open-source image-to-video and world-model baselines.
From this pool we curate 10,000 annotated video–instruction pairs, each labelled with a raw score vector $r = (r_1, \dots, r_6)$ over six dimensions with rubric ranges [3, 2, 1, 1, 1, 2]: instruction following (0–3), manipulation success (0–2), action–outcome consistency (0–1), temporal coherence (0–1), contact realism (0–1), and physics adherence (0–2), for a maximum total of 10.
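For concreteness, here is a minimal sketch of how a raw score vector maps onto the per-dimension and 0–10 scales used in the results below; the helper names are ours, not part of any released code.

```python
import numpy as np

# Rubric maxima for the six dimensions (instruction, manipulation,
# action-outcome, temporal, contact, physics); max total = 10.
RUBRIC_MAX = np.array([3.0, 2.0, 1.0, 1.0, 1.0, 2.0])

def normalize_scores(raw):
    """Per-dimension normalization of r = (r1, ..., r6) into [0, 1]."""
    return np.clip(np.asarray(raw, dtype=float) / RUBRIC_MAX, 0.0, 1.0)

def total_score(raw):
    """Aggregate on the 0-10 scale reported in the results table."""
    return float(np.clip(np.asarray(raw, dtype=float), 0.0, RUBRIC_MAX).sum())

assert total_score([3, 2, 1, 1, 1, 2]) == 10.0  # a perfect real video
```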
The RoboAlign-Judge teacher (Qwen3-VL-8B-Thinking, fine-tuned on RobotWorldBench) outputs structured six-dimensional scores, but is too expensive to call inside an RL loop.
We distill the teacher into a lightweight student — a compact visual–text encoder with a linear scoring head, ~98M parameters, running at ~50 videos/s. A dimension-weighted Huber regression matches the teacher's normalized scores, yielding a 10× cheaper online reward.
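A minimal PyTorch sketch of this distillation step, assuming the student exposes a six-way scoring head; the encoder interface, dimension names, and default weights are illustrative, not the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StudentReward(nn.Module):
    """Compact visual-text encoder with a linear scoring head.

    `encoder` is any module mapping (video, instruction) to a (B, D)
    embedding; its internals are abstracted away here.
    """
    def __init__(self, encoder, embed_dim=512):
        super().__init__()
        self.encoder = encoder
        self.head = nn.Linear(embed_dim, 6)  # one score per rubric dimension

    def forward(self, video, instruction):
        h = self.encoder(video, instruction)   # (B, D)
        return torch.sigmoid(self.head(h))     # (B, 6), normalized to [0, 1]

def distill_loss(student_scores, teacher_scores, dim_weights, delta=1.0):
    """Dimension-weighted Huber regression onto the teacher's
    normalized six-dimensional scores. Both inputs: (B, 6)."""
    per_dim = F.huber_loss(student_scores, teacher_scores,
                           reduction="none", delta=delta)  # (B, 6)
    return (per_dim * dim_weights).mean()
```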
A dual-branch FSQ tokenizer encodes context and dynamics tokens; action tokens are interleaved to form a unified sequence modelled by a 12-layer causal Transformer. We pre-train with next-token prediction on dynamics tokens only.
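A sketch of the masked next-token objective, assuming a boolean mask marks which positions in the interleaved (context, action, dynamics) sequence carry dynamics tokens; tensor shapes and names are illustrative.

```python
import torch
import torch.nn.functional as F

def ntp_loss_dynamics_only(logits, tokens, is_dynamics):
    """Next-token prediction restricted to dynamics tokens; context and
    action tokens serve purely as conditioning.

    logits: (B, L, V) from the causal Transformer
    tokens: (B, L) interleaved context/action/dynamics ids
    is_dynamics: (B, L) bool, True where the token is a dynamics token
    """
    logits, targets = logits[:, :-1], tokens[:, 1:]   # shift for next-token
    mask = is_dynamics[:, 1:].reshape(-1).float()     # mask on the targets
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1), reduction="none")
    return (loss * mask).sum() / mask.sum().clamp(min=1.0)
```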
During RL post-training, a group of G rollouts is sampled, each scored by the student reward. Group-normalized advantages feed a clipped GRPO objective with a KL penalty to the SFT reference:
$$
\mathcal{L}_{\mathrm{GRPO}} \;=\; -\,\mathbb{E}\!\left[\min\!\big(\rho\,\hat{A},\; \mathrm{clip}(\rho,\,1-\varepsilon,\,1+\varepsilon)\,\hat{A}\big)\right] \;+\; \beta\,\mathrm{KL}\!\left(\pi_\theta \,\Vert\, \pi_0\right),
$$

where $\rho$ is the importance ratio between the current and sampling policies and $\hat{A}_i = (R_i - \mathrm{mean}(R)) / \mathrm{std}(R)$ is the group-normalized advantage of rollout $i$.
Composite reward: $R = \sum_{k=1}^{6} w_k \, g_\psi(l, \hat{v})_k$, summing the six student-predicted dimensions for instruction $l$ and generated rollout $\hat{v}$.
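A compact sketch of one GRPO update over a group of G rollouts with the composite student reward; the KL term is a crude sequence-level stand-in for the per-token penalty, and all names are ours.

```python
import torch

def composite_reward(student_scores, weights):
    """R_i = sum_k w_k * g_psi(l, v_i)_k over the six dimensions.
    student_scores: (G, 6), weights: (6,)."""
    return (student_scores * weights).sum(dim=-1)                 # (G,)

def grpo_loss(logp, logp_old, logp_ref, rewards, eps=0.2, beta=0.01):
    """Clipped GRPO objective with KL penalty to the SFT reference.
    logp / logp_old / logp_ref: (G,) sequence log-probs under the
    current, sampling, and frozen reference policies; rewards: (G,)."""
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)     # group-norm
    ratio = torch.exp(logp - logp_old)
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * adv
    policy_loss = -torch.min(unclipped, clipped).mean()
    kl = (logp - logp_ref).mean()       # sequence-level KL(pi || pi0) estimate
    return policy_loss + beta * kl
```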
Inspired by StreamingLLM's attention-sink mechanism, SWR partitions a T-frame rollout into K=⌈T/W⌉ segments. At each segment boundary, the last predicted frame is decoded to pixels and then re-encoded as fresh context tokens, resetting the autoregressive prompt.
This training-free decode–re-encode cycle truncates token-level drift across segments while keeping the active KV-cache bounded by O(W). Because errors accumulate for at most W steps before each refresh, the segment-to-segment residual forms a geometric series rather than a linear sum, giving

$$
E_{\mathrm{SWR}}(T) \;\le\; W\varepsilon \;+\; \frac{W\varepsilon + \delta}{1 - \alpha^{W}},
$$

with $\varepsilon$ the per-step token error, $\delta$ the re-encoding error, and $\alpha < 1$ a per-step contraction factor. The bound is independent of the total horizon T, versus the O(Tε) drift of vanilla autoregressive rollouts.
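A minimal sketch of the SWR decoding loop under an assumed interface: `encode_context`, `generate`, and `decode` are placeholders for the tokenizer and Transformer components above, not a released API.

```python
import math

def swr_rollout(model, tokenizer, first_frame, actions, T, W=6):
    """Training-free Sliding Window Re-encoding over K = ceil(T/W)
    segments: at each boundary, decode the last predicted frame to
    pixels and re-encode it as fresh context tokens, so the
    autoregressive prompt (and KV-cache) stays O(W)."""
    frames = []
    ctx = tokenizer.encode_context(first_frame)          # context tokens
    for k in range(math.ceil(T / W)):
        seg_actions = actions[k * W:(k + 1) * W]         # per-frame actions
        dyn = model.generate(context=ctx, actions=seg_actions)  # fresh cache
        seg_frames = tokenizer.decode(dyn)               # tokens -> frame list
        frames.extend(seg_frames)
        ctx = tokenizer.encode_context(seg_frames[-1])   # decode-re-encode
    return frames[:T]
```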
RoboAlign-R1 leads across all six judged dimensions on RobotWorldBench and wins on pixel-level metrics on both RT-1 and BridgeData V2.
Task Alignment covers Instr. / Manip. / Act.-Out.; Physical Realism covers Temp. / Contact / Phys. (maximum total 10).

| Method | Instr. ↑ | Manip. ↑ | Act.-Out. ↑ | Temp. ↑ | Contact ↑ | Phys. ↑ | Total ↑ |
|---|---|---|---|---|---|---|---|
| Real videos | 3.00 | 2.00 | 1.00 | 1.00 | 1.00 | 2.00 | 10.00 |
| **Closed video models** | | | | | | | |
| Kling 2.6 | 2.42 | 1.38 | 0.46 | 0.58 | 0.82 | 1.18 | 6.84 |
| Runway Gen-4.5 | 2.34 | 1.29 | 0.43 | 0.55 | 0.79 | 1.12 | 6.52 |
| MiniMax Hailuo 02 | 2.18 | 1.17 | 0.38 | 0.47 | 0.71 | 1.01 | 5.92 |
| Luma Dream Machine | 2.03 | 1.08 | 0.35 | 0.44 | 0.69 | 0.92 | 5.51 |
| **Open video models** | | | | | | | |
| HunyuanVideo-I2V | 1.20 | 0.40 | 0.24 | 0.56 | 0.96 | 1.56 | 4.92 |
| LTX-Video | 2.28 | 1.32 | 0.52 | 0.26 | 0.28 | 1.00 | 5.66 |
| Stable Video Diffusion XT | 1.84 | 0.98 | 0.46 | 0.06 | 0.36 | 0.58 | 4.28 |
| Mochi-1 | 1.83 | 0.91 | 0.31 | 0.39 | 0.61 | 0.82 | 4.87 |
| CogVideoX-I2V | 1.94 | 1.00 | 0.40 | 0.42 | 0.92 | 1.24 | 5.92 |
| OpenSora-I2V | 1.58 | 0.82 | 0.27 | 0.31 | 0.53 | 0.71 | 4.22 |
| OpenSora-Plan-I2V | 1.79 | 0.89 | 0.30 | 0.37 | 0.60 | 0.81 | 4.76 |
| I2VGen-XL | 1.56 | 0.66 | 0.32 | 0.00 | 0.44 | 0.56 | 3.54 |
| **Embodied / interactive world-model baselines** | | | | | | | |
| RLVR-World | 2.29 | 1.31 | 0.44 | 0.43 | 0.84 | 1.19 | 6.54 |
| iVideoGPT | 2.60 | 1.60 | 0.70 | 0.74 | 0.56 | 1.54 | 7.74 |
| RoboDreamer | 2.02 | 1.02 | 0.34 | 0.32 | 0.04 | 0.70 | 4.44 |
| Vid2World | 2.18 | 1.08 | 0.40 | 0.22 | 0.98 | 1.04 | 5.90 |
| Wan2.2-TI2V-5B (LoRA) | 2.40 | 1.02 | 0.41 | 0.24 | 0.98 | 0.96 | 6.01 |
| RoboAlign-R1 (ours) | 2.72 | 1.72 | 0.72 | 0.78 | 1.00 | 1.58 | 8.52 |
| Metric | Default AR | SWR (W=6) | Δ |
|---|---|---|---|
| **Quality** | | | |
| SSIM ↑ | 0.7526 | 0.7735 | +2.8% |
| PSNR (dB) ↑ | 20.49 | 21.11 | +0.62 dB |
| LPIPS ↓ | 0.2078 | 0.1875 | −9.8% |
| ROI-LPIPS ↓ | — | — | −12.2% |
| **Efficiency** | | | |
| Total time (s) | 5.646 | 5.709 | +1.1% |
| Throughput (FPS) | 5.31 | 5.26 | −0.9% |
| Peak seq. length | 1× | 0.45× | −54.8% |
| Peak memory | 1× | 0.96× | −4.2% |
Watch RoboAlign-Judge (Qwen3-VL-8B-Thinking fine-tuned on RobotWorldBench) evaluate a generated rollout end-to-end: it reads the instruction and the initial frame, watches the generated video, then emits a six-dimensional rubric score with a written rationale.
The chain-of-thought above is replayed from actual model output (saved from Qwen3-VL-8B-Thinking + LoRA). Scores are aggregated over 5 independent sampling runs (temperature 0.6).
Representative rollouts on RT-1 and BridgeData V2, and long-horizon SWR case studies.
Each animation below shows a long-horizon rollout from one of our two benchmarks, with three strategies placed left-to-right for frame-aligned comparison: ground truth (GT) on the left, default autoregressive decoding (Default AR) in the middle, and Sliding Window Re-encoding (SWR) on the right.
Observe how the middle column (Default AR) progressively loses texture sharpness, object identity, and contact geometry as the horizon grows, while the right column (SWR) stays close to the left column (GT) thanks to periodic context refresh.




@misc{anonymous2026roboalignr1,
title = {RoboAlign-R1: Distilled Multimodal Reward Alignment for Robot Video World Models},
author = {Anonymous Authors},
year = {2026}
}