RoboAlign-R1 is a unified framework for robot video world models that replaces weak low-level RL rewards with a distilled multimodal reward, and stabilizes long-horizon autoregressive generation with a training-free Sliding Window Re-encoding (SWR) strategy.
Existing world models are trained with reconstruction objectives such as MSE, LPIPS, or SSIM. These low-level proxies do not reflect what actually matters for decision making: instruction following, manipulation success, contact realism, and physics adherence. Stronger multimodal judges exist, but they are too expensive to query online.
Token-based autoregressive world models condition each step on all previously predicted tokens. Small per-step errors compound, causing textures to drift, contacts to lose plausibility, and object identity to degrade as the rollout grows longer.
A tokenize–predict–decode world model is post-trained with a distilled multimodal reward via GRPO, then decoded with Sliding Window Re-encoding.
We collect candidate videos from four robot data sources, combining two construction routes: (i) rule-based degradations of ground-truth episodes and (ii) generated rollouts from open-source image-to-video and world-model baselines.
From this pool we curate 10,000 annotated video–instruction pairs, each labelled with a raw score vector $r = (r_1, \dots, r_6)$ over six dimensions with rubric ranges [3, 2, 1, 1, 1, 2]: instruction following (0–3), manipulation success (0–2), action–outcome consistency (0–1), temporal coherence (0–1), contact realism (0–1), and physics adherence (0–2), for a maximum total of 10.
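For concreteness, here is a minimal sketch of how a raw score vector maps onto the per-dimension and 0–10 scales used in the results below; the helper names are ours, not part of any released code.

```python
import numpy as np

# Rubric maxima for the six dimensions (instruction, manipulation,
# action-outcome, temporal, contact, physics); max total = 10.
RUBRIC_MAX = np.array([3.0, 2.0, 1.0, 1.0, 1.0, 2.0])

def normalize_scores(raw):
    """Per-dimension normalization of r = (r1, ..., r6) into [0, 1]."""
    return np.clip(np.asarray(raw, dtype=float) / RUBRIC_MAX, 0.0, 1.0)

def total_score(raw):
    """Aggregate on the 0-10 scale reported in the results table."""
    return float(np.clip(np.asarray(raw, dtype=float), 0.0, RUBRIC_MAX).sum())

assert total_score([3, 2, 1, 1, 1, 2]) == 10.0  # a perfect real video
```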
The RoboAlign-Judge teacher (Qwen3-VL-8B-Thinking, fine-tuned on RobotWorldBench) outputs structured six-dimensional scores, but is too expensive to call inside an RL loop.
We distill the teacher into a lightweight student — a compact visual–text encoder with a linear scoring head, ~98M parameters, running at ~50 videos/s. A dimension-weighted Huber regression matches the teacher's normalized scores, yielding a 10× cheaper online reward.
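A minimal PyTorch sketch of this distillation step, assuming the student exposes a six-way scoring head; the encoder interface, dimension names, and default weights are illustrative, not the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StudentReward(nn.Module):
    """Compact visual-text encoder with a linear scoring head.

    `encoder` is any module mapping (video, instruction) to a (B, D)
    embedding; its internals are abstracted away here.
    """
    def __init__(self, encoder, embed_dim=512):
        super().__init__()
        self.encoder = encoder
        self.head = nn.Linear(embed_dim, 6)  # one score per rubric dimension

    def forward(self, video, instruction):
        h = self.encoder(video, instruction)   # (B, D)
        return torch.sigmoid(self.head(h))     # (B, 6), normalized to [0, 1]

def distill_loss(student_scores, teacher_scores, dim_weights, delta=1.0):
    """Dimension-weighted Huber regression onto the teacher's
    normalized six-dimensional scores. Both inputs: (B, 6)."""
    per_dim = F.huber_loss(student_scores, teacher_scores,
                           reduction="none", delta=delta)  # (B, 6)
    return (per_dim * dim_weights).mean()
```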
A dual-branch FSQ tokenizer encodes context and dynamics tokens; action tokens are interleaved to form a unified sequence modelled by a 12-layer causal Transformer. We pre-train with next-token prediction on dynamics tokens only.
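A sketch of the masked next-token objective, assuming a boolean mask marks which positions in the interleaved (context, action, dynamics) sequence carry dynamics tokens; tensor shapes and names are illustrative.

```python
import torch
import torch.nn.functional as F

def ntp_loss_dynamics_only(logits, tokens, is_dynamics):
    """Next-token prediction restricted to dynamics tokens; context and
    action tokens serve purely as conditioning.

    logits: (B, L, V) from the causal Transformer
    tokens: (B, L) interleaved context/action/dynamics ids
    is_dynamics: (B, L) bool, True where the token is a dynamics token
    """
    logits, targets = logits[:, :-1], tokens[:, 1:]   # shift for next-token
    mask = is_dynamics[:, 1:].reshape(-1).float()     # mask on the targets
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1), reduction="none")
    return (loss * mask).sum() / mask.sum().clamp(min=1.0)
```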
During RL post-training, a group of G rollouts is sampled, each scored by the student reward. Group-normalized advantages feed a clipped GRPO objective with a KL penalty to the SFT reference:
$$
\mathcal{L}_{\mathrm{GRPO}} \;=\; -\,\mathbb{E}\!\left[\min\!\big(\rho\,\hat{A},\; \mathrm{clip}(\rho,\,1-\varepsilon,\,1+\varepsilon)\,\hat{A}\big)\right] \;+\; \beta\,\mathrm{KL}\!\left(\pi_\theta \,\Vert\, \pi_0\right),
$$

where $\rho$ is the importance ratio between the current and sampling policies and $\hat{A}_i = (R_i - \mathrm{mean}(R)) / \mathrm{std}(R)$ is the group-normalized advantage of rollout $i$.
Composite reward: $R = \sum_{k=1}^{6} w_k \, g_\psi(l, \hat{v})_k$, summing the six student-predicted dimensions for instruction $l$ and generated rollout $\hat{v}$.
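A compact sketch of one GRPO update over a group of G rollouts with the composite student reward; the KL term is a crude sequence-level stand-in for the per-token penalty, and all names are ours.

```python
import torch

def composite_reward(student_scores, weights):
    """R_i = sum_k w_k * g_psi(l, v_i)_k over the six dimensions.
    student_scores: (G, 6), weights: (6,)."""
    return (student_scores * weights).sum(dim=-1)                 # (G,)

def grpo_loss(logp, logp_old, logp_ref, rewards, eps=0.2, beta=0.01):
    """Clipped GRPO objective with KL penalty to the SFT reference.
    logp / logp_old / logp_ref: (G,) sequence log-probs under the
    current, sampling, and frozen reference policies; rewards: (G,)."""
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)     # group-norm
    ratio = torch.exp(logp - logp_old)
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * adv
    policy_loss = -torch.min(unclipped, clipped).mean()
    kl = (logp - logp_ref).mean()       # sequence-level KL(pi || pi0) estimate
    return policy_loss + beta * kl
```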
Inspired by StreamingLLM's attention-sink mechanism, SWR partitions a T-frame rollout into K=⌈T/W⌉ segments. At each segment boundary, the last predicted frame is decoded to pixels and then re-encoded as fresh context tokens, resetting the autoregressive prompt.
This training-free decode–re-encode cycle truncates token-level drift across segments while keeping the active KV-cache bounded by O(W). Because errors accumulate for at most W steps before each refresh, the segment-to-segment residual forms a geometric series rather than a linear sum, giving

$$
E_{\mathrm{SWR}}(T) \;\le\; W\varepsilon \;+\; \frac{W\varepsilon + \delta}{1 - \alpha^{W}},
$$

with $\varepsilon$ the per-step token error, $\delta$ the re-encoding error, and $\alpha < 1$ a per-step contraction factor. The bound is independent of the total horizon T, versus the O(Tε) drift of vanilla autoregressive rollouts.
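A minimal sketch of the SWR decoding loop under an assumed interface: `encode_context`, `generate`, and `decode` are placeholders for the tokenizer and Transformer components above, not a released API.

```python
import math

def swr_rollout(model, tokenizer, first_frame, actions, T, W=6):
    """Training-free Sliding Window Re-encoding over K = ceil(T/W)
    segments: at each boundary, decode the last predicted frame to
    pixels and re-encode it as fresh context tokens, so the
    autoregressive prompt (and KV-cache) stays O(W)."""
    frames = []
    ctx = tokenizer.encode_context(first_frame)          # context tokens
    for k in range(math.ceil(T / W)):
        seg_actions = actions[k * W:(k + 1) * W]         # per-frame actions
        dyn = model.generate(context=ctx, actions=seg_actions)  # fresh cache
        seg_frames = tokenizer.decode(dyn)               # tokens -> frame list
        frames.extend(seg_frames)
        ctx = tokenizer.encode_context(seg_frames[-1])   # decode-re-encode
    return frames[:T]
```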
RoboAlign-R1 leads across all six judged dimensions on RobotWorldBench and wins on pixel-level metrics on both RT-1 and BridgeData V2.
Task Alignment covers Instr. / Manip. / Act.-Out.; Physical Realism covers Temp. / Contact / Phys. (maximum total 10).

| Method | Instr. ↑ | Manip. ↑ | Act.-Out. ↑ | Temp. ↑ | Contact ↑ | Phys. ↑ | Total ↑ |
|---|---|---|---|---|---|---|---|
| Real videos | 3.00 | 2.00 | 1.00 | 1.00 | 1.00 | 2.00 | 10.00 |
| **Closed video models** | | | | | | | |
| Kling 2.6 | 2.42 | 1.38 | 0.46 | 0.58 | 0.82 | 1.18 | 6.84 |
| Runway Gen-4.5 | 2.34 | 1.29 | 0.43 | 0.55 | 0.79 | 1.12 | 6.52 |
| MiniMax Hailuo 02 | 2.18 | 1.17 | 0.38 | 0.47 | 0.71 | 1.01 | 5.92 |
| Luma Dream Machine | 2.03 | 1.08 | 0.35 | 0.44 | 0.69 | 0.92 | 5.51 |
| **Open video models** | | | | | | | |
| HunyuanVideo-I2V | 1.20 | 0.40 | 0.24 | 0.56 | 0.96 | 1.56 | 4.92 |
| LTX-Video | 2.28 | 1.32 | 0.52 | 0.26 | 0.28 | 1.00 | 5.66 |
| Stable Video Diffusion XT | 1.84 | 0.98 | 0.46 | 0.06 | 0.36 | 0.58 | 4.28 |
| Mochi-1 | 1.83 | 0.91 | 0.31 | 0.39 | 0.61 | 0.82 | 4.87 |
| CogVideoX-I2V | 1.94 | 1.00 | 0.40 | 0.42 | 0.92 | 1.24 | 5.92 |
| OpenSora-I2V | 1.58 | 0.82 | 0.27 | 0.31 | 0.53 | 0.71 | 4.22 |
| OpenSora-Plan-I2V | 1.79 | 0.89 | 0.30 | 0.37 | 0.60 | 0.81 | 4.76 |
| I2VGen-XL | 1.56 | 0.66 | 0.32 | 0.00 | 0.44 | 0.56 | 3.54 |
| **Embodied / interactive world-model baselines** | | | | | | | |
| RLVR-World | 2.29 | 1.31 | 0.44 | 0.43 | 0.84 | 1.19 | 6.54 |
| iVideoGPT | 2.60 | 1.60 | 0.70 | 0.74 | 0.56 | 1.54 | 7.74 |
| RoboDreamer | 2.02 | 1.02 | 0.34 | 0.32 | 0.04 | 0.70 | 4.44 |
| Vid2World | 2.18 | 1.08 | 0.40 | 0.22 | 0.98 | 1.04 | 5.90 |
| Wan2.2-TI2V-5B (LoRA) | 2.40 | 1.02 | 0.41 | 0.24 | 0.98 | 0.96 | 6.01 |
| RoboAlign-R1 (ours) | 2.72 | 1.72 | 0.72 | 0.78 | 1.00 | 1.58 | 8.52 |
| Metric | Default AR | SWR (W=6) | Δ |
|---|---|---|---|
| **Quality** | | | |
| SSIM ↑ | 0.7526 | 0.7735 | +2.8% |
| PSNR (dB) ↑ | 20.49 | 21.11 | +0.62 dB |
| LPIPS ↓ | 0.2078 | 0.1875 | −9.8% |
| ROI-LPIPS ↓ | — | — | −12.2% |
| **Efficiency** | | | |
| Total time (s) | 5.646 | 5.709 | +1.1% |
| Throughput (FPS) | 5.31 | 5.26 | −0.9% |
| Peak seq. length | 1× | 0.45× | −54.8% |
| Peak memory | 1× | 0.96× | −4.2% |
Watch RoboAlign-Judge (Qwen3-VL-8B-Thinking fine-tuned on RobotWorldBench) evaluate a generated rollout end-to-end: it reads the instruction and the initial frame, watches the generated video, then emits a six-dimensional rubric score with a written rationale.
The chain-of-thought above is replayed from actual model output (saved from Qwen3-VL-8B-Thinking + LoRA). Scores are aggregated over 5 independent sampling runs (temperature 0.6).
Representative rollouts on RT-1 and BridgeData V2, and long-horizon SWR case studies.
Each animation below shows a long-horizon rollout from one of our two benchmarks, with three strategies placed left-to-right for frame-aligned comparison: ground truth (GT) on the left, default autoregressive decoding (Default AR) in the middle, and Sliding Window Re-encoding (SWR) on the right.
Observe how the middle column (Default AR) progressively loses texture sharpness, object identity, and contact geometry as the horizon grows, while the right column (SWR) stays close to the left column (GT) thanks to periodic context refresh.




@misc{anonymous2026roboalignr1,
title = {RoboAlign-R1: Distilled Multimodal Reward Alignment for Robot Video World Models},
author = {Anonymous Authors},
year = {2026}
}