RoboAlign-R1: Distilled Multimodal Reward Alignment for Robot Video World Models

Anonymous Authors
  • +10.1% RoboAlign-Judge score vs. the strongest baseline
  • +2.8% SSIM on long-horizon rollouts
  • −9.8% LPIPS with SWR at ~1% overhead
  • 10× cheaper online reward via teacher→student distillation
TL;DR

What is RoboAlign-R1?

RoboAlign-R1 is a unified framework for robot video world models that replaces weak low-level RL rewards with a distilled multimodal reward, and stabilizes long-horizon autoregressive generation with a training-free Sliding Window Re-encoding (SWR) strategy.

  • RobotWorldBench — 10,000 annotated video–instruction pairs across 4 robot datasets, scored on 6 fine-grained dimensions.
  • RoboAlign-Judge (teacher) — a Qwen3-VL-8B-Thinking multimodal judge fine-tuned on RobotWorldBench.
  • Student Reward — a compact ~98M-parameter model distilled from the teacher, running at ~50 videos/s for online RL.
  • Sliding Window Re-encoding — periodically re-encodes recent predictions into fresh context, bounding KV-cache at O(W).
Overview of RoboAlign-R1
Figure 1. Overview of RoboAlign-R1. A robot-centric benchmark trains a multimodal teacher judge, which is distilled into a lightweight student reward model for efficient RL post-training. In parallel, Sliding Window Re-encoding stabilizes long-horizon autoregressive rollouts during inference.
Motivation

Two pain points of robot video world models

01

Reward misalignment

Existing world models are trained with reconstruction, MSE, LPIPS, or SSIM. These low-level proxies do not reflect what actually matters for decision making: instruction following, manipulation success, contact realism, and physics adherence. Stronger multimodal judges exist but are too expensive to query online.

02

Long-horizon error accumulation

Token-based autoregressive world models condition each step on all predicted tokens. Small per-step errors compound, causing textures to drift, contacts to lose plausibility, and object identity to degrade as the rollout grows longer.

Method

How RoboAlign-R1 works

A tokenize–predict–decode world model is post-trained with a distilled multimodal reward via GRPO, then decoded with Sliding Window Re-encoding.

RobotWorldBench

We collect candidate videos from four robot data sources, combining two candidate pools for annotation: (i) rule-based degradations of ground-truth episodes and (ii) generated rollouts from open-source image-to-video and world-model baselines.

From this pool we curate 10,000 annotated video–instruction pairs, each labelled with a raw score vector r = (r1, …, r6) over six dimensions with rubric ranges [3, 2, 1, 1, 1, 2] (a scoring sketch follows the list):

1. Instruction Following
2. Manipulation Success
3. Action–Outcome Consistency
4. Temporal Consistency
5. Contact Realism
6. Physics Adherence
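
The rubric ranges imply a simple aggregation; here is a minimal sketch (key and variable names are ours, not the paper's) showing how a raw score vector sums to the 10.00 ceiling attained by real videos in the RQ1 table:

```python
# Rubric maxima per dimension, in the order listed above (from the paper).
# Key names and the clamping behaviour are illustrative assumptions.
RUBRIC_MAX = {
    "instruction_following": 3.0,
    "manipulation_success": 2.0,
    "action_outcome_consistency": 1.0,
    "temporal_consistency": 1.0,
    "contact_realism": 1.0,
    "physics_adherence": 2.0,
}

def total_score(r: dict) -> float:
    """Sum the six per-dimension scores, clamping each to its rubric range."""
    return sum(min(max(r[k], 0.0), cap) for k, cap in RUBRIC_MAX.items())

# A perfect episode scores 3 + 2 + 1 + 1 + 1 + 2 = 10.00 ("Real videos" in RQ1).
```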
RobotWorldBench data distribution
Data distribution of RobotWorldBench across sources, instructions, and degradation types.

Teacher-to-Student Reward Distillation

The RoboAlign-Judge teacher (Qwen3-VL-8B-Thinking, fine-tuned on RobotWorldBench) outputs structured six-dimensional scores, but is too expensive to call inside an RL loop.

We distill the teacher into a lightweight student — a compact visual–text encoder with a linear scoring head, ~98M parameters, running at ~50 videos/s. A dimension-weighted Huber regression matches the teacher's normalized scores, yielding a 10× cheaper online reward.

Online iterative distillation. Every K policy updates, fresh rollouts from the current world model are scored by the teacher and added to the distillation set — countering reward hacking from distribution shift.
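
A minimal sketch of the distillation objective, assuming PyTorch and rubric-range weights (the paper specifies a dimension-weighted Huber regression but not the exact weights or delta):

```python
import torch
import torch.nn.functional as F

# Weights proportional to the rubric ranges [3, 2, 1, 1, 1, 2] (an assumption;
# only "dimension-weighted Huber regression" is stated in the text).
RUBRIC_MAX = torch.tensor([3.0, 2.0, 1.0, 1.0, 1.0, 2.0])

def distill_loss(student_scores: torch.Tensor,   # (B, 6) student predictions
                 teacher_scores: torch.Tensor,   # (B, 6) normalized teacher scores
                 delta: float = 0.25) -> torch.Tensor:
    """Dimension-weighted Huber regression of the student onto the teacher."""
    weights = RUBRIC_MAX / RUBRIC_MAX.sum()                    # (6,)
    per_dim = F.huber_loss(student_scores, teacher_scores,
                           reduction="none", delta=delta)      # (B, 6)
    return (per_dim * weights).sum(dim=-1).mean()
```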
Student reward model architecture
Architecture of the distilled student reward model.
Judge radar comparison
RoboAlign-Judge vs. off-the-shelf VLM judges on six dimensions.
Token-based world model
Token-based robot video world model. Dual-branch FSQ tokenizer + 12-layer LLaMA Transformer; post-trained with GRPO under the distilled multimodal reward.

GRPO with a Distilled Multimodal Reward

A dual-branch FSQ tokenizer encodes context and dynamics tokens; action tokens are interleaved to form a unified sequence modelled by a 12-layer causal Transformer. We pre-train with next-token prediction on dynamics tokens only.
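
For concreteness, one plausible layout of the unified sequence and its pre-training loss mask (the interleaving granularity and the helper below are our assumptions; the text states only that action tokens are interleaved and that the loss covers dynamics tokens only):

```python
def build_sequence(ctx_tokens, step_actions, step_dynamics):
    """Interleave per-step action tokens with the dynamics tokens they condition.

    Returns the unified token sequence and a mask marking loss-bearing positions
    (dynamics tokens only, matching the pre-training objective).
    """
    seq = list(ctx_tokens)
    loss_mask = [False] * len(ctx_tokens)            # context: conditioning only
    for a_toks, d_toks in zip(step_actions, step_dynamics):
        seq.extend(a_toks)
        loss_mask.extend([False] * len(a_toks))      # actions: conditioning only
        seq.extend(d_toks)
        loss_mask.extend([True] * len(d_toks))       # dynamics: predicted targets
    return seq, loss_mask
```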

During RL post-training, a group of G rollouts is sampled, each scored by the student reward. Group-normalized advantages feed a clipped GRPO objective with a KL penalty to the SFT reference:

L_GRPO = − E[ min( ρ · A, clip(ρ, 1−ε, 1+ε) · A ) ] + β · KL(π || π0)

Composite reward: R = Σk wk · gψ(l, v̂)k, a weighted sum of the six student-predicted dimensions for instruction l and generated rollout v̂.
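
A compact sketch of the update, using sequence-level importance ratios for brevity (token-level credit assignment, batching, and the KL estimator are simplifications on our part):

```python
import torch

def grpo_loss(logp_new: torch.Tensor,   # (G,) sequence log-probs, current policy
              logp_old: torch.Tensor,   # (G,) sequence log-probs, sampling policy
              rewards: torch.Tensor,    # (G,) composite student-reward scores
              kl_to_ref: torch.Tensor,  # scalar KL(pi || pi_0) vs. SFT reference
              eps: float = 0.2,
              beta: float = 0.02) -> torch.Tensor:
    """Clipped GRPO objective with group-normalized advantages."""
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)   # group baseline
    rho = torch.exp(logp_new - logp_old)                        # importance ratio
    unclipped = rho * adv
    clipped = torch.clamp(rho, 1.0 - eps, 1.0 + eps) * adv
    return -torch.min(unclipped, clipped).mean() + beta * kl_to_ref
```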

Sliding Window Re-encoding (SWR)

Inspired by StreamingLLM's attention-sink mechanism, SWR partitions a T-frame rollout into K=⌈T/W⌉ segments. At each segment boundary, the last predicted frame is decoded to pixels and then re-encoded as fresh context tokens, resetting the autoregressive prompt.

This training-free decode–re-encode cycle truncates token-level drift across segments while keeping the active KV-cache bounded by O(W).
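
A minimal sketch of the decoding loop, with all model and tokenizer interfaces passed in as assumed callables (none of these names are the paper's API):

```python
from typing import Any, Callable, List, Sequence

def swr_rollout(predict_tokens: Callable,  # (context, action) -> new frame tokens
                decode_frame: Callable,    # frame tokens -> pixel frame
                encode_frame: Callable,    # pixel frame -> fresh context tokens
                init_context: List[Any],
                actions: Sequence[Any],
                window: int) -> List[Any]:
    """Training-free Sliding Window Re-encoding over a T-step rollout."""
    frames: List[Any] = []
    ctx = list(init_context)
    for t, action in enumerate(actions):
        tok = predict_tokens(ctx, action)   # one autoregressive step
        frames.append(decode_frame(tok))    # tokens -> pixels
        ctx.extend(tok)                     # context grows within the segment
        if (t + 1) % window == 0:           # segment boundary: refresh
            # Decode->re-encode: the latest pixel frame becomes the new prompt,
            # truncating token-level drift and bounding the KV-cache at O(W).
            ctx = encode_frame(frames[-1])
    return frames
```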

Stability bound. E_SWR(T) ≤ Wε + (Wε + δ)/(1 − α^W), independent of the total horizon T, vs. O(Tε) for vanilla autoregressive rollouts.
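For illustration only, with hypothetical constants (not from the paper): taking W = 6, ε = 0.01, δ = 0.05, and α = 0.9 gives a bound of 6·0.01 + (0.06 + 0.05)/(1 − 0.9^6) ≈ 0.29 for any horizon, whereas the vanilla O(Tε) term alone exceeds 0.29 once T ≈ 30.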
Sliding Window Re-encoding schematic
SWR: analogy to StreamingLLM (top) and our periodic pixel-space refresh (bottom); technical details of one refresh step on the right.
Results

Quantitative & Qualitative Evaluation

RoboAlign-R1 leads across all six judged dimensions on RobotWorldBench and wins on pixel-level metrics on both RT-1 and BridgeData V2.

RQ1 · Main results on RobotWorldBench

(Dimensions group as Task Alignment: Instr., Manip., Act.-Out.; Physical Realism: Temp., Contact, Phys.)

| Method | Instr. ↑ | Manip. ↑ | Act.-Out. ↑ | Temp. ↑ | Contact ↑ | Phys. ↑ | Total ↑ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Real videos | 3.00 | 2.00 | 1.00 | 1.00 | 1.00 | 2.00 | 10.00 |
| Closed video models | | | | | | | |
| Kling 2.6 | 2.42 | 1.38 | 0.46 | 0.58 | 0.82 | 1.18 | 6.84 |
| Runway Gen-4.5 | 2.34 | 1.29 | 0.43 | 0.55 | 0.79 | 1.12 | 6.52 |
| MiniMax Hailuo 02 | 2.18 | 1.17 | 0.38 | 0.47 | 0.71 | 1.01 | 5.92 |
| Luma Dream Machine | 2.03 | 1.08 | 0.35 | 0.44 | 0.69 | 0.92 | 5.51 |
| Open video models | | | | | | | |
| HunyuanVideo-I2V | 1.20 | 0.40 | 0.24 | 0.56 | 0.96 | 1.56 | 4.92 |
| LTX-Video | 2.28 | 1.32 | 0.52 | 0.26 | 0.28 | 1.00 | 5.66 |
| Stable Video Diffusion XT | 1.84 | 0.98 | 0.46 | 0.06 | 0.36 | 0.58 | 4.28 |
| Mochi-1 | 1.83 | 0.91 | 0.31 | 0.39 | 0.61 | 0.82 | 4.87 |
| CogVideoX-I2V | 1.94 | 1.00 | 0.40 | 0.42 | 0.92 | 1.24 | 5.92 |
| OpenSora-I2V | 1.58 | 0.82 | 0.27 | 0.31 | 0.53 | 0.71 | 4.22 |
| OpenSora-Plan-I2V | 1.79 | 0.89 | 0.30 | 0.37 | 0.60 | 0.81 | 4.76 |
| I2VGen-XL | 1.56 | 0.66 | 0.32 | 0.00 | 0.44 | 0.56 | 3.54 |
| Embodied / interactive world-model baselines | | | | | | | |
| RLVR-World | 2.29 | 1.31 | 0.44 | 0.43 | 0.84 | 1.19 | 6.54 |
| iVideoGPT | 2.60 | 1.60 | 0.70 | 0.74 | 0.56 | 1.54 | 7.74 |
| RoboDreamer | 2.02 | 1.02 | 0.34 | 0.32 | 0.04 | 0.70 | 4.44 |
| Vid2World | 2.18 | 1.08 | 0.40 | 0.22 | 0.98 | 1.04 | 5.90 |
| Wan2.2-TI2V-5B (LoRA) | 2.40 | 1.02 | 0.41 | 0.24 | 0.98 | 0.96 | 6.01 |
| RoboAlign-R1 (ours) | 2.72 | 1.72 | 0.72 | 0.78 | 1.00 | 1.58 | 8.52 |

RQ2 · Distilled reward vs. low-level RL rewards

RQ2 reward ablation
Distilled student reward delivers the strongest judge-aligned improvement across semantic and physical dimensions (a), and the best full-rollout low-level metrics on RT-1 and BridgeData V2 (b–e); +33.8% aggregate over the best single-metric reward (LPIPS).

RQ3 · Sliding Window Re-encoding (W=6)

| Metric | Default AR | SWR (W=6) | Δ |
| --- | --- | --- | --- |
| Quality | | | |
| SSIM ↑ | 0.7526 | 0.7735 | +2.8% |
| PSNR (dB) ↑ | 20.49 | 21.11 | +0.62 dB |
| LPIPS ↓ | 0.2078 | 0.1875 | −9.8% |
| ROI-LPIPS ↓ | — | — | −12.2% |
| Efficiency | | | |
| Total time (s) | 5.646 | 5.709 | +1.1% |
| Throughput (FPS) | 5.31 | 5.26 | −0.9% |
| Peak seq. length | 1× (ref.) | 0.45× | −54.8% |
| Peak memory | 1× (ref.) | 0.96× | −4.2% |
RQ3 SWR case
SWR maintains a near-flat per-frame latency profile (vanilla AR drifts +6.8% by frame 29) and produces visibly more stable long-horizon generations.
Interactive

RoboAlign-Judge — Live Inference Demo

Watch RoboAlign-Judge (Qwen3-VL-8B-Thinking fine-tuned on RobotWorldBench) evaluate a generated rollout end-to-end: it reads the instruction and the initial frame, watches the generated video, then emits a six-dimensional rubric score with a written rationale.

[Interactive widget: "Instruction" and "RoboAlign-Judge — chain-of-thought" panels; the streamed rationale and six-dimensional scores render here on the live page.]
The chain-of-thought above is streamed from the actual model output (saved from Qwen3-VL-8B-Thinking + LoRA). Scores are aggregated over 5 independent sampling runs (temperature 0.6).

Cite

BibTeX

@misc{anonymous2026roboalignr1,
  title  = {RoboAlign-R1: Distilled Multimodal Reward Alignment for Robot Video World Models},
  author = {Anonymous Authors},
  year   = {2026}
}