WFM-Eval: An Evaluation Framework for Video World Models in Robotic Manipulation

Finding 1

Object hallucination is the main differentiator between models

Across the five families, object hallucination accounts for most of the difference between models. The temporal and realism-based measures vary much less. Standard metrics such as FVD and CLIP-Score do not capture hallucination, so it is easy to miss when models are compared on appearance alone.

Finding 2

Rankings change across datasets

Model rankings differ substantially between datasets. Predict2 ranks 1st on GR1 but 5th on AgiBot, while Hunyuan goes the other way, from 3rd on GR1 to 1st on AgiBot. Models trained on a specific domain tend to do well in distribution and worse out of it: Predict2’s HSS degrades by 81% across the two datasets, compared with 19% for Hunyuan. The effect is not explained by model size, since the 2B Predict2 outperforms the 14B Wan2.2 on GR1.

Because rankings shift this much, evaluating on a single dataset can be misleading. Veo 3.1 is the only family that holds roughly the same rank on both datasets.

Table 1 · HSS / OHR / PAR per dataset (lower HSS = better)
Model	GR1 HSS	GR1 OHR	GR1 PAR	AgiBot HSS	AgiBot OHR	AgiBot PAR
Predict2	.396	.108	.067	.699	.341	.128
Veo 3.1	.436	.171	.050	.630	.274	.131
Hunyuan	.482	.200	.048	.596	.256	.128
Predict2.5	.501	.210	.049	.682	.322	.128
Wan2.2	.518	.207	.058	.661	.301	.145

In the original table, green marks the best value per dataset and red the worst. Predict2 goes from the best HSS on GR1 to the worst on AgiBot.
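The rank changes described above can be checked directly from the HSS columns of Table 1. The following is a minimal sketch, assuming that models are ranked by HSS alone with lower values ranked first; the dictionary layout and the helper name `rank_by_hss` are illustrative and not part of WFM-Eval.

```python
# HSS values copied from Table 1 (lower HSS = better).
HSS = {
    "Predict2":   {"GR1": 0.396, "AgiBot": 0.699},
    "Veo 3.1":    {"GR1": 0.436, "AgiBot": 0.630},
    "Hunyuan":    {"GR1": 0.482, "AgiBot": 0.596},
    "Predict2.5": {"GR1": 0.501, "AgiBot": 0.682},
    "Wan2.2":     {"GR1": 0.518, "AgiBot": 0.661},
}


def rank_by_hss(dataset):
    """Return {model: rank} for one dataset, where rank 1 is the lowest HSS.

    Hypothetical helper: assumes the ranking is by HSS only.
    """
    ordered = sorted(HSS, key=lambda model: HSS[model][dataset])
    return {model: position + 1 for position, model in enumerate(ordered)}


gr1_ranks = rank_by_hss("GR1")
agibot_ranks = rank_by_hss("AgiBot")

# Print each model's rank on both datasets side by side.
for model in HSS:
    print(f"{model:<11} GR1 #{gr1_ranks[model]}  AgiBot #{agibot_ranks[model]}")
```

Under that assumption, the script reproduces the positions quoted in the text for Predict2 and Hunyuan.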

Finding 3

The diagnostics track downstream policy performance

We also checked whether the hallucination ranking corresponds to how well a model works as a policy. We take the pretrained Predict2 and Predict2.5 world models, attach an action head to each, and train them as policies on LIBERO, using the model’s learned representation as the policy backbone.

The Predict2 backbone does noticeably better, by 8.75 success-rate points on average (95.45% vs. 86.70%), with the largest gap on LIBERO-Long (89.6 vs. 69.4). This is consistent with what the framework reports for these two models: Predict2.5 is worse on GR1 (0.095 higher HSS, with about 52% more phantom appearances), and we think this weaker representation is part of why its policy underperforms.
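As a quick arithmetic check, the gaps quoted above follow from the reported success rates. This sketch uses only the four numbers given in the text; the variable and function names are illustrative.

```python
# Success rates (%) as reported in the text for the two policy backbones.
average = {"Predict2": 95.45, "Predict2.5": 86.70}
libero_long = {"Predict2": 89.6, "Predict2.5": 69.4}


def gap(scores):
    """Predict2 minus Predict2.5, in success-rate points (rounded to 2 decimals)."""
    return round(scores["Predict2"] - scores["Predict2.5"], 2)


average_gap = gap(average)
long_gap = gap(libero_long)

print(f"Average gap:     {average_gap} points")
print(f"LIBERO-Long gap: {long_gap} points")
```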

**Figure 4.** Hallucination-severity distributions for Predict2 (blue) and Predict2.5 (pink). On AgiBot the two are essentially identical (KS p = 0.82). On GR1, Predict2.5 shifts toward higher severity with a spike at the maximum; this regression corresponds to its weaker LIBERO policy.
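The KS comparison in the caption can be illustrated with a two-sample Kolmogorov-Smirnov test. The source does not say which implementation was used, so the sketch below is an assumption: a standard-library version with the asymptotic p-value approximation, run on synthetic scores rather than the paper's data. The function name `ks_two_sample` and the sample values are hypothetical, and library routines that use exact small-sample distributions may give slightly different p-values.

```python
import math
import random


def ks_two_sample(a, b):
    """Two-sample KS statistic D and an asymptotic two-sided p-value."""
    a, b = sorted(a), sorted(b)
    n, m = len(a), len(b)
    i = j = 0
    d = 0.0
    # Walk the merged samples and track the largest gap between the two ECDFs.
    while i < n and j < m:
        x = min(a[i], b[j])
        while i < n and a[i] <= x:
            i += 1
        while j < m and b[j] <= x:
            j += 1
        d = max(d, abs(i / n - j / m))
    # Asymptotic Kolmogorov distribution with a small-sample correction term.
    en = math.sqrt(n * m / (n + m))
    lam = (en + 0.12 + 0.11 / en) * d
    if lam < 0.05:
        # The alternating series converges too slowly here; p is effectively 1.
        return d, 1.0
    series = sum(
        (-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
        for k in range(1, 101)
    )
    return d, min(1.0, max(0.0, 2.0 * series))


# Synthetic severity scores in [0, 1]; NOT the data behind Figure 4.
rng = random.Random(0)
model_a = [rng.betavariate(2, 5) for _ in range(200)]
model_b = [rng.betavariate(2, 5) for _ in range(200)]    # same distribution
model_b_shifted = [min(1.0, s + 0.15) for s in model_b]  # shifted, capped at the maximum

print("same distribution:", ks_two_sample(model_a, model_b))
print("shifted:          ", ks_two_sample(model_a, model_b_shifted))
```

A large p-value, like the 0.82 reported for AgiBot, means the test finds no evidence that the two severity distributions differ; a small one indicates a shift like the one described for GR1.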
