Interactive Paper

Hallucination in World Models is
Predictable and Preventable

UC San Diego

Live demo

Live interaction with our 350M-parameter world model trained on 210 tasks. Control it with your keyboard! Our hallucination predictors run at every step; a red border indicates that a hallucination is detected.

Can you make the world model hallucinate?

Controls: WASD to act, Space to pause, R to reset.

Status indicator: stable or hallucination.

Hallucination in world models

Modern generative world models render strikingly realistic, action-controllable futures. But the rollouts they produce frequently hallucinate: they stay visually fluent and superficially plausible while drifting away from the ground-truth dynamics. When used downstream for planning or policy learning, model hallucination leads to incorrect decisions.

In this work, we train a 350M-parameter generative world model on a large dataset spanning 210 tasks and show that, even at this scale, hallucination is both predictable (we can predict when it will happen) and preventable (the underlying issue is, to a great extent, fixable).

Ground truth

World model

An open-loop rollout from our 350M-parameter base model (right) vs. its ground truth (left). The imagined trajectory looks visually plausible but largely ignores the action sequence it was conditioned on. This is exactly the type of hallucination we set out to study.

We argue that hallucination in world models is, first and foremost, a data-coverage problem, making it both predictable and preventable.

A 427-hour testbed for world modeling

Studying coverage needs three things no benchmark offered at once: full control of the training pipeline, behaviorally diverse data across many tasks, and live simulators to probe the gaps online. So we built MMBench2, which includes ground-truth actions, rewards, language instructions, and a live environment for every task.

Naturally, MMBench2 is fully open-source.

At a glance: 427 hours of video, 210 tasks, 10 domains.

Example tasks (task · domain):

Pong · MiniArcade
Assembly · Meta-World
Road Runner · Atari
Walker · MuJoCo
Ant Run · ManiSkill3
Quadruped Run · DMControl
Ant · OGBench
Push Green · RoboDesk
Lunar Lander Hover · Box2D
Foraging · MiniArcade
Pick Place · Meta-World
Boxing · Atari
Hopper Hop · DMControl
Bird Attack · MiniArcade
Reacher Easy · MiniArcade
Point Maze · MiniArcade
Whirlpool · MiniArcade
Highway · MiniArcade
Rocket Collect · MiniArcade
Spaceship · MiniArcade
Cheetah Run · DMControl
Point Maze · OGBench
Open Slide · RoboDesk
Bipedal Walker Hills · Box2D
Coinrun · MiniArcade
Window Close · Meta-World
Ms. Pac-Man · Atari
Ant · MuJoCo
Hopper Hop · ManiSkill3
Walker Run · DMControl
Point Spiral · OGBench
Dungeon Explorer 1 · MiniArcade
Soccer · Meta-World
Reacher Hard · DMControl
Finger Turn Hard · DMControl
Cup Catch · DMControl
Cartpole Swingup · DMControl
Landing · MiniArcade
Air Hockey · MiniArcade

MMBench2 includes 210 tasks spanning 10 domains. Tasks include locomotion, manipulation, navigation, arcade-style environments, and more. All clips are generated by our 350M-parameter base model trained on MMBench2. If you look closely, you may notice occasional hallucinations.

The corpus contains an equal number of trajectories per task but is imbalanced in terms of frames. Episode lengths range from 25 (ManiSkill3) to 1,000 (Atari) steps, so the frame distribution is heavy-tailed. That non-uniformity is exactly the coverage structure we set out to study.
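To make the imbalance concrete, here is a small arithmetic sketch. Only the two episode lengths (25 and 1,000 steps) come from the text. The per-task trajectory count `trajectories_per_task` is a hypothetical placeholder, and the sketch assumes one frame per step.

```python
# Hypothetical number of trajectories per task; the real value is not stated here.
trajectories_per_task = 100

# Episode lengths quoted in the text (steps per trajectory).
episode_length = {"ManiSkill3 (shortest)": 25, "Atari (longest)": 1000}

# With equal trajectories per task, frame counts scale linearly with episode length
# (assuming one frame per step).
frames = {name: trajectories_per_task * steps for name, steps in episode_length.items()}

# The ratio does not depend on the trajectory count: 1000 / 25 = 40.
imbalance = max(frames.values()) / min(frames.values())
print(f"longest-to-shortest frame ratio: {imbalance:.0f}x")
```

Under these assumptions, a frame drawn uniformly from these two tasks is 40 times more likely to come from the longest-episode task than from the shortest one.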

Per-task frame counts across all 210 tasks, sorted high→low and colored by domain (log scale). The dashed line marks the per-task median of 65,260 frames.

Building a generative world model

On MMBench2 we train a 350M-parameter world model that largely follows the Dreamer 4 recipe. It consists of a video tokenizer, an action-conditioned dynamics model, and a video decoder. Any of its three components can fail independently, resulting in hallucination.

Encoder · 50M params

A video tokenizer encodes each frame into a continuous latent code z, trained jointly with the decoder via masked autoencoding.

Dynamics · 250M params

A block-causal Transformer predicts the next latent from past latents and an action token, trained with shortcut flow-matching. Encoder and decoder are frozen during dynamics training.

Decoder · 50M params

A decoder renders latent codes back to pixels. The decoder is used for supervision during tokenizer training and for human viewing at test time.

Because the stages compose sequentially, a hallucination introduced early (e.g. a corrupted encoding) is propagated and amplified by everything downstream. Naming which stage produced a failure is therefore the first step to fixing it.
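The sequential composition can be sketched as an open-loop rollout. This is a minimal illustration, not our implementation: `encode`, `step`, and `decode` are hypothetical stand-ins for the tokenizer, dynamics model, and decoder, and the toy dynamics are chosen to be expansive so that an early error visibly grows.

```python
from typing import Callable, List, Sequence

Latent = List[float]
Frame = List[float]

def open_loop_rollout(
    first_frame: Frame,
    actions: Sequence[float],
    encode: Callable[[Frame], Latent],
    step: Callable[[Latent, float], Latent],
    decode: Callable[[Latent], Frame],
) -> List[Frame]:
    """Encode once, then repeatedly predict the next latent and decode it."""
    z = encode(first_frame)
    frames = []
    for a in actions:
        z = step(z, a)            # the dynamics consumes its own previous output
        frames.append(decode(z))
    return frames

# Toy stand-ins (hypothetical): identity tokenizer, expansive linear dynamics.
encode = lambda x: list(x)
decode = lambda z: list(z)
step = lambda z, a: [1.5 * v + a for v in z]

# A corrupted encoding: every latent is offset by 0.5 before any dynamics runs.
bad_encode = lambda x: [v + 0.5 for v in x]

actions = [1.0, 1.0, 1.0]
clean = open_loop_rollout([0.0], actions, encode, step, decode)
corrupted = open_loop_rollout([0.0], actions, bad_encode, step, decode)

# Per-step gap between the two rollouts; it grows with every step.
errors = [abs(c[0] - k[0]) for c, k in zip(corrupted, clean)]
```

In this toy, the 0.5 encoding error becomes 0.75, 1.125, and 1.6875 over three steps, which is the propagate-and-amplify pattern described above.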

Three modes of hallucination

We identify three types of hallucination, each of which traces to a specific component of the world model. In the following, we contrast stable predictions (✓) with each type of hallucination (×).

Tokenizer

Perceptual

Surprisingly, the world model can hallucinate before any dynamics prediction at all. When the encoder/decoder is presented with an unseen observation, it may snap that unfamiliar structure onto the nearest scene it knows, for example by dropping a small object or even reconstructing an unseen maze as a seen one.

Encode → decode round trip, input vs. decoded frame:

✓ The decoded frame matches the input

× The ball position is not reconstructed

× Unseen maze rebuilt with a different, seen layout

× Unseen visuals are decoded poorly

Dynamics

Action marginalization

For a dynamics model to be useful for decision-making, it needs to respond to actions reliably: a different action should lead to a different outcome. If the training data has limited action diversity, the world model is likely to marginalize over actions, i.e., to generate the same trajectory regardless of the action.

Do imagined rollouts reflect the actions taken?

Action conditions:

Real: the agent’s actual command
Hold ←: command pinned left
None: no command sent
Flipped: action inverted
Hold →: command pinned right

✓ Action-aware: the agent follows the command

× Marginalized: steers with ← / → but also moves without a command

× Marginalized: inverting actions has little effect on gait

× Marginalized: gait is identical regardless of action
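One way to probe action marginalization is to roll the same start state forward under two different action sequences and measure how far the imagined trajectories separate. The sketch below is an illustrative check under our own assumptions, not the action-shuffle ratio reported later on this page, whose definition is not given here. The toy models `action_aware` and `marginalized` are hypothetical.

```python
import math

def rollout(step, z0, actions):
    """Apply the dynamics step once per action and collect the latents."""
    z, traj = list(z0), []
    for a in actions:
        z = step(z, a)
        traj.append(z)
    return traj

def action_sensitivity(step, z0, actions, alt_actions):
    """Mean per-step L2 distance between rollouts under two action sequences."""
    a = rollout(step, z0, actions)
    b = rollout(step, z0, alt_actions)
    dists = [math.dist(za, zb) for za, zb in zip(a, b)]
    return sum(dists) / len(dists)

# Hypothetical toy models: one uses the action, one ignores it.
action_aware = lambda z, a: [z[0] + a]
marginalized = lambda z, a: [z[0] + 0.5]   # same drift whatever the action

hold_left = [-1.0] * 4
hold_right = [1.0] * 4

aware_score = action_sensitivity(action_aware, [0.0], hold_left, hold_right)
marginalized_score = action_sensitivity(marginalized, [0.0], hold_left, hold_right)
```

A score near zero means the imagined future is the same whichever command is sent, which matches the marginalized cases listed above.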

Dynamics

Scene divergence

A world model can be expected to suffer from compounding error as the rollout horizon increases. However, we find that, regardless of rollout horizon, dynamics can also diverge abruptly when entering low-coverage regions of the state space. This may result in the agent teleporting or penetrating walls, or in objects suddenly disappearing.

Multi-step rollout: imagined future (world model) vs. reality (ground truth)

× The agent first penetrates a wall and then teleports

× The blue velocity field disappears

× The agent gets stuck in a small region of the state space

Predicting hallucinations

We find that model hallucinations can be detected at runtime. We derive three label-free predictors that are computable on the fly from quantities the model already produces. Based on these metrics, we can then predict and visualize exactly where hallucinations will happen.

Three hallucination metrics

$u_r$

Tokenizer round-trip residual

$u_r = \lVert \hat z - \mathrm{Enc}(\mathrm{Dec}(\hat z)) \rVert$

How far a predicted latent moves when its decoded frame is re-encoded. On-manifold predictions survive the round trip; hallucinated ones drift.
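The round-trip residual can be sketched in a few lines. The Euclidean norm is an assumption (the formula above does not fix the norm), and `enc` / `dec` are hypothetical toy stand-ins: the decoder clips values to [-1, 1], playing the role of a tokenizer that can only render scenes it knows.

```python
import math

def round_trip_residual(z_hat, enc, dec):
    """u_r = || z_hat - Enc(Dec(z_hat)) ||, here with the Euclidean norm."""
    z_back = enc(dec(z_hat))
    return math.dist(z_hat, z_back)

# Hypothetical toy tokenizer: the decoder clips to [-1, 1], the encoder is identity.
def dec(z):
    return [max(-1.0, min(1.0, v)) for v in z]

def enc(x):
    return list(x)

on_manifold = [0.2, -0.7]    # inside the range the toy decoder can represent
off_manifold = [0.2, 2.5]    # second coordinate falls outside that range

u_r_on = round_trip_residual(on_manifold, enc, dec)
u_r_off = round_trip_residual(off_manifold, enc, dec)
```

The on-manifold latent survives the round trip unchanged, while the off-manifold one is pulled back by 1.5 in its second coordinate.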

$u_f$

Flow instability

How much the denoiser's clean-frame prediction moves between Euler substeps. A well-conditioned step settles fast; an under-conditioned one keeps oscillating.
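A rough sketch of this quantity, under assumptions that this section does not spell out: time runs from noise at t = 0 to data at t = 1, the sampler takes uniform Euler substeps, and the clean-frame estimate at each substep is the straight-line extrapolation x + (1 - t) v. The velocity functions are hypothetical toys, and shortcut-model details are omitted.

```python
import math

def flow_instability(velocity, x0, num_steps):
    """Mean movement of the clean-sample estimate between Euler substeps."""
    if num_steps < 2:
        raise ValueError("need at least two substeps to compare estimates")
    x, dt = list(x0), 1.0 / num_steps
    clean_preds = []
    for k in range(num_steps):
        t = k * dt
        v = velocity(x, t)
        # Clean estimate implied by a straight path from x at time t to t = 1.
        clean_preds.append([xi + (1.0 - t) * vi for xi, vi in zip(x, v)])
        x = [xi + dt * vi for xi, vi in zip(x, v)]   # Euler update
    moves = [math.dist(a, b) for a, b in zip(clean_preds, clean_preds[1:])]
    return sum(moves) / len(moves)

# Toy velocity that always points at the same clean target (+1).
def steady_velocity(x, t):
    return [(1.0 - xi) / (1.0 - t) for xi in x]

# Toy velocity whose implied target flips between +1 and -1 on alternate substeps.
def oscillating_velocity(x, t, num_steps=4):
    target = 1.0 if round(t * num_steps) % 2 == 0 else -1.0
    return [(target - xi) / (1.0 - t) for xi in x]

u_f_steady = flow_instability(steady_velocity, [0.0], 4)
u_f_oscillating = flow_instability(oscillating_velocity, [0.0], 4)
```

The steady field gives a score of essentially zero because every substep agrees on the clean frame. The oscillating field scores about 2, the distance between its two competing targets.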

$u_s$

Inter-seed variance

How much the next-latent prediction varies across independent denoising seeds. Concentrated predictions indicate a well-determined transition; dispersed ones indicate where rollouts diverge.
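A minimal sketch of inter-seed variance. The aggregation (population variance per latent dimension, averaged over dimensions) is an assumption, and the samplers are hypothetical toys that add Gaussian noise in place of a real denoising process.

```python
import random
import statistics

def inter_seed_variance(sample_next_latent, z, action, seeds):
    """Mean per-dimension variance of next-latent samples across seeds."""
    samples = [sample_next_latent(z, action, random.Random(s)) for s in seeds]
    per_dim = [statistics.pvariance(dim) for dim in zip(*samples)]
    return sum(per_dim) / len(per_dim)

# Hypothetical toy samplers: same mean transition, different spread.
def well_determined(z, a, rng):
    return [v + a + rng.gauss(0.0, 0.01) for v in z]

def dispersed(z, a, rng):
    return [v + a + rng.gauss(0.0, 1.0) for v in z]

z, action, seeds = [0.0, 0.0], 1.0, range(8)
u_s_tight = inter_seed_variance(well_determined, z, action, seeds)
u_s_wide = inter_seed_variance(dispersed, z, action, seeds)
```

Each extra seed costs one more denoising pass, so the number of seeds trades compute against how reliable the variance estimate is.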

In practice we use dynamism-normalized variants $u^{\text{norm}} = u/m$, dividing out per-step scene motion $m$ so each hallucination predictor tracks uncertainty relative to how much is happening in the scene.
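The normalization itself is a per-step division. In the sketch below, the floor `eps` on $m$ is an added assumption to avoid dividing by zero in static scenes, and the numbers are made up. How $m$ is measured is not specified in this section, so it is passed in as given.

```python
def dynamism_normalize(u, m, eps=1e-8):
    """u_norm[t] = u[t] / m[t], with a small floor eps on m (an added assumption)."""
    if len(u) != len(m):
        raise ValueError("u and m must have one entry per step")
    return [ut / max(mt, eps) for ut, mt in zip(u, m)]

# Same raw score at both steps, but very different amounts of scene motion.
u = [0.4, 0.4]
m = [2.0, 0.1]   # busy scene vs. nearly static scene
u_norm = dynamism_normalize(u, m)
```

The same raw score of 0.4 becomes 0.2 in the busy scene and 4.0 in the nearly static one, so the static step stands out as the more suspicious of the two.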

Hallucination vs. state density

If we visualize hallucination as measured by our predictors (in this case tokenizer round-trip residual $u_r$) across the state space, a pattern becomes clear: $u_r$ is high exactly where there is low state density in the training data.

Figure panels, each with a trajectory overlaid: Point Maze · OGBench; Rocket Collect · MiniArcade; Cup Catch · DMControl; Lunar Lander · Box2D. Color scales: state density (low to high) and $u_r$ (hallucination, low to high).

Do these metrics predict model error?

Yes! Across 9k held-out sequences, each hallucination predictor tracks the realized rollout error at Spearman $\rho \approx 0.8$, without requiring any labels or additional training.
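For readers unfamiliar with the statistic, Spearman's $\rho$ is the Pearson correlation of the ranks, so it measures monotone agreement rather than linear fit. The sketch below computes it with the standard library. The five data points are made up for illustration and are not results from our evaluation.

```python
def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Made-up predictor scores and rollout errors for five sequences.
predictor = [0.1, 0.4, 0.35, 0.8, 1.2]
error = [1.0, 2.0, 5.0, 9.0, 20.0]
rho = spearman_rho(predictor, error)
```

Here one adjacent pair is ranked in the opposite order, which gives $\rho = 0.9$. A perfectly monotone relationship would give 1.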

Each point corresponds to a held-out 24-frame trajectory. Hover a point to trace that same sequence across all three panels. The purple curve is the median; the dashed line marks the scene-divergence threshold ($\Delta$PSNR = 0).

Run-time detection of hallucination

Perhaps most interestingly, we can also use our three metrics to detect hallucination at run-time. In the following, we visualize (normalized) values for each of the three predictors as a function of time. Select a rollout below and watch the predictors reach their hallucination threshold as the rollout starts diverging. For comparison, we also include examples of rollouts where no hallucination occurs.

Per-frame round-trip residual, flow instability, and inter-seed variance from a single autoregressive model rollout, each shown relative to its hallucination threshold.
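A per-step detector along these lines might look as follows. The threshold values are hypothetical, and flagging a step when any one predictor reaches its threshold is an assumption. This section does not specify how the three signals are combined.

```python
def detect_hallucination(scores, thresholds):
    """Return (flag, ratios): ratios[name] = score / threshold; flag if any ratio >= 1."""
    ratios = {name: scores[name] / thresholds[name] for name in thresholds}
    return any(r >= 1.0 for r in ratios.values()), ratios

# Hypothetical thresholds for the three predictors.
thresholds = {"u_r": 0.5, "u_f": 0.2, "u_s": 0.1}

# Made-up per-step predictor values for a three-step rollout.
rollout = [
    {"u_r": 0.10, "u_f": 0.05, "u_s": 0.02},
    {"u_r": 0.30, "u_f": 0.10, "u_s": 0.05},
    {"u_r": 0.75, "u_f": 0.10, "u_s": 0.05},
]

flags = [detect_hallucination(step, thresholds)[0] for step in rollout]
first_flagged = flags.index(True) if True in flags else None
```

Expressing each score as a ratio to its threshold puts the three predictors on a common scale, which is how they are plotted above.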

Closing the gap at training time

If every failure mode is a coverage gap, one data-centric lever should move all three at once. We resample the existing corpus to be uniform across tasks rather than frames, upweighting under-represented tasks and improving results at no additional cost.
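The resampling can be sketched as a per-frame weight that gives every task the same total probability. The task names and frame counts below are made up, and this is an illustration of the idea rather than our training code.

```python
import random

def task_uniform_weights(frame_counts):
    """Per-frame sampling weight so that every task gets equal total probability."""
    num_tasks = len(frame_counts)
    return {task: 1.0 / (num_tasks * n) for task, n in frame_counts.items()}

def sample_frame(frame_counts, rng):
    """Equivalent two-stage sampler: pick a task uniformly, then a frame within it."""
    task = rng.choice(sorted(frame_counts))
    return task, rng.randrange(frame_counts[task])

# Made-up frame counts for a short-episode and a long-episode task.
frame_counts = {"short_task": 2_500, "long_task": 100_000}

weights = task_uniform_weights(frame_counts)
task_mass = {t: weights[t] * n for t, n in frame_counts.items()}

# For comparison: share of the short task under frame-uniform sampling.
frame_uniform_share = frame_counts["short_task"] / sum(frame_counts.values())

task, frame_index = sample_frame(frame_counts, random.Random(0))
```

With these counts the short task rises from about 2.4% of samples under frame-uniform sampling to 50% under task-uniform sampling, without adding any new data.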

Rollout $\Delta$PSNR (dB): +0.88
Recon PSNR (dB): +0.44
Action-shuffle ratio: +0.29
$u_r^{\text{norm}}$ (lower better): −0.20
$u_s^{\text{norm}}$ (lower better): −0.14
$u_f^{\text{norm}}$ (lower better): −0.07

Mean change vs. the base model on held-out trajectories across 200 tasks, applying coverage-aware training to both tokenizer and dynamics. Model quality improves on all three quality metrics, and all three hallucination predictors decrease.

Hallucination-guided data collection

Since hallucination is found to be a data coverage gap, a simple yet effective strategy is to correct model error via targeted data collection. During live environment interaction, we roll out candidate trajectories in the world model, score them by predicted hallucination, and execute the most hallucination-prone one. This allows us to reduce model hallucination with just 50 trajectories per task collected autonomously.
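The selection step described above can be sketched as follows. The toy world, the scoring function, and the choice to sum the score over the imagined rollout are assumptions made for illustration. Executing the chosen sequence in the real environment and finetuning on the result are omitted.

```python
def pick_most_hallucination_prone(z0, candidates, step, score):
    """Imagine each candidate action sequence and return the highest-scoring one."""
    def total_score(actions):
        z, total = z0, 0.0
        for a in actions:
            z = step(z, a)       # imagined transition in the world model
            total += score(z)    # predicted hallucination at the imagined state
        return total
    return max(candidates, key=total_score)

# Hypothetical 1-D world: the training data covered |z| <= 1,
# so the toy predictor grows once a rollout leaves that range.
step = lambda z, a: z + a
score = lambda z: max(0.0, abs(z) - 1.0)

candidates = [
    [0.0] * 5,                     # stay put
    [1.0, -1.0, 1.0, -1.0, 1.0],   # oscillate inside the covered region
    [0.5] * 5,                     # drift slowly out of coverage
    [1.0] * 5,                     # head straight out of coverage
]
chosen = pick_most_hallucination_prone(0.0, candidates, step, score)
```

The chosen sequence is the one that spends the most imagined time outside the covered region, which is where new real data should help the model most.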

Panels in each clip: ground truth, base model, after targeted collection.

Point Maze (OGBench), an unseen layout. The base model quietly drifts to a seen layout; targeted data collection and finetuning recover the true geometry.

Dungeon Explorer (MiniArcade), an unseen transfer task. The base model drifts into a scene from a visually similar Atari game; after finetuning with just 50 trajectories, the model faithfully captures the dynamics of this new task.

Cup Catch (DMControl), an unseen variant. The base model hallucinates visuals seen during training; finetuning restores the correct visuals.

Reacher Easy (MiniArcade), an unseen task. The base model dissolves the scene entirely; finetuning restores the arm and its target, but the dynamics prediction still diverges eventually.

These clips are qualitative; the finetuned model simply looks right. However, what matters in the end is whether a world model is good enough to act with. To answer this, we evaluate closed-loop planning (MPC) performance with each of 6 finetuned models, varying only the data source (the policy used to collect those 50 trajectories). If curiosity-based data collection using our proposed hallucination predictors is effective, its downstream task performance should approach privileged data collection strategies that rely on humans or expert policies.

Adapting to 10 unseen tasks. Downstream task performance (closed-loop MPC) after finetuning on 50 trajectories per task, varying only the behavior policy used for data collection. Curiosity using $u_r^{\text{norm}}$ reaches ~90% of the expert/human oracles.

Random actions (no model): 0.118
Zero-shot transfer: 0.276

After finetuning on 50 trajectories per task, varying only data source:

No-op policy: 0.163
Random policy: 0.228
Curiosity $u_r^{\text{norm}}$ (ours): 0.325
Expert policy: 0.362
Human play: 0.362
All sources combined: 0.390
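For readers new to closed-loop MPC, the scores above come from repeatedly planning inside the world model and executing only the first planned action. The sketch below illustrates that loop with an exhaustive search over a tiny discrete action set. The planner, the toy dynamics, and the reward are all assumptions chosen so the example is deterministic; practical planners sample candidate sequences instead, and this is not our evaluation setup.

```python
import itertools

def mpc_action(z, step, reward, horizon, action_set=(-1.0, 0.0, 1.0)):
    """Score every action sequence in imagination; return the first action of the best."""
    best_return, best_first = float("-inf"), None
    for actions in itertools.product(action_set, repeat=horizon):
        zi, ret = z, 0.0
        for a in actions:
            zi = step(zi, a)      # imagined transition in the world model
            ret += reward(zi)
        if ret > best_return:
            best_return, best_first = ret, actions[0]
    return best_first

# Hypothetical toy task: a 1-D point that should reach and hold position 3.
step = lambda z, a: z + a
reward = lambda z: -abs(z - 3.0)

z = 0.0
visited = []
for _ in range(10):
    a = mpc_action(z, step, reward, horizon=3)
    z = step(z, a)   # in a real closed loop, the environment is stepped here
    visited.append(z)
```

Because the plan is recomputed at every step from the latest state, errors in the world model affect the result only through the first action of each plan. This is why planning performance is a useful test of whether the model is good enough to act with.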

Why do the data collection strategies rank the way they do? We find that it is, yet again, related to coverage. The figure below shows trajectories collected via each policy on a maze with a central bottleneck. Curiosity tends to target walls, which is exactly where the model has been found to hallucinate, for example by letting the agent penetrate them.

Data coverage by collection policy. State density over 50 collected trajectories on Point Maze (OGBench), an H-shaped layout joined by a single bottleneck corridor. Darker cells are visited more often.

Panels: No-op; Random; Expert $\pi$ (clean goal-directed paths through the bottleneck, sparse elsewhere); Curiosity (ours) (broad coverage of both chambers and the bottleneck); Human. Color scale: less visited to more visited.

In summary, we show that finetuning with just 50 trajectories can greatly improve world modeling in both seen and unseen tasks, and that a majority of these gains can be realized autonomously via curiosity-based exploration targeting hallucination-prone regions.

100% open

To support further research on generative world modeling, we release:

Code

Training and evaluation for our world models on MMBench2.

Dataset

427 hours, 210 tasks, ground-truth actions and rewards, live simulators.

Models

Pretrained and finetuned 350M-parameter models.

Citation

If you find our work useful, please consider citing the paper:

@misc{Hansen2026Hallucination,
  title={Hallucination in World Models is Predictable and Preventable},
  author={Nicklas Hansen and Xiaolong Wang},
  year={2026}
}

References

A selection of key works this paper builds on, ordered by year. The complete bibliography is available in our paper.

Pathak et al. (2017). Curiosity-Driven Exploration by Self-Supervised Prediction. ICML.
Lakshminarayanan et al. (2017). Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles. NeurIPS.
Ha & Schmidhuber (2018). World Models. NeurIPS.
Hafner et al. (2020). Dream to Control: Learning Behaviors by Latent Imagination. ICLR.
Sekar et al. (2020). Planning to Explore via Self-Supervised World Models. ICML.
Hansen & Wang (2021). Generalization in Reinforcement Learning by Soft Data Augmentation. ICRA.
He et al. (2022). Masked Autoencoders Are Scalable Vision Learners. CVPR.
Ji et al. (2023). Survey of Hallucination in Natural Language Generation. ACM Computing Surveys.
Hansen, Su & Wang (2024). TD-MPC2: Scalable, Robust World Models for Continuous Control. ICLR.
Huang et al. (2024). VBench: Comprehensive Benchmark Suite for Video Generative Models. CVPR.
Bruce et al. (2024). Genie: Generative Interactive Environments. ICML.
Valevski et al. (2024). Diffusion Models Are Real-Time Game Engines. arXiv:2408.14837.
Hafner, Yan & Lillicrap (2025). Training Agents Inside of Scalable World Models. arXiv:2509.24527.
Hansen, Su & Wang (2026). Learning Massively Multitask World Models for Continuous Control. ICLR.