Hallucination in World Models is Predictable and Preventable

UC San Diego

Live demo

Live interaction with our 350M-parameter world model trained on 210 tasks. Control it with your keyboard! Our hallucination predictors run at every step; a red border indicates that a hallucination is detected.

Can you make the world model hallucinate?

Live world-model rollout. Controls: WASD to act, Space to pause, R to reset. Status indicator: stable or hallucination.

Hallucination in world models

Modern generative world models render strikingly realistic, action-controllable futures. But the rollouts they produce frequently hallucinate: they stay visually fluent and superficially plausible while drifting away from the ground-truth dynamics. When used downstream for planning or policy learning, model hallucination leads to incorrect decisions.

In this work, we train a 350M-parameter generative world model on a large dataset spanning 210 tasks and show that, even at this scale, hallucination is both predictable (we can predict when it will happen) and preventable (the underlying issue is, to a great extent, fixable).

Ground truth
World model

An open-loop rollout from our 350M-parameter base model (right) vs. its ground truth (left). The imagined trajectory looks visually plausible but largely ignores the action sequence it was conditioned on. This is exactly the type of hallucination we set out to study.

We argue that hallucination in world models is, first and foremost, a data-coverage problem, making it both predictable and preventable.

A 427-hour testbed for world modeling

Studying coverage needs three things that no benchmark offered at once: full control of the training pipeline, behaviorally diverse data across many tasks, and live simulators to probe the gaps online. So we built MMBench2, which includes ground-truth actions, rewards, language instructions, and a live environment for every task.

Naturally, MMBench2 is fully open-source.

Pong · MiniArcade
Assembly · Meta-World
Road Runner · Atari
Walker · MuJoCo
Ant Run · ManiSkill3
Quadruped Run · DMControl
Ant · OGBench
Push Green · RoboDesk
Lunar Lander Hover · Box2D
Foraging · MiniArcade
Pick Place · Meta-World
Boxing · Atari
Hopper Hop · DMControl
Bird Attack · MiniArcade
Reacher Easy · MiniArcade
Point Maze · MiniArcade
Whirlpool · MiniArcade
Highway · MiniArcade
Rocket Collect · MiniArcade
Spaceship · MiniArcade
Cheetah Run · DMControl
Point Maze · OGBench
Open Slide · RoboDesk
Bipedal Walker Hills · Box2D
Coinrun · MiniArcade
Window Close · Meta-World
Ms. Pac-Man · Atari
Ant · MuJoCo
Hopper Hop · ManiSkill3
Walker Run · DMControl
Point Spiral · OGBench
Dungeon Explorer 1 · MiniArcade
Soccer · Meta-World
Reacher Hard · DMControl
Finger Turn Hard · DMControl
Cup Catch · DMControl
Cartpole Swingup · DMControl
Landing · MiniArcade
Air Hockey · MiniArcade

MMBench2 includes 210 tasks spanning 10 domains. Tasks include locomotion, manipulation, navigation, arcade-style environments, and more. All clips are generated by our 350M-parameter base model trained on MMBench2. If you look closely, you may notice occasional hallucinations.

The corpus contains an equal number of trajectories per task but is imbalanced in terms of frames. Episode lengths range from 25 (ManiSkill3) to 1,000 (Atari) steps, so the frame distribution is heavy-tailed. That non-uniformity is exactly the coverage structure we set out to study.

Per-task frame counts across all 210 tasks, sorted high→low and colored by domain (log scale). Hover any bar for the task; the dashed line marks the per-task median of 65,260 frames.

Building a generative world model

On MMBench2 we train a 350M-parameter world model that largely follows the Dreamer 4 recipe. It consists of a video tokenizer, an action-conditioned dynamics model, and a video decoder. Any of its three components can fail independently, resulting in hallucination.

frame → Encoder (tokenizer) → z → Dynamics (block-causal Transformer) → Decoder (renderer) → frame′
Encoder · 50M params

A video tokenizer encodes each frame into a continuous latent code z, trained jointly with the decoder via masked autoencoding.

Dynamics · 250M params

A block-causal Transformer predicts the next latent from past latents and an action token, trained with shortcut flow-matching. Encoder and decoder are frozen during dynamics training.

Decoder · 50M params

A decoder renders latent codes back to pixels. The decoder is used for supervision during tokenizer training and for human viewing at test time.

Because the stages compose sequentially, a hallucination introduced early (e.g. a corrupted encoding) is propagated and amplified by everything downstream. Naming which stage produced a failure is therefore the first step to fixing it.
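
To make the sequential composition concrete, here is a minimal sketch of an open-loop rollout. It is not our implementation: `open_loop_rollout` and the toy `encode`, `step`, and `decode` functions are hypothetical stand-ins, and the toy dynamics only sees the current latent rather than a history of past latents.

```python
from typing import Callable, List, Sequence

Vector = List[float]

def open_loop_rollout(
    first_frame: Vector,
    actions: Sequence[float],
    encode: Callable[[Vector], Vector],
    step: Callable[[Vector, float], Vector],
    decode: Callable[[Vector], Vector],
) -> List[Vector]:
    """Encode once, then predict latents autoregressively and decode each one."""
    z = encode(first_frame)
    frames = []
    for a in actions:
        z = step(z, a)            # dynamics: next latent from latent + action
        frames.append(decode(z))  # decoder: latent back to "pixels"
    return frames

# Toy stand-ins (hypothetical): a scaling tokenizer and additive dynamics.
encode = lambda x: [0.5 * v for v in x]
decode = lambda z: [2.0 * v for v in z]
step = lambda z, a: [v + a for v in z]

# Same pipeline, but with an encoder that is off by a constant.
corrupted_encode = lambda x: [0.5 * v + 0.25 for v in x]

clean = open_loop_rollout([1.0, 2.0], [0.5, 0.5], encode, step, decode)
corrupted = open_loop_rollout([1.0, 2.0], [0.5, 0.5], corrupted_encode, step, decode)
print(clean)      # [[2.0, 3.0], [3.0, 4.0]]
print(corrupted)  # [[2.5, 3.5], [3.5, 4.5]]
```

In this toy, the encoding error is carried unchanged into every decoded frame; a learned, nonlinear dynamics model can additionally amplify it.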

Three modes of hallucination

We identify three types of hallucination, each of which traces to a specific component of the world model. In the following, we contrast stable predictions (✓) with each type of hallucination (×).

i. Perceptual (Tokenizer)

Surprisingly, the world model can hallucinate before any dynamics prediction at all. When the encoder/decoder is presented with an unseen observation, it may sometimes snap that unfamiliar structure onto the nearest scene it knows; for example, dropping a small object or even reconstructing an unseen maze as a seen one.

Encode → decode examples: input frame → Encoder (tokenizer) → z → Decoder (renderer) → decoded frame
✓ The decoded frame matches the input
× The ball position is not reconstructed
× Unseen maze rebuilt with a different, seen layout
× Unseen visuals are decoded poorly

ii. Action marginalization (Dynamics)

For a dynamics model to be useful for decision-making, it needs to respond to actions reliably: a different action should lead to a different outcome. If the training data has limited action diversity, the world model is likely to marginalize over actions, i.e., to generate the same trajectory regardless of the action.

Do imagined rollouts reflect the actions taken?
Conditions: Real (the agent’s actual command), None (no command sent), Flipped (action inverted); or Hold ← (command pinned left), None (no command sent), Hold → (command pinned right)
✓ Action-aware: the agent follows the command
× Marginalized: steers with ← / → but also moves without a command
× Marginalized: inverting actions has little effect on gait
× Marginalized: gait is identical regardless of action
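
One simple way to probe for action marginalization is to roll out the same start state under the real and the inverted action sequence and measure how far the two imagined trajectories separate. The sketch below illustrates that idea only; `action_sensitivity` and both toy dynamics functions are hypothetical, and this is not the definition of the action-shuffle ratio we report later.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]

def rollout(z0: Vector, actions: Sequence[float],
            step: Callable[[Vector, float], Vector]) -> List[Vector]:
    """Apply the dynamics step by step and collect the visited latents."""
    z, out = list(z0), []
    for a in actions:
        z = step(z, a)
        out.append(z)
    return out

def action_sensitivity(z0: Vector, actions: Sequence[float],
                       step: Callable[[Vector, float], Vector]) -> float:
    """Mean latent distance between rollouts under real and sign-flipped actions."""
    real = rollout(z0, actions, step)
    flipped = rollout(z0, [-a for a in actions], step)
    dists = [math.dist(p, q) for p, q in zip(real, flipped)]
    return sum(dists) / len(dists)

# Two toy dynamics models (hypothetical stand-ins for a learned model).
action_aware = lambda z, a: [z[0] + a, z[1]]    # position follows the command
marginalized = lambda z, a: [z[0] + 0.1, z[1]]  # same motion whatever the command

actions = [1.0, 1.0, -1.0]
aware_score = action_sensitivity([0.0, 0.0], actions, action_aware)
marginalized_score = action_sensitivity([0.0, 0.0], actions, marginalized)
print(aware_score, marginalized_score)
```

A score near zero means that inverting the actions barely changes the imagined future, which is the signature of marginalization.
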
iii. Scene divergence (Dynamics)

A world model can be expected to suffer from compounding error as the rollout horizon increases. However, we find that, regardless of rollout horizon, dynamics can also diverge abruptly when entering low-coverage regions of the state space. This may result in the agent teleporting or penetrating walls, or in objects suddenly disappearing.

Multi-step rollout: imagined future vs. reality (ground truth vs. world model)
× The agent first penetrates a wall and then teleports
× The blue velocity field disappears
× The agent gets stuck in a small region of the state space

Predicting hallucinations

We find that model hallucinations can be detected at runtime. We derive three label-free predictors that are computable on the fly from quantities the model already produces. Based on these metrics, we can then predict and visualize exactly where hallucinations will happen.

Three hallucination metrics

$u_r$

Tokenizer round-trip residual

$u_r = \lVert \hat z - \mathrm{Enc}(\mathrm{Dec}(\hat z)) \rVert$
How far a predicted latent moves when its decoded frame is re-encoded. On-manifold predictions survive the round trip; hallucinated ones drift.
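
As an illustration of the formula above, the following sketch computes $u_r$ for a toy tokenizer. The Euclidean norm, the identity decoder, and the clipping encoder are assumptions made for the example, not properties of our model.

```python
import math
from typing import Callable, List

Vector = List[float]

def round_trip_residual(z_hat: Vector,
                        enc: Callable[[Vector], Vector],
                        dec: Callable[[Vector], Vector]) -> float:
    """u_r = || z_hat - Enc(Dec(z_hat)) ||, here with the Euclidean norm."""
    return math.dist(z_hat, enc(dec(z_hat)))

# Toy tokenizer (assumption): the decoder is the identity, and the encoder can
# only represent values in [-1, 1], standing in for the range seen in training.
dec = lambda z: list(z)
enc = lambda x: [max(-1.0, min(1.0, v)) for v in x]

on_manifold = round_trip_residual([0.5, -0.2], enc, dec)   # survives the round trip
off_manifold = round_trip_residual([1.5, -0.2], enc, dec)  # pulled back to the edge
print(on_manifold, off_manifold)  # 0.0 0.5
```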

$u_f$

Flow instability

How much the denoiser's clean-frame prediction moves between Euler substeps. A well-conditioned step settles fast; an under-conditioned one keeps oscillating.
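
The description above does not pin down an exact formula, so the sketch below shows one plausible instantiation: the mean distance between consecutive clean-frame predictions across substeps. The function name, the averaging, and the Euclidean distance are all assumptions.

```python
import math
from typing import Sequence

def flow_instability(x0_preds: Sequence[Sequence[float]]) -> float:
    """Mean movement of the clean-frame prediction between consecutive substeps.

    x0_preds[k] is the denoiser's clean-frame prediction at Euler substep k.
    """
    if len(x0_preds) < 2:
        raise ValueError("need at least two substeps to measure movement")
    moves = [math.dist(a, b) for a, b in zip(x0_preds, x0_preds[1:])]
    return sum(moves) / len(moves)

# Toy prediction sequences (hypothetical numbers).
settled = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]      # settles immediately
oscillating = [[1.0, 0.0], [0.0, 0.0], [1.0, 0.0]]  # keeps jumping back and forth

u_f_settled = flow_instability(settled)
u_f_oscillating = flow_instability(oscillating)
print(u_f_settled, u_f_oscillating)  # 0.0 1.0
```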

$u_s$

Inter-seed variance

How much the next-latent prediction varies across independent denoising seeds. Concentrated predictions indicate a well-determined transition; dispersed ones indicate where rollouts diverge.
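
A minimal sketch of this quantity, assuming it is summarized as the per-dimension population variance averaged over latent dimensions (the exact reduction is an assumption, and the predictions below are made-up numbers):

```python
import statistics
from typing import Sequence

def inter_seed_variance(preds: Sequence[Sequence[float]]) -> float:
    """Average over latent dimensions of the variance across seeds.

    preds[k] is the next-latent prediction obtained with denoising seed k.
    """
    per_dim = [statistics.pvariance(dim) for dim in zip(*preds)]
    return statistics.fmean(per_dim)

concentrated = [[1.0, 2.0], [1.0, 2.0], [1.0, 2.0]]  # all seeds agree
dispersed = [[0.0, 2.0], [2.0, 2.0]]                 # seeds disagree on dimension 0

u_s_concentrated = inter_seed_variance(concentrated)
u_s_dispersed = inter_seed_variance(dispersed)
print(u_s_concentrated, u_s_dispersed)  # 0.0 0.5
```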

In practice we use dynamism-normalized variants $u^{\text{norm}} = u/m$, dividing out per-step scene motion $m$ so each hallucination predictor tracks uncertainty relative to how much is happening in the scene.
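
The effect of the normalization is easiest to see with numbers. In the sketch below, the `eps` floor that guards against division by zero in static scenes is an assumption of the example, as are the values.

```python
def dynamism_normalize(u: float, m: float, eps: float = 1e-8) -> float:
    """u_norm = u / m, with a small floor on m (the floor is an assumption)."""
    return u / max(m, eps)

# The same raw uncertainty reads very differently depending on scene motion.
calm_scene = dynamism_normalize(0.5, 0.25)  # little motion: large relative uncertainty
busy_scene = dynamism_normalize(0.5, 2.0)   # lots of motion: small relative uncertainty
print(calm_scene, busy_scene)  # 2.0 0.25
```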

Hallucination vs. state density

If we visualize hallucination as measured by our predictors (in this case tokenizer round-trip residual $u_r$) across the state space, a pattern becomes clear: $u_r$ is high exactly where there is low state density in the training data.

Panels: Point Maze · OGBench; Rocket Collect · MiniArcade; Cup Catch · DMControl; Lunar Lander · Box2D. Each panel shows trajectory state density (low to high) alongside the round-trip residual $u_r$ (hallucination, low to high).

Do these metrics predict model error?

Yes! Across 9k held-out sequences, each hallucination predictor tracks the realized rollout error at Spearman $\rho \approx 0.8$, without requiring any labels or additional training.

Each point corresponds to a held-out 24-frame trajectory. Hover a point to trace that same sequence across all three panels. The purple curve is the median; the dashed line marks the scene-divergence threshold ($\Delta$PSNR = 0).
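
For readers unfamiliar with Spearman's $\rho$: it is the Pearson correlation of the ranks, so it measures whether a predictor orders sequences the same way as the realized error does. The sketch below computes it from scratch; the five data points are made-up toy numbers, not our results.

```python
import math
from typing import List, Sequence

def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg_rank
        i = j + 1
    return out

def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; assumes neither input is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    return pearson(ranks(x), ranks(y))

predictor = [0.1, 0.4, 0.2, 0.9, 0.6]       # toy predictor values
rollout_error = [1.0, 3.0, 2.5, 10.0, 2.0]  # toy realized errors
rho = spearman(predictor, rollout_error)
print(round(rho, 3))  # 0.7
```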

Run-time detection of hallucination

Perhaps most interestingly, we can also use our three metrics to detect hallucination at run-time. In the following, we visualize (normalized) values for each of the three predictors as a function of time. Select a rollout below and watch the predictors reach their hallucination threshold as the rollout starts diverging. For comparison, we also include examples of rollouts where no hallucination occurs.

Per-frame round-trip residual, flow instability, and inter-seed variance from a single autoregressive model rollout, each shown relative to their hallucination threshold.
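
Run-time detection of this kind amounts to a per-step threshold check. The sketch below flags the first step at which any predictor reaches its threshold; the threshold values, the trace, and the choice of "any" rather than "all" are assumptions made for illustration.

```python
from typing import Dict, List, Optional

def first_flagged_step(trace: List[Dict[str, float]],
                       thresholds: Dict[str, float]) -> Optional[int]:
    """Index of the first step where any predictor reaches its threshold."""
    for t, values in enumerate(trace):
        if any(values[name] >= thresholds[name] for name in thresholds):
            return t
    return None  # the rollout stayed below every threshold

# Hypothetical thresholds and per-step predictor values.
thresholds = {"u_r": 0.5, "u_f": 0.3, "u_s": 0.2}
trace = [
    {"u_r": 0.10, "u_f": 0.05, "u_s": 0.02},
    {"u_r": 0.20, "u_f": 0.10, "u_s": 0.05},
    {"u_r": 0.60, "u_f": 0.25, "u_s": 0.10},  # u_r crosses its threshold here
]

flagged = first_flagged_step(trace, thresholds)
stable = first_flagged_step(trace[:2], thresholds)
print(flagged, stable)  # 2 None
```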

Closing the gap at training time

If every failure mode is a coverage gap, one data-centric lever should move all three at once. We resample the existing corpus to be uniform across tasks rather than frames, upweighting under-represented tasks and improving results at no additional cost.

Rollout $\Delta$PSNR (dB): +0.88
Recon PSNR (dB): +0.44
Action-shuffle ratio: +0.29
$u_r^{\text{norm}}$ (lower better): −0.20
$u_s^{\text{norm}}$ (lower better): −0.14
$u_f^{\text{norm}}$ (lower better): −0.07

Mean change vs. the base model on held-out trajectories across 200 tasks, applying coverage-aware training to both tokenizer and dynamics. We observe sizable improvements in model quality, and all three hallucination predictors decrease.
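
To see why sampling uniformly across tasks rather than frames matters, consider two tasks with the shortest and longest episode lengths in the corpus (25 and 1,000 steps). The sketch below compares the two sampling schemes; the count of 100 trajectories per task and both function names are hypothetical.

```python
from typing import Dict

def task_sampling_probs(frames_per_task: Dict[str, int],
                        uniform_over: str) -> Dict[str, float]:
    """Probability that a sampled frame comes from each task."""
    if uniform_over == "frames":
        total = sum(frames_per_task.values())
        return {t: n / total for t, n in frames_per_task.items()}
    if uniform_over == "tasks":
        return {t: 1.0 / len(frames_per_task) for t in frames_per_task}
    raise ValueError("uniform_over must be 'frames' or 'tasks'")

def per_frame_weights(frames_per_task: Dict[str, int]) -> Dict[str, float]:
    """Weight of a single frame so that every task carries equal total mass."""
    num_tasks = len(frames_per_task)
    return {t: 1.0 / (num_tasks * n) for t, n in frames_per_task.items()}

# Equal trajectory counts (100 is hypothetical), very different episode lengths.
frames = {"short_task": 100 * 25, "long_task": 100 * 1000}

by_frames = task_sampling_probs(frames, "frames")
by_tasks = task_sampling_probs(frames, "tasks")
weights = per_frame_weights(frames)

print(round(by_frames["short_task"], 4))  # 0.0244
print(by_tasks["short_task"])             # 0.5
```

Under frame-uniform sampling, the short task is seen in only about 2.4% of samples; under task-uniform sampling, each of its frames is upweighted 40-fold relative to a frame of the long task.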

Hallucination-guided data collection

Since hallucination is found to be a data coverage gap, a simple yet effective strategy is to correct model error via targeted data collection. During live environment interaction, we roll out candidate trajectories in the world model, score them by predicted hallucination, and execute the most hallucination-prone one. This allows us to reduce model hallucination with just 50 trajectories per task collected autonomously.
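
The selection step can be sketched as follows: imagine each candidate action sequence in the world model, score the imagined latents with a hallucination predictor, and pick the highest-scoring candidate. All names, the toy 1D dynamics, and the use of the mean score over the rollout are assumptions of this sketch.

```python
from typing import Callable, List, Sequence

Vector = List[float]

def most_hallucination_prone(
    candidates: Sequence[Sequence[float]],
    imagine: Callable[[Sequence[float]], List[Vector]],
    score: Callable[[Vector], float],
) -> Sequence[float]:
    """Return the candidate whose imagined rollout has the highest mean score."""
    def mean_score(actions: Sequence[float]) -> float:
        latents = imagine(actions)
        return sum(score(z) for z in latents) / len(latents)
    return max(candidates, key=mean_score)

def imagine(actions: Sequence[float]) -> List[Vector]:
    """Toy 1D world model (hypothetical): each action shifts the position."""
    pos, latents = 0.0, []
    for a in actions:
        pos += a
        latents.append([pos])
    return latents

# Toy predictor (hypothetical): uncertainty grows away from the origin,
# standing in for distance from well-covered training states.
score = lambda z: abs(z[0])

candidates = [[0.0, 0.0, 0.0], [1.0, -1.0, 1.0], [1.0, 1.0, 1.0]]
chosen = most_hallucination_prone(candidates, imagine, score)
print(chosen)  # [1.0, 1.0, 1.0]
```

The chosen sequence would then be executed in the live environment, and the resulting trajectory added to the finetuning data.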

Ground truth
Base model
After targeted collection

Point Maze (OGBench), an unseen layout. The base model quietly drifts to a seen layout; targeted data collection and finetuning recover the true geometry.

Dungeon Explorer (MiniArcade), an unseen transfer task. The base model drifts into a scene from a visually similar Atari game; after finetuning with just 50 trajectories the model faithfully models the dynamics of this new task.

Cup Catch (DMControl), an unseen variant. The base model hallucinates visuals seen during training; finetuning restores the correct visuals.

Reacher Easy (MiniArcade), an unseen task. The base model dissolves the scene entirely; finetuning restores the arm and its target, but the dynamics prediction still diverges eventually.

These clips are qualitative; the finetuned model simply looks right. However, what matters in the end is whether a world model is good enough to act with. To answer this, we evaluate closed-loop planning (MPC) performance with each of 6 finetuned models, varying only the data source (the policy used to collect those 50 trajectories). If curiosity-based data collection using our proposed hallucination predictors is effective, its downstream task performance should approach privileged data collection strategies that rely on humans or expert policies.

Adapting to 10 unseen tasks. Downstream task performance (closed-loop MPC) after finetuning on 50 trajectories per task, varying only the behavior policy used for data collection. Curiosity using $u_r^{\text{norm}}$ reaches ~90% of the expert/human oracles.
Random actions (no model): 0.118
Zero-shot transfer: 0.276
After finetuning on 50 trajectories per task, varying only data source:
  No-op policy: 0.163
  Random policy: 0.228
  Curiosity $u_r^{\text{norm}}$ (ours): 0.325
  Expert policy: 0.362
  Human play: 0.362
  All sources combined: 0.390
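
The "~90% of the oracles" figure follows directly from the reported scores, as this short calculation shows (the dictionary keys are just labels for the rows above):

```python
scores = {
    "Zero-shot transfer": 0.276,
    "Curiosity (ours)": 0.325,
    "Expert policy": 0.362,
    "Human play": 0.362,
}

oracle = max(scores["Expert policy"], scores["Human play"])
fraction_of_oracle = scores["Curiosity (ours)"] / oracle
gain_over_zero_shot = scores["Curiosity (ours)"] - scores["Zero-shot transfer"]

print(f"{fraction_of_oracle:.3f}")    # 0.898
print(f"{gain_over_zero_shot:+.3f}")  # +0.049
```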

Why do the data collection strategies rank the way they do? We find that it is, yet again, related to coverage. The figure below shows trajectories collected via each policy on a maze with a central bottleneck. Curiosity tends to target walls, which is exactly where the model has been found to hallucinate, for example by letting the agent penetrate them.

Data coverage by collection policy. State density over 50 collected trajectories on Point Maze (OGBench), an H-shaped layout joined by a single bottleneck corridor. Darker cells are visited more often.
No-op: agent barely moves, almost no coverage
Random: dense in the two chambers but rarely crosses the bottleneck
Expert $\pi$: clean goal-directed paths through the bottleneck, sparse elsewhere
Curiosity (ours): broad coverage of both chambers and the bottleneck
Human: dense, broad coverage everywhere
Color scale: less visited to more visited

In summary, we show that finetuning with just 50 trajectories can greatly improve world modeling in both seen and unseen tasks, and that a majority of these gains can be realized autonomously via curiosity-based exploration targeting hallucination-prone regions.

100% open

To support further research on generative world modeling, we release:

Citation

If you find our work useful, please consider citing the paper:

@misc{Hansen2026Hallucination,
  title={Hallucination in World Models is Predictable and Preventable},
  author={Nicklas Hansen and Xiaolong Wang},
  year={2026}
}

References

A selection of key works this paper builds on, ordered by year. The complete bibliography is available in our paper.

  1. Pathak et al. (2017). Curiosity-Driven Exploration by Self-Supervised Prediction. ICML.
  2. Lakshminarayanan et al. (2017). Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles. NeurIPS.
  3. Ha & Schmidhuber (2018). World Models. NeurIPS.
  4. Hafner et al. (2020). Dream to Control: Learning Behaviors by Latent Imagination. ICLR.
  5. Sekar et al. (2020). Planning to Explore via Self-Supervised World Models. ICML.
  6. Hansen & Wang (2021). Generalization in Reinforcement Learning by Soft Data Augmentation. ICRA.
  7. He et al. (2022). Masked Autoencoders Are Scalable Vision Learners. CVPR.
  8. Ji et al. (2023). Survey of Hallucination in Natural Language Generation. ACM Computing Surveys.
  9. Hansen, Su & Wang (2024). TD-MPC2: Scalable, Robust World Models for Continuous Control. ICLR.
  10. Huang et al. (2024). VBench: Comprehensive Benchmark Suite for Video Generative Models. CVPR.
  11. Bruce et al. (2024). Genie: Generative Interactive Environments. ICML.
  12. Valevski et al. (2024). Diffusion Models Are Real-Time Game Engines. arXiv:2408.14837.
  13. Hafner, Yan & Lillicrap (2025). Training Agents Inside of Scalable World Models. arXiv:2509.24527.
  14. Hansen, Su & Wang (2026). Learning Massively Multitask World Models for Continuous Control. ICLR.