To what extent do vision-and-language foundation models possess a realistic world model (observation × action → observation) and a dynamics model (observation × observation → action) when actions are expressed through language? While open-source foundation models struggle with both, we find that fine-tuning them to acquire a dynamics model through supervision is significantly easier than acquiring a world model. In turn, dynamics models can be used to bootstrap world models through two main strategies: 1) weakly supervised learning from synthetic data and 2) inference-time verification. First, the dynamics model can annotate actions for unlabelled pairs of video-frame observations to expand the training data. We further propose a loss-weighting mechanism that weights each image token by its importance as predicted by a recognition model. Second, the dynamics model can assign rewards to multiple samples drawn from the world model, effectively guiding search at inference time. We evaluate the world models resulting from both strategies through the task of action-centric image editing on AURORA-BENCH. Our best model achieves performance competitive with state-of-the-art image editing models, surpassing them by a margin of 15% on real-world subsets according to GPT4o-as-judge, and achieving the best average human evaluation across all subsets of AURORA-BENCH.
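To make the two bootstrapping strategies more concrete, the following is a minimal PyTorch-style sketch of (i) weak supervision via dynamics-model annotation of unlabelled frame pairs, (ii) an importance-weighted image-token loss, and (iii) inference-time verification via best-of-N sampling. The interfaces `world_model`, `dynamics_model`, `dynamics_reward`, and the per-token `importance` scores are hypothetical placeholders for illustration, not the released implementation.

```python
# Hypothetical sketch of the bootstrapping strategies; all model interfaces
# below are placeholders, not the paper's actual code.
from typing import Callable, List, Tuple

import torch
import torch.nn.functional as F


def annotate_frame_pairs(
    frame_pairs: List[Tuple[torch.Tensor, torch.Tensor]],
    dynamics_model: Callable[[torch.Tensor, torch.Tensor], str],
) -> List[Tuple[torch.Tensor, str, torch.Tensor]]:
    """Weak supervision: the dynamics model labels unlabelled (source, target)
    frame pairs with an action, yielding synthetic (source, action, target) triples."""
    return [(src, dynamics_model(src, tgt), tgt) for src, tgt in frame_pairs]


def weighted_image_token_loss(
    logits: torch.Tensor,       # (seq_len, vocab_size) predicted image-token logits
    targets: torch.Tensor,      # (seq_len,) ground-truth image-token ids
    importance: torch.Tensor,   # (seq_len,) importance scores from a recognition model
) -> torch.Tensor:
    """Cross-entropy over image tokens, re-weighted by predicted token importance."""
    per_token = F.cross_entropy(logits, targets, reduction="none")  # (seq_len,)
    weights = importance / importance.sum().clamp_min(1e-8)         # normalise weights
    return (weights * per_token).sum()


def best_of_n_edit(
    source: torch.Tensor,
    action: str,
    world_model: Callable[[torch.Tensor, str], torch.Tensor],
    dynamics_reward: Callable[[torch.Tensor, torch.Tensor, str], float],
    n_samples: int = 8,
) -> torch.Tensor:
    """Inference-time verification: sample several candidate target observations
    from the world model and keep the one the dynamics model scores highest
    for the requested action."""
    candidates = [world_model(source, action) for _ in range(n_samples)]
    scores = torch.tensor([dynamics_reward(source, c, action) for c in candidates])
    return candidates[int(scores.argmax())]
```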
We present qualitative examples from AURORA-BENCH and observe several common failure modes in image editing models. First, some models fail to preserve the scene from the source observation (e.g., PixInstruct on Action-Genome and MagicBrush). Second, some generate a near-identical copy of the source as the target (e.g., GoT on Something-Something). Third, producing realistic outputs remains difficult, as seen in GoT's result on Kubric. Finally, maintaining object consistency is also a challenge: SmartEdit alters the object in WhatsUp, and CWM does so in Something-Something. Despite these challenges, we also observe several positive editing behaviours from CWM. On Action-Genome, CWM correctly predicts spatial changes, such as opening and closing a drawer, which requires a strong understanding of spatial concepts. On Something-Something, it is the only model to accurately capture the spatial concept of "falling down." On Kubric, it demonstrates basic counting ability by correctly adding one keyboard. On WhatsUp, CWM correctly grounds the action to the laptop, while other models mistakenly edit the monitor.
We present a real-world example demonstrating that CWM's predicted observations can be guided by actions expressed in language. CWM is also capable of iteratively generating future observations over multiple steps while maintaining consistency with previous frames.
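This iterative setting can be sketched as a simple rollout loop under the same hypothetical `world_model(observation, action) -> observation` interface assumed above: each predicted observation is fed back as the source for the next action, which is what keeps later frames conditioned on earlier ones.

```python
# Hypothetical multi-step rollout; `world_model` is a placeholder interface.
from typing import Callable, List

import torch


def rollout(
    source: torch.Tensor,
    actions: List[str],
    world_model: Callable[[torch.Tensor, str], torch.Tensor],
) -> List[torch.Tensor]:
    """Apply actions one at a time, feeding each predicted observation
    back in as the source frame for the next step."""
    frames, current = [], source
    for action in actions:
        current = world_model(current, action)
        frames.append(current)
    return frames
```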
While our approach is effective across AURORA-BENCH, we highlight several key limitations: