Dream Engines

What is a rollout

A rollout is a video of how a scene evolves under a sequence of robot actions. The model takes one starting frame and a planned action sequence; it returns the predicted future frames.

   start_frame  ──┐
   actions[0..T] ─┴─►  [ Dream Engine ]  ──►  video[0..T]
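In client terms, that contract looks roughly like the sketch below. It assumes a hypothetical dream_engine Python client — every name in it is illustrative, not the real SDK surface:

    import numpy as np
    from PIL import Image

    # Hypothetical client; every name here is illustrative, not the real SDK.
    from dream_engine import DreamEngine

    model = DreamEngine.load("dreamdojo/gr-1")

    start_frame = np.asarray(Image.open("frame_0.png"))  # (480, 640, 3) uint8 RGB
    actions = np.zeros((48, 384), dtype=np.float32)      # T=48 planned action vectors

    video = model.predict(start_frame, actions, seed=0)  # predicted frames [0..T]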

You're not running a physics simulator and you're not running a neural-network policy. You're running a learned world model — it predicts the dynamics of the visual world directly, conditioned on what the robot does next.

Why it's interesting

  • Fast model-predictive control. Sample K candidate action sequences, score the rollouts, pick the best. predict_batch does this in one fused forward pass (see the sketch after this list).
  • Cheap data augmentation. Take a real teleop start frame, perturb the action sequence, get a synthetic-but-physically-plausible trajectory. Bit-stable on Hopper means the augmented set is reproducible.
  • Sim-to-real evaluation. Score a robot policy against rollouts before deploying to hardware. Three-metric quality gate (PSNR / SSIM / LPIPS) catches silent regressions.
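A sketch of that MPC loop. Only predict_batch is taken from this page; the model handle, scoring function, and all other names are assumptions:

    import numpy as np

    def plan(model, start_frame, nominal_actions, score_fn, K=64, noise=0.05, seed=0):
        """Sample K perturbed action sequences, roll them all out in one
        fused predict_batch call, and return the highest-scoring sequence
        (simple random-shooting MPC)."""
        rng = np.random.default_rng(seed)
        T, A = nominal_actions.shape
        perturbations = noise * rng.standard_normal((K, T, A)).astype(np.float32)
        candidates = nominal_actions[None] + perturbations  # (K, T, A)

        # One fused forward pass over all K candidates (per the bullet above).
        videos = model.predict_batch(start_frame, candidates, seed=seed)

        scores = np.array([score_fn(v) for v in videos])  # e.g. distance-to-goal
        return candidates[int(np.argmax(scores))]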

How a rollout is generated

The exact internals depend on the spec — different model families (rectified-flow, block-diffusion, causal-video) use different backbones. As a representative example, the current active spec (DreamDojo · GR-1) works like this (sketched in pseudocode after the list):

  1. Encode the start frame with a video tokenizer into a low-resolution latent grid.
  2. Tokenize the conditioning inputs. Each action vector is projected to a token, conditioning the diffusion model alongside the visual latents.
  3. Chunked generation. The model generates frames in chunks (per model.chunk_size). Each chunk runs N diffusion steps; the runner threads the prior chunk's last latent so the visual stream stays continuous.
  4. Decode. The tokenizer's video decoder maps the generated latents back to RGB pixels. The runner emits the result as an mp4 (H.264 baseline).
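Put together, the four steps look roughly like the pseudocode below. The runner's internals are not exposed; every name here is illustrative:

    import numpy as np

    def generate_rollout(tokenizer, diffusion, start_frame, actions,
                         chunk_size, num_steps):
        """Illustrative sketch of the runner loop; not the actual internals."""
        latent = tokenizer.encode(start_frame)              # 1. low-res latent grid
        action_tokens = diffusion.project_actions(actions)  # 2. one token per action

        chunks = []
        for c in range(len(actions) // chunk_size):         # 3. chunked generation
            cond = action_tokens[c * chunk_size : (c + 1) * chunk_size]
            # N diffusion steps per chunk; the prior chunk's last latent is
            # threaded through so the visual stream stays continuous.
            chunk_latents = diffusion.sample(prior=latent, cond=cond, steps=num_steps)
            latent = chunk_latents[-1]
            chunks.append(chunk_latents)

        return tokenizer.decode(np.concatenate(chunks))     # 4. latents -> RGB frames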

For DreamDojo · GR-1 specifically: WAN2.1 tokenizer, chunk_size=12, 35 diffusion steps per chunk, ~2.6 s end-to-end on H100 SXM 80GB for a 48-frame rollout. Other specs in the catalog will publish their own numbers; query model.spec.arch for the live values.
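Assuming the hypothetical client from the first sketch, reading the live values looks like:

    # model.spec.arch and model.chunk_size are the attributes named above.
    print(model.spec.arch)    # live architecture values for the active spec
    print(model.chunk_size)   # 12 for DreamDojo · GR-1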

Wire shape

The HTTP POST /v1/predict request is multipart:

  Field       Content             What it is
  ─────────────────────────────────────────────────────────────────────────
  frame       PNG bytes           the starting frame, 480×640 RGB
  actions     numpy .npy bytes    (T, 384) float32, the planned action sequence
  seed        string              deterministic seed
  num_steps   string (optional)   override diffusion steps
  guidance    string (optional)   override classifier-free guidance

Response body: raw mp4. Metadata in X-DreamEngine-* response headers.
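A minimal client for this request using the requests library; the base URL is a placeholder and auth is omitted:

    import io
    import numpy as np
    import requests

    # Serialize the planned action sequence as .npy bytes, per the table above.
    actions = np.zeros((48, 384), dtype=np.float32)
    buf = io.BytesIO()
    np.save(buf, actions)

    resp = requests.post(
        "https://api.example.com/v1/predict",  # placeholder base URL
        files={
            "frame": ("frame.png", open("frame_0.png", "rb").read(), "image/png"),
            "actions": ("actions.npy", buf.getvalue(), "application/octet-stream"),
        },
        data={"seed": "0", "num_steps": "35"},  # optional overrides go as strings
    )
    resp.raise_for_status()

    with open("rollout.mp4", "wb") as f:
        f.write(resp.content)  # response body is the raw mp4

    meta = {k: v for k, v in resp.headers.items()
            if k.startswith("X-DreamEngine-")}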

Limits and gotchas

  • T must be a multiple of model.chunk_size. Query the model for the right number — DreamDojo · GR-1 uses 12, others may differ. The SDK validates this at the boundary before the request hits the network (a sketch of an equivalent check follows this list). See Frames, chunks, fps.
  • Resolution is fixed at the spec level. Each spec has a single trained resolution exposed via model.resolution.
  • Determinism. Same seed, same inputs → bit-identical mp4 on the same hardware. Across hardware classes (e.g. L40S, B200) the bytes will differ but the visual content is the same.
  • No streaming yet. v0.1.0 returns the whole mp4 in one HTTP response. SSE-style frame streaming is on the roadmap.
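The chunk-size check in the first bullet is easy to mirror client-side. A sketch, assuming you have already queried model.chunk_size:

    def validate_horizon(T: int, chunk_size: int) -> None:
        """Reject action sequences whose length is not a multiple of
        chunk_size, mirroring the SDK's boundary check described above."""
        if T % chunk_size != 0:
            lower = T - T % chunk_size
            raise ValueError(
                f"T={T} is not a multiple of chunk_size={chunk_size}; "
                f"nearest valid horizons are {lower} and {lower + chunk_size}"
            )

    validate_horizon(48, 12)  # OK for DreamDojo · GR-1 (chunk_size=12)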