What is a rollout
A rollout is a video of how a scene evolves under a sequence of robot actions. The model takes one starting frame and a planned action sequence; it returns the predicted future frames.
```
start_frame ──┐
actions[0..T] ─┴─► [ Dream Engine ] ──► video[0..T]
```

You're not running a physics simulator, and you're not running a neural-network policy. You're running a learned world model: it predicts the dynamics of the visual world directly, conditioned on what the robot does next.
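In code, the contract looks roughly like this. A minimal sketch: the `DreamEngine` class and `predict` method are hypothetical stand-ins for the client, not the published SDK surface; the actual HTTP contract is documented under Wire shape below.

```python
import numpy as np
from PIL import Image

from dream_engine import DreamEngine  # hypothetical import, for illustration only

model = DreamEngine("dreamdojo-gr1")                   # assumed spec identifier
start_frame = np.asarray(Image.open("start.png"))      # (480, 640, 3) uint8 RGB
actions = np.zeros((48, 384), dtype=np.float32)        # (T, 384) planned action sequence

video = model.predict(start_frame, actions, seed=42)   # predicted frames [0..T]
```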
Why it's interesting
- Fast model-predictive control. Sample K candidate action sequences, score the rollouts, pick the best. `predict_batch` does this in one fused forward pass (see the sketch after this list).
- Cheap data augmentation. Take a real teleop start frame, perturb the action sequence, and get a synthetic-but-physically-plausible trajectory. Bit-stable on Hopper means the augmented set is reproducible.
- Sim-to-real evaluation. Score a robot policy against rollouts before deploying to hardware. Three-metric quality gate (PSNR / SSIM / LPIPS) catches silent regressions.
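A minimal sketch of that MPC loop, assuming a Python client whose `predict_batch` takes a stacked `(K, T, 384)` action array and returns K decoded frame arrays (the real signature may differ):

```python
import numpy as np

K, T, ACTION_DIM = 16, 48, 384

def score(video: np.ndarray, goal_frame: np.ndarray) -> float:
    # Toy objective: negative L2 distance between the final predicted
    # frame and a goal image. A real gate would score with PSNR/SSIM/LPIPS.
    diff = video[-1].astype(np.float32) - goal_frame.astype(np.float32)
    return -float(np.linalg.norm(diff))

def plan(model, start_frame, goal_frame, rng):
    # Sample K candidate action sequences (here: noise around zero;
    # in practice, perturbations of a nominal plan).
    candidates = rng.normal(0.0, 0.1, size=(K, T, ACTION_DIM)).astype(np.float32)
    # One fused forward pass over all K rollouts.
    videos = model.predict_batch(start_frame, candidates, seed=0)
    best = max(range(K), key=lambda k: score(videos[k], goal_frame))
    return candidates[best]
```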
How a rollout is generated
The exact internals depend on the spec — different model families (rectified-flow, block-diffusion, causal-video) use different backbones. As a representative example, the current active spec (DreamDojo · GR-1) works like this:
- Encode the start frame with a video tokenizer into a low-resolution latent grid.
- Tokenize the conditioning inputs. Each action vector is projected to a token, conditioning the diffusion model alongside the visual latents.
- Chunked generation. The model generates frames in chunks (per `model.chunk_size`). Each chunk runs N diffusion steps; the runner threads the prior chunk's last latent through so the visual stream stays continuous. A schematic sketch of this loop follows the list.
- Decode. The tokenizer's video decoder maps the generated latents back to RGB pixels. The runner emits the result as an mp4 (h.264 baseline).
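The chunk loop, as a schematic sketch. Names like `denoise` and `decode` are illustrative; the runner's internals are not part of the public API.

```python
import numpy as np

def rollout(model, start_latent, action_tokens, total_frames):
    # Schematic chunk loop: each chunk is denoised for the spec's number
    # of diffusion steps, conditioned on its slice of action tokens and
    # on the previous chunk's last latent for visual continuity.
    chunks, prev_latent = [], start_latent
    for t0 in range(0, total_frames, model.chunk_size):
        cond = action_tokens[t0 : t0 + model.chunk_size]
        chunk = model.denoise(prev_latent, cond)   # N diffusion steps inside
        prev_latent = chunk[-1]                    # thread the last latent forward
        chunks.append(chunk)
    return model.decode(np.concatenate(chunks))    # latents -> RGB frames
```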
For DreamDojo · GR-1 specifically: WAN2.1 tokenizer, chunk_size=12,
35 diffusion steps per chunk, ~2.6 s end-to-end on H100 SXM 80GB for a
48-frame rollout. Other specs in the catalog will publish their own
numbers; query model.spec.arch for the live values.
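For example (the loader call is illustrative; the attribute names are the ones used on this page):

```python
import dream_engine  # assumed package name

model = dream_engine.load("dreamdojo-gr1")  # hypothetical loader
print(model.spec.arch)    # live architecture values for the active spec
print(model.chunk_size)   # 12 for DreamDojo · GR-1
print(model.resolution)   # trained resolution, e.g. (480, 640)
```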
Wire shape
The HTTP `POST /v1/predict` request is multipart:

| Field | Content | What it is |
|---|---|---|
| `frame` | PNG bytes | the starting frame, 480×640 RGB |
| `actions` | numpy `.npy` bytes | `(T, 384)` float32, the planned action sequence |
| `seed` | string | deterministic seed |
| `num_steps` | string (optional) | override diffusion steps |
| `guidance` | string (optional) | override classifier-free guidance |
Response body: the raw mp4. Metadata is returned in `X-DreamEngine-*` response headers.
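Putting that together with `requests` (the host is a placeholder; field names and types follow the table above):

```python
import io

import numpy as np
import requests

actions = np.zeros((48, 384), dtype=np.float32)   # T=48, a multiple of chunk_size=12
buf = io.BytesIO()
np.save(buf, actions)                              # serialize actions as .npy bytes

with open("start.png", "rb") as frame:
    resp = requests.post(
        "https://dream-engine.example/v1/predict",     # placeholder host
        files={
            "frame": ("frame.png", frame, "image/png"),
            "actions": ("actions.npy", buf.getvalue()),
        },
        data={"seed": "42"},   # num_steps / guidance would go here as strings
        timeout=60,
    )
resp.raise_for_status()

with open("rollout.mp4", "wb") as out:
    out.write(resp.content)   # response body is the raw mp4

# Collect whatever metadata the server attached.
meta = {k: v for k, v in resp.headers.items()
        if k.lower().startswith("x-dreamengine-")}
```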
Limits and gotchas
- `T` must be a multiple of `model.chunk_size`. Query the model for the right number; DreamDojo · GR-1 uses 12, others may differ. The SDK validates this at the boundary before the request hits the network (a guard in that spirit is sketched after this list). See Frames, chunks, fps.
- Resolution is fixed at the spec level. Each spec has a single trained resolution, exposed via `model.resolution`.
- Determinism. Same `seed`, same inputs → bit-identical mp4 on the same hardware. Across hardware classes (e.g. L40S, B200) the bytes will differ, but the visual content is the same.
- No streaming yet. v0.1.0 returns the whole mp4 in one HTTP response. SSE-style frame streaming is on the roadmap.
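A small client-side guard in the spirit of the SDK's boundary check (illustrative; the real SDK raises its own error type):

```python
def check_horizon(T: int, chunk_size: int) -> None:
    # Documented constraint: T must be a multiple of the model's chunk
    # size (12 for DreamDojo · GR-1; query model.chunk_size for others).
    if T % chunk_size:
        lo = T - T % chunk_size
        raise ValueError(
            f"T={T} is not a multiple of chunk_size={chunk_size}; "
            f"nearest valid horizons are {lo} and {lo + chunk_size}"
        )
```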