The VLM branch reasons at a higher semantic level.
Acting as a cortex-like reasoner, the VLM captures high-level semantics, general world knowledge, and long-horizon intent from frames sampled uniformly at a larger temporal stride.
- Understands what is happening, not only how pixels move.
- Brings in broader world knowledge learned from large-scale multimodal pretraining.
- Guides the predictor toward intent, context, and semantics over longer horizons.
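The stride-based sampling described above can be sketched as follows. This is a minimal illustration, not the source's implementation: the function name, frame counts, and stride values are hypothetical, and the real system would pass the selected frames through the VLM rather than just compute indices.

```python
import numpy as np

def sample_frame_indices(num_frames: int, stride: int) -> np.ndarray:
    """Uniformly sample frame indices with a fixed temporal stride.

    A larger stride gives the VLM branch sparser, longer-horizon
    coverage of the clip than the dense view the predictor sees.
    """
    if stride < 1:
        raise ValueError("stride must be >= 1")
    return np.arange(0, num_frames, stride)

# Hypothetical setup: a 64-frame clip. The predictor consumes every
# frame (stride 1); the VLM branch sees every 8th frame, covering the
# same temporal span with far fewer tokens.
predictor_idx = sample_frame_indices(64, stride=1)  # 64 indices
vlm_idx = sample_frame_indices(64, stride=8)        # 8 indices
```

Both index sets span the full clip, so the VLM's sparse view still covers the same horizon; it simply trades temporal resolution for semantic context.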