The VLM branch reasons at a higher semantic level.
Acting as a cortex-like reasoner, the VLM captures high-level semantics, general world knowledge, and long-horizon intent from frames sampled uniformly at a larger temporal stride.
- Understands what is happening, not only how pixels move.
- Brings in broader world knowledge learned from large-scale multimodal pretraining.
- Guides the predictor toward intent, context, and semantics over longer horizons.
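The stride-based sampling described above can be sketched as follows. This is a minimal illustration, not the source's implementation: the function name, frame counts, and stride values are hypothetical, and the real system would pass the selected frames through the VLM rather than just compute indices.

```python
import numpy as np

def sample_frame_indices(num_frames: int, stride: int) -> np.ndarray:
    """Uniformly sample frame indices with a fixed temporal stride.

    A larger stride gives the VLM branch sparser, longer-horizon
    coverage of the clip than the dense view the predictor sees.
    """
    if stride < 1:
        raise ValueError("stride must be >= 1")
    return np.arange(0, num_frames, stride)

# Hypothetical setup: a 64-frame clip. The predictor consumes every
# frame (stride 1); the VLM branch sees every 8th frame, covering the
# same temporal span with far fewer tokens.
predictor_idx = sample_frame_indices(64, stride=1)  # 64 indices
vlm_idx = sample_frame_indices(64, stride=8)        # 8 indices
```

Both index sets span the full clip, so the VLM's sparse view still covers the same horizon; it simply trades temporal resolution for semantic context.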