TriVLA: A Triple-System-Based Unified Vision-Language-Action Model with Episodic World Modeling for General Robot Control

1Fudan University, 2Shanghai Innovation Institute

TriVLA is the first framework to formalize an episodic world model within a unified triple-system architecture, drawing inspiration from cognitive neuroscience theories of episodic memory. By integrating multimodal grounding and rich temporal dynamics, TriVLA provides high-level reasoning and dynamic prediction, enabling robots to accumulate, recall, and predict sequential experiences. Experiments show that TriVLA operates efficiently, consistently outperforms state-of-the-art policy baselines, and significantly improves long-horizon reasoning, sample efficiency, and open-ended goal achievement. These results highlight the potential of episodic world-model reasoning as a solid foundation for robust and generalizable robot control systems.

Teaser Image

TriVLA is a unified Vision-Language-Action framework that adopts a triple-system architecture inspired by the episodic world model. Image and language inputs are processed by a Vision-Language Model for multimodal perception. A Video Diffusion Model provides dynamic world modeling and future prediction. The policy module integrates sequential outputs, robot state, and action history and generates real-time actions for complex manipulation tasks.
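The project code is not reproduced on this page, so the following minimal PyTorch sketch only illustrates the data flow described in the caption above; the class name, the stand-in linear encoders, and all dimensions are assumptions for exposition, not the authors' implementation.

```python
import torch
import torch.nn as nn


class TriVLASketch(nn.Module):
    """Illustrative composition of the three systems; dimensions are made up."""

    def __init__(self, feat_dim=512, state_dim=14, action_dim=7, horizon=16):
        super().__init__()
        # Stand-ins for the real backbones (VLM / video diffusion model).
        self.vlm = nn.Linear(feat_dim, feat_dim)      # System 2: multimodal perception
        self.vdm = nn.Linear(feat_dim, feat_dim)      # System 3: dynamics prediction
        self.policy = nn.Sequential(                  # System 1: action expert
            nn.Linear(2 * feat_dim + state_dim + horizon * action_dim, feat_dim),
            nn.GELU(),
            nn.Linear(feat_dim, horizon * action_dim),
        )
        self.horizon, self.action_dim = horizon, action_dim

    def forward(self, vision_language_feat, robot_state, action_history):
        semantic = self.vlm(vision_language_feat)     # grounded instruction context
        dynamics = self.vdm(vision_language_feat)     # anticipated scene evolution
        context = torch.cat(
            [semantic, dynamics, robot_state, action_history.flatten(1)], dim=-1
        )
        return self.policy(context).view(-1, self.horizon, self.action_dim)


# Example: one forward pass on random inputs, producing a chunk of 16 actions.
model = TriVLASketch()
actions = model(torch.randn(2, 512), torch.randn(2, 14), torch.randn(2, 16, 7))
```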

Abstract

Recent advances in vision–language models (VLMs) have enabled robots to follow open-ended instructions and demonstrate impressive commonsense reasoning. However, current vision–language–action (VLA) frameworks primarily rely on static representations and limited temporal context, restricting agents to short-horizon, reactive behaviors and hindering robust generalization in dynamic embodied environments. Inspired by cognitive neuroscience theories of episodic memory, we are, to our knowledge, among the first to introduce a formalized episodic world model in VLA, enabling embodied robots to accumulate, recall, and predict sequential experiences. As an instantiation of this concept, our unified TriVLA realizes the episodic world model through a triple-system architecture: integrating multimodal grounding from a pretrained VLM (System 2) and temporally rich dynamics perception from a video diffusion model (System 3). This enables the agent to accumulate and recall sequential experiences, interpret current contexts, and predict future environmental evolution. Guided by episodic representations that span both the past and anticipated future, the downstream policy (System 1) generates coherent, context-aware action sequences through flow-matching and cross-modal attention mechanisms. Experimental results show that TriVLA operates efficiently at ~36 Hz and consistently outperforms baseline models on standard benchmarks and challenging real-world manipulation tasks. It demonstrates strong long-horizon planning and open-ended intent understanding, showcasing the advantages of episodic world model-inspired reasoning for robust, generalizable robot intelligence.

Comparison between previous dual-system architectures and our triple-system approach.

Introduction Image

Comparison between dual-system architectures and our episodic world model-guided TriVLA. TriVLA implements the episodic world model using a triple-system architecture. In contrast, previous dual-system methods rely on static representations and limited temporal context, which restricts agents to short-horizon, reactive behaviors in dynamic environments.

Main Contributions:

  • A Unified Vision-Language-Action Framework: We propose a unified Vision-Language-Action model that integrates world knowledge and a world model for general policy learning across multiple robot embodiments.
  • Triple-System Compositional Architecture: TriVLA introduces a novel triple-system compositional architecture that combines high-level reasoning with dynamic predictive representations, enabling the robot to handle more complex prompts and long-horizon manipulation tasks.
  • State-of-the-Art Performance: TriVLA outperforms baseline algorithms across simulated and real-world settings, including scenarios requiring novel combinations of skills seen during training, demonstrating its effectiveness in both alignment with human intent and long-horizon task success.

The pipeline of our TriVLA.

Pipeline Image

The pipeline of TriVLA. TriVLA is a unified Vision-Language-Action framework built on a triple-system paradigm. System 2 employs a pre-trained Eagle-2 VLM for episodic multimodal perception, while System 3 utilizes a general-purpose VDM to model episodic dynamics and sequential changes. Together, these modules form a joint episodic world model with rich, temporally extended representations. System 1 serves as the policy module, applying action flow matching to integrate these outputs with the robot state and action history and generate real-time actions.
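As a concrete illustration of the flow-matching action generation in System 1, the sketch below follows the standard conditional flow-matching recipe: train a velocity field on linear interpolants between noise and expert action chunks, then integrate it with a few Euler steps at inference. The `velocity_net` interface, step count, and conditioning format are assumptions for illustration, not the paper's exact configuration.

```python
import torch


def flow_matching_loss(velocity_net, context, expert_actions):
    """Train a velocity field v_theta(x_t, t | context) on noise-to-action paths."""
    noise = torch.randn_like(expert_actions)
    t = torch.rand(expert_actions.shape[0], 1, 1, device=expert_actions.device)
    # Linear interpolation between noise (t=0) and expert actions (t=1).
    x_t = (1 - t) * noise + t * expert_actions
    target_velocity = expert_actions - noise          # constant along the path
    pred_velocity = velocity_net(x_t, t, context)
    return ((pred_velocity - target_velocity) ** 2).mean()


@torch.no_grad()
def sample_actions(velocity_net, context, horizon, action_dim, steps=10):
    """Integrate the learned velocity field from noise to an action chunk."""
    x = torch.randn(context.shape[0], horizon, action_dim, device=context.device)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((x.shape[0], 1, 1), i * dt, device=x.device)
        x = x + dt * velocity_net(x, t, context)      # Euler step
    return x
```

In this sketch, `context` stands for the fused episodic representation: the System 2 and System 3 outputs concatenated with the robot state and action history, as described in the caption above.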

Results

Result Image 1
Result Image 2
  • Our TriVLA performs exceptionally well on long-horizon tasks. Taking the CALVIN simulation benchmark as an example, when given multiple sequential instructions, TriVLA integrates world knowledge for intent understanding and leverages the world model for future prediction, enabling effective execution of long-horizon tasks.
  • Visualization of the one-step visual representations from the dynamics perception module. The representations provide valuable information about physical dynamics, even though textures and fine details are not fully precise; a sketch of how such a one-step readout could be obtained follows below.
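The one-step visualization above corresponds to running the video diffusion backbone for a single denoising pass and reading out its intermediate activations as the dynamics representation. The sketch below is a hypothetical illustration of that readout; the `vdm_unet` call signature, the `return_features` flag, and the noising scheme are assumptions, not the model's actual API.

```python
import torch


@torch.no_grad()
def one_step_dynamics_features(vdm_unet, frames, text_embedding, t=0.5):
    """Run a single denoising pass and return intermediate features as the
    dynamics representation (illustrative; the real interface may differ)."""
    noise = torch.randn_like(frames)
    noisy_frames = (1 - t) * frames + t * noise       # partially noised video clip
    timestep = torch.full((frames.shape[0],), t, device=frames.device)
    # Assume the backbone can expose its intermediate activations.
    _, features = vdm_unet(noisy_frames, timestep, text_embedding,
                           return_features=True)
    return features
```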