TriVLA: A Triple-System-Based Unified Vision-Language-Action Model for General Robot Control

Fudan University, Shanghai Innovation Institute
Teaser Image

TriVLA is a Vision-Language-Action (VLA) system with a triple-system architecture. It converts image observations and language instructions into token sequences, which are processed by a Vision-Language Model (VLM) for reasoning with common knowledge and by a Stable Video Diffusion model (VDM) that serves as the world model for predicting both current and future dynamics. The outputs of the VLM and VDM, together with robot state and action encodings, are fed into the policy learning module to generate motor actions. TriVLA can be used directly to perform tasks from prompts or fine-tuned on high-quality data to support complex multi-stage tasks.
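For concreteness, below is a minimal sketch of this data flow in PyTorch. The module classes, dimensions, and names (`TriVLASketch`, the plain transformer stand-ins for the VLM, VDM, and policy head, the 32-dimensional state) are hypothetical placeholders chosen for illustration, not the released TriVLA implementation.

```python
# Minimal sketch of the triple-system data flow (hypothetical stand-in modules).
import torch
import torch.nn as nn


class TriVLASketch(nn.Module):
    def __init__(self, d_model=512, state_dim=32, action_dim=7, horizon=16):
        super().__init__()
        # System 2: vision-language module (stand-in for a pre-trained VLM).
        self.vlm = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), num_layers=2)
        # System 3: dynamics perception module (stand-in for a video diffusion model).
        self.vdm = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), num_layers=2)
        # System 1: policy learning module that cross-attends to System 2/3 tokens.
        self.policy = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True), num_layers=2)
        self.state_enc = nn.Linear(state_dim, d_model)      # robot state encoder
        self.action_head = nn.Linear(d_model, action_dim)   # motor-action decoder
        self.query = nn.Parameter(torch.randn(horizon, d_model))  # action queries

    def forward(self, obs_tokens, lang_tokens, state):
        # System 2: reason over image + language tokens with common knowledge.
        vlm_tokens = self.vlm(torch.cat([obs_tokens, lang_tokens], dim=1))
        # System 3: predict current and future dynamics from the observation tokens.
        vdm_tokens = self.vdm(obs_tokens)
        # System 1: cross-attend action queries and encoded robot state to both outputs.
        context = torch.cat([vlm_tokens, vdm_tokens, self.state_enc(state)[:, None]], dim=1)
        queries = self.query[None].expand(obs_tokens.size(0), -1, -1)
        return self.action_head(self.policy(queries, context))  # (B, horizon, action_dim)


# Example usage with random tensors standing in for encoded observations.
model = TriVLASketch()
actions = model(torch.randn(1, 64, 512), torch.randn(1, 16, 512), torch.randn(1, 32))
```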

Abstract

Recent advancements in vision-language models (VLMs) for common-sense reasoning have led to the development of vision-language-action (VLA) models, enabling robots to perform generalized manipulation. Although existing autoregressive VLA methods adopt specific architectures such as dual-system designs to leverage large-scale pretrained knowledge, they tend to capture static information and often neglect the dynamic aspects vital for embodied tasks. To this end, we propose TriVLA, a unified Vision-Language-Action model with a triple-system architecture for general robot control. The vision-language module (System 2) interprets the environment through vision and language instructions. The dynamics perception module (System 3) inherently produces visual representations that encompass both current static information and predicted future dynamics, thereby providing valuable guidance for policy learning. TriVLA uses a pre-trained VLM and fine-tunes a pre-trained video foundation model on robot datasets together with Internet human manipulation data. The subsequent policy learning module (System 1) generates fluid motor actions in real time. Experimental evaluation demonstrates that TriVLA operates at approximately 36 Hz and surpasses state-of-the-art imitation learning baselines on standard simulation benchmarks as well as challenging real-world manipulation tasks.

Comparison between previous dual-system architectures and our triple-system approach.

Introduction Image

Our TriVLA employs a unified triple-system compositional architecture that integrates world knowledge (System 2) and a world model (System 3), both critical for general policy learning. Prior dual-system methods typically address only one of these components and fail to unify both.

Main Contributions:

  • A Unified Vision-Language-Action Framework: We propose a unified Vision-Language-Action model that integrates world knowledge and a world model for general policy learning across multiple robot embodiments.
  • Triple-System Compositional Architecture: TriVLA introduces a novel triple-system compositional architecture that combines high-level reasoning with dynamic predictive representations, enabling a robot to handle more complex prompts and long-horizon manipulation tasks.
  • State-of-the-art Performance: TriVLA outperforms baseline algorithms across simulated and real-world settings, including novel combinations of skills seen during training. This demonstrates its effectiveness in both alignment with human intent and long-horizon task success.

The pipeline of our TriVLA.

Introduction Image

Our TriVLA adopts a triple-system compositional architecture built on the existing dual-system structure. The System 2 vision-language module employs a pre-trained Eagle-2 Vision-Language Model (VLM) to process the robot’s visual inputs and language instructions, enabling environmental interpretation and task-goal understanding. The System 3 dynamics perception module uses a general-purpose video diffusion model to model entire video sequences and predict future frames from current observations and task instructions. Finally, the System 1 policy learning module, trained with action flow-matching, cross-attends to the output tokens from Systems 2 and 3 and employs embodiment-specific encoders and decoders to handle variable state and action dimensions when generating motor actions.
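To make the flow-matching step concrete, here is a hedged sketch of how a velocity network conditioned on the System 2/3 tokens could be trained with a rectified-flow-style target and then sampled by Euler integration. The `velocity_net` callable, the step count, and the action-chunk shapes are assumptions for illustration, not the paper's exact formulation.

```python
# Hedged sketch of an action flow-matching head (assumed interfaces, not the
# released TriVLA code). velocity_net(actions, t, context) -> velocity is assumed.
import torch
import torch.nn.functional as F


def flow_matching_loss(velocity_net, actions_gt, context_tokens):
    """Regress the velocity field toward the straight-line target (actions - noise)."""
    noise = torch.randn_like(actions_gt)               # (B, horizon, action_dim)
    t = torch.rand(actions_gt.size(0), 1, 1)           # random time in [0, 1)
    noisy = (1 - t) * noise + t * actions_gt           # linear interpolation path
    target = actions_gt - noise                        # constant velocity along the path
    pred = velocity_net(noisy, t.view(-1), context_tokens)
    return F.mse_loss(pred, target)


@torch.no_grad()
def sample_actions(velocity_net, context_tokens, horizon=16, action_dim=7, steps=10):
    """Integrate dA/dt = v(A, t | context) from noise (t=0) to an action chunk (t=1)."""
    batch = context_tokens.size(0)
    actions = torch.randn(batch, horizon, action_dim)  # start from Gaussian noise
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((batch,), i * dt)
        actions = actions + dt * velocity_net(actions, t, context_tokens)  # Euler step
    return actions
```

A small number of integration steps keeps action generation cheap, which is consistent with the real-time control rate reported in the abstract.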

Results

Result Image 1
Result Image 2
  • Our TriVLA performs exceptionally well on long-horizon tasks. Taking the CALVIN simulation benchmark as an example, when given multiple sequential instructions, TriVLA integrates world knowledge for intent understanding and leverages the world model for future prediction, enabling effective execution of long-horizon tasks.
  • Visualization of the one-step visual representations from the dynamics perception module. The representations provide valuable information about physical dynamics, even though textures and fine details are not fully precise; a sketch of how such features could be extracted follows below.
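As an illustration of how such one-step representations could be obtained, the sketch below runs a single denoising pass of a video diffusion model and captures intermediate activations with a forward hook. The names `video_unet`, `mid_block`, and the conditioning inputs are hypothetical; this is a sketch under assumed interfaces, not the actual TriVLA feature extractor.

```python
# Illustrative sketch: read one-step dynamics features out of a video diffusion
# model by hooking an intermediate block during a single denoising forward pass.
import torch


def one_step_dynamics_features(video_unet, mid_block, latents, timestep, cond_emb):
    """Return intermediate activations from one denoising step as System 3 tokens."""
    captured = {}

    def hook(_module, _inputs, output):
        captured["features"] = output  # keep the mid-block activations

    handle = mid_block.register_forward_hook(hook)
    try:
        with torch.no_grad():
            # The denoising prediction itself is discarded; only the internal
            # representation is used to condition the downstream policy.
            video_unet(latents, timestep, cond_emb)
    finally:
        handle.remove()
    return captured["features"]
```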

Video Results