Recent advances in robot manipulation have leveraged pre-trained vision-language models (VLMs) and explored integrating 3D spatial signals into these models for effective action prediction, giving rise to the promising vision-language-action (VLA) paradigm. However, most existing approaches overlook the importance of active perception: they typically rely on static, wrist-mounted cameras that provide an end-effector-centric viewpoint. As a result, these models are unable to adaptively select optimal viewpoints or resolutions during task execution, which significantly limits their performance in long-horizon tasks and fine-grained manipulation scenarios.
To address these limitations, we propose ActiveVLA, a novel vision-language-action framework that empowers robots with active perception capabilities for high-precision, fine-grained manipulation. ActiveVLA adopts a coarse-to-fine paradigm, dividing the process into two stages: (1) Critical region localization, where it projects 3D inputs onto multi-view 2D projections to identify critical 3D regions; and (2) Active perception optimization, where it uses an active view selection strategy to choose optimal viewpoints and applies a 3D zoom-in to improve resolution in key areas. Extensive experiments demonstrate that ActiveVLA achieves precise 3D manipulation and outperforms state-of-the-art baselines on three simulation benchmarks (RLBench, COLOSSEUM, GemBench). Moreover, ActiveVLA transfers seamlessly to real-world scenarios, enabling robots to learn high-precision tasks in complex environments.
ActiveVLA adopts a two-stage, coarse-to-fine strategy. In the coarse stage, three orthographic projections of the 3D scene and a language instruction are processed by the VLM backbone (PaliGemma) to generate 2D heatmaps, which are then back-projected to locate the most relevant 3D region.
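To make the back-projection step concrete, below is a minimal sketch of how per-view heatmap scores could be lifted back onto the 3D scene to locate the critical region. The array shapes, axis ordering, and the simple arg-max scoring rule are illustrative assumptions, not the paper's exact implementation.

```python
# Coarse-stage sketch: back-project three orthographic heatmaps onto a point cloud
# and take the highest-scoring point as the critical-region center (assumed convention).
import numpy as np

def locate_critical_region(points, heatmaps, workspace_min, workspace_max, res=224):
    """points: (N, 3) scene points; heatmaps: dict of (res, res) maps for 'top'/'front'/'side'."""
    # Normalize points into [0, 1] within the workspace bounds, then map to pixel indices.
    norm = (points - workspace_min) / (workspace_max - workspace_min)
    pix = np.clip((norm * (res - 1)).astype(int), 0, res - 1)

    # Each orthographic view keeps two axes: top keeps (x, y), front keeps (x, z), side keeps (y, z).
    view_axes = {"top": (0, 1), "front": (0, 2), "side": (1, 2)}
    score = np.zeros(len(points))
    for view, (u, v) in view_axes.items():
        score += heatmaps[view][pix[:, v], pix[:, u]]  # accumulate heatmap mass per 3D point

    # The highest-scoring point serves as the coarse critical-region center.
    return points[np.argmax(score)]
```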
In the fine stage, an Active Perception module takes over. It first performs Active Viewpoint Selection to choose new camera views that maximize visibility and diversity based on the identified region. Then, it executes an Active 3D Zoom-in strategy to render high-resolution details of the critical area. The refined visual inputs are fed back into the VLM to predict heatmaps for key end-effector positions, while an action decoder outputs the final 3D action (rotation, gripper state, etc.). This closed-loop design allows the robot to "look closer" before acting.
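The viewpoint-selection step can be sketched as a greedy search that scores candidate camera positions by a visibility term (few scene points occluding the camera-to-region ray) and a diversity term (angular distance from views already chosen). The candidate sampling, occlusion proxy, and weighting below are assumptions for illustration, not the released code.

```python
# Hedged sketch of Active Viewpoint Selection: greedily pick k cameras that see the
# critical region with little occlusion and differ from each other in viewing angle.
import numpy as np

def select_viewpoints(candidates, region_center, scene_points, k=2, w_div=0.5):
    """candidates: (M, 3) candidate camera positions; scene_points: (P, 3) possible occluders."""
    chosen = []
    for _ in range(k):
        best_cam, best_score = None, -np.inf
        for cam in candidates:
            if any(np.allclose(cam, c) for c in chosen):
                continue  # do not pick the same viewpoint twice
            ray = region_center - cam
            offsets = scene_points - cam
            t = np.clip(offsets @ ray / (ray @ ray), 0.0, 1.0)
            gap = np.linalg.norm(offsets - t[:, None] * ray, axis=1)
            visibility = -np.count_nonzero(gap < 0.02)  # penalize occluders within 2 cm of the ray
            # Diversity: angular distance to viewpoints already selected.
            dirs = [(c - region_center) / np.linalg.norm(c - region_center) for c in chosen]
            d = (cam - region_center) / np.linalg.norm(cam - region_center)
            diversity = min((np.arccos(np.clip(d @ u, -1.0, 1.0)) for u in dirs), default=np.pi)
            score = visibility + w_div * diversity
            if score > best_score:
                best_cam, best_score = cam, score
        chosen.append(best_cam)
    return chosen
```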
Fine-grained manipulation in RLBench simulation.
Fine-grained manipulation in COLOSSEUM simulation.
We visualize the internal reasoning process of ActiveVLA on fine-grained manipulation tasks. This corresponds to Figure 3 in the paper.
As shown above, the process is divided into two stages:
1. Coarse Stage (Left of dotted line): The model projects the 3D inputs onto orthographic images (a) and predicts heatmaps that roughly mark the critical regions (b).
2. Fine Stage (Right of dotted line): Based on these regions, ActiveVLA performs Active View Selection (c) to find the best angle and Active 3D Zoom-in (d) to capture high-resolution details (a sketch of the zoom-in step follows below). This enables precise manipulation in complex scenes such as "Sweeping dirt to the dustpan" or "Placing jello in the cupboard".
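As a rough illustration of the Active 3D Zoom-in in (d), one can crop the point cloud around the predicted critical region and re-render the crop at the model's full input resolution, so the same pixel budget covers a much smaller physical area. The crop radius and helper names here are hypothetical, not taken from the released code.

```python
# Illustrative sketch of the 3D zoom-in: keep only points near the critical region,
# then re-project the crop at the usual image resolution for higher detail per pixel.
import numpy as np

def zoom_in_crop(points, colors, region_center, radius=0.10):
    """Keep points within `radius` meters of the critical region (radius is an assumption)."""
    mask = np.linalg.norm(points - region_center, axis=1) < radius
    return points[mask], colors[mask]
```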
We evaluate ActiveVLA in real-world scenarios characterized by complex spatial structures and severe occlusions. This corresponds to Figure 4 in the paper.
The figure demonstrates ActiveVLA's capability across four challenging tasks.
In all cases, ActiveVLA actively perceives the environment to resolve ambiguities and precisely completes the tasks, demonstrating strong generalization to real-world constraints.