Recent advances in robot manipulation have leveraged pre-trained vision-language models (VLMs) and explored integrating 3D spatial signals into these models for effective action prediction, giving rise to the promising vision-language-action (VLA) paradigm. However, most existing approaches overlook the importance of active perception: they typically rely on static, wrist-mounted cameras that provide an end-effector-centric viewpoint. As a result, these models are unable to adaptively select optimal viewpoints or resolutions during task execution, which significantly limits their performance in long-horizon tasks and fine-grained manipulation scenarios.
To address these limitations, we propose ActiveVLA, a novel vision-language-action framework that empowers robots with active perception capabilities for high-precision, fine-grained manipulation. ActiveVLA adopts a coarse-to-fine paradigm, dividing the process into two stages: (1) Critical region localization, where it projects 3D inputs onto multi-view 2D projections to identify critical 3D regions; and (2) Active perception optimization, where it uses an active view selection strategy to choose optimal viewpoints and applies a 3D zoom-in to improve resolution in key areas. Extensive experiments demonstrate that ActiveVLA achieves precise 3D manipulation and outperforms state-of-the-art baselines on three simulation benchmarks (RLBench, COLOSSEUM, GemBench). Moreover, ActiveVLA transfers seamlessly to real-world scenarios, enabling robots to learn high-precision tasks in complex environments.
ActiveVLA adopts a two-stage, coarse-to-fine strategy. In the coarse stage, three orthographic projections of the 3D scene and a language instruction are processed by the VLM backbone (PaliGemma) to generate 2D heatmaps, which are then back-projected to locate the most relevant 3D region.
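To make the back-projection step concrete, below is a minimal sketch of how per-view heatmap scores could be lifted back onto the 3D scene to locate the critical region. The array shapes, axis ordering, and the simple arg-max scoring rule are illustrative assumptions, not the paper's exact implementation.

```python
# Coarse-stage sketch: back-project three orthographic heatmaps onto a point cloud
# and take the highest-scoring point as the critical-region center (assumed convention).
import numpy as np

def locate_critical_region(points, heatmaps, workspace_min, workspace_max, res=224):
    """points: (N, 3) scene points; heatmaps: dict of (res, res) maps for 'top'/'front'/'side'."""
    # Normalize points into [0, 1] within the workspace bounds, then map to pixel indices.
    norm = (points - workspace_min) / (workspace_max - workspace_min)
    pix = np.clip((norm * (res - 1)).astype(int), 0, res - 1)

    # Each orthographic view keeps two axes: top keeps (x, y), front keeps (x, z), side keeps (y, z).
    view_axes = {"top": (0, 1), "front": (0, 2), "side": (1, 2)}
    score = np.zeros(len(points))
    for view, (u, v) in view_axes.items():
        score += heatmaps[view][pix[:, v], pix[:, u]]  # accumulate heatmap mass per 3D point

    # The highest-scoring point serves as the coarse critical-region center.
    return points[np.argmax(score)]
```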
In the fine stage, an Active Perception module takes over. It first performs Active Viewpoint Selection to choose new camera views that maximize visibility and diversity based on the identified region. Then, it executes an Active 3D Zoom-in strategy to render high-resolution details of the critical area. The refined visual inputs are fed back into the VLM to predict heatmaps for key end-effector positions, while an action decoder outputs the final 3D action (rotation, gripper state, etc.). This closed-loop design allows the robot to "look closer" before acting.
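The viewpoint-selection step can be sketched as a greedy search that scores candidate camera positions by a visibility term (few scene points occluding the camera-to-region ray) and a diversity term (angular distance from views already chosen). The candidate sampling, occlusion proxy, and weighting below are assumptions for illustration, not the released code.

```python
# Hedged sketch of Active Viewpoint Selection: greedily pick k cameras that see the
# critical region with little occlusion and differ from each other in viewing angle.
import numpy as np

def select_viewpoints(candidates, region_center, scene_points, k=2, w_div=0.5):
    """candidates: (M, 3) candidate camera positions; scene_points: (P, 3) possible occluders."""
    chosen = []
    for _ in range(k):
        best_cam, best_score = None, -np.inf
        for cam in candidates:
            if any(np.allclose(cam, c) for c in chosen):
                continue  # do not pick the same viewpoint twice
            ray = region_center - cam
            offsets = scene_points - cam
            t = np.clip(offsets @ ray / (ray @ ray), 0.0, 1.0)
            gap = np.linalg.norm(offsets - t[:, None] * ray, axis=1)
            visibility = -np.count_nonzero(gap < 0.02)  # penalize occluders within 2 cm of the ray
            # Diversity: angular distance to viewpoints already selected.
            dirs = [(c - region_center) / np.linalg.norm(c - region_center) for c in chosen]
            d = (cam - region_center) / np.linalg.norm(cam - region_center)
            diversity = min((np.arccos(np.clip(d @ u, -1.0, 1.0)) for u in dirs), default=np.pi)
            score = visibility + w_div * diversity
            if score > best_score:
                best_cam, best_score = cam, score
        chosen.append(best_cam)
    return chosen
```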
Fine-grained manipulation in RLBench simulation.
Fine-grained manipulation in COLOSSEUM simulation.
We visualize the internal reasoning process of ActiveVLA on fine-grained manipulation tasks. This corresponds to Figure 3 in the paper.
As shown above, the process is divided into two stages:
1. Coarse Stage (Left of dotted line): The model projects the 3D inputs onto orthographic images (a) and predicts heatmaps that roughly mark the critical regions (b).
2. Fine Stage (Right of dotted line): Based on these regions, ActiveVLA performs Active View Selection (c) to find the best angle and Active 3D Zoom-in (d) to capture high-resolution details (a sketch of the zoom-in step follows below). This enables precise manipulation in complex scenes such as "Sweeping dirt to the dustpan" or "Placing jello in the cupboard".
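As a rough illustration of the Active 3D Zoom-in in (d), one can crop the point cloud around the predicted critical region and re-render the crop at the model's full input resolution, so the same pixel budget covers a much smaller physical area. The crop radius and helper names here are hypothetical, not taken from the released code.

```python
# Illustrative sketch of the 3D zoom-in: keep only points near the critical region,
# then re-project the crop at the usual image resolution for higher detail per pixel.
import numpy as np

def zoom_in_crop(points, colors, region_center, radius=0.10):
    """Keep points within `radius` meters of the critical region (radius is an assumption)."""
    mask = np.linalg.norm(points - region_center, axis=1) < radius
    return points[mask], colors[mask]
```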
We evaluate ActiveVLA in real-world scenarios characterized by complex spatial structures and severe occlusions. This corresponds to Figure 4 in the paper.
The figure demonstrates ActiveVLA's capability across four challenging tasks.
In all cases, ActiveVLA actively perceives the environment to resolve ambiguities and precisely completes the tasks, demonstrating strong generalization to real-world constraints.