ReasonGrounder: LVLM-Guided Hierarchical Feature Splatting for Open-Vocabulary 3D Visual Grounding and Reasoning

1Fudan University, 2Nanyang Technological University, 3Shanghai Innovation Institute, 4NeuHelium Co., Ltd

CVPR 2025

Teaser Image

ReasonGrounder tackles open-vocabulary 3D visual grounding and reasoning. In a given scene, the user observes from a viewpoint with occlusions and asks questions such as: "Can you localize the red, round, sweet fruit on the table that is partially occluded by the toy sheep?" Open-vocabulary 3D visual grounding and reasoning seeks to interpret complex implicit queries, deduce answers, and accurately localize the target object, even when it is partially or fully occluded from the current viewpoint.

Abstract

Open-vocabulary 3D visual grounding and reasoning aim to localize objects in a scene based on implicit language descriptions, even when they are occluded. This ability is crucial for tasks such as vision-language navigation and autonomous robotics. However, current methods struggle because they rely heavily on fine-tuning with 3D annotations and mask proposals, which limits their ability to handle diverse semantics and common knowledge required for effective reasoning.

To address this, we propose ReasonGrounder, an LVLM-guided framework that uses hierarchical 3D feature Gaussian fields for adaptive grouping based on physical scale, enabling open-vocabulary 3D grounding and reasoning. ReasonGrounder interprets implicit instructions using large vision-language models (LVLMs) and localizes occluded objects through 3D Gaussian splatting. By incorporating 2D segmentation masks from the Segment Anything Model (SAM) and multi-view CLIP embeddings, ReasonGrounder selects Gaussian groups according to the target object's scale, enabling accurate localization through both explicit and implicit language understanding, even in novel, occluded views.

We also contribute ReasoningGD, a new dataset containing over 10K scenes and 2 million annotations for evaluating open-vocabulary 3D grounding and amodal perception under occlusion. Experiments show that ReasonGrounder improves 3D grounding accuracy in real-world scenarios.

Introduction

Introduction Image

ReasonGrounder employs 3D Gaussian Splatting (3DGS), which represents scenes as 3D Gaussian collections with tile-based splatting for efficient, high-resolution rendering. In particular, a standard 3DGS scene is constructed, and 2D segmentation masks from SAM are projected into a 3D field. For each mask, a 3D scale is calculated from the depth rendered by the 3DGS. To enhance each Gaussian's view-independent representation, ReasonGrounder appends to each Gaussian a latent feature vector that is mapped into hierarchical language and instance features via two shallow MLPs: a language mapper and an instance mapper. CLIP embeddings supervise the language features across views for multi-view consistency, while the instance features refine 2D mask candidates using a contrastive loss and the 3D scale, supporting feature-based Gaussian grouping. Further, to aid localization, an LVLM-guided, instruction-conditioned mechanism selects the reference view most aligned with the instruction. This view and the instruction together enable comprehension of the intended target object. Using the 3D scale and hierarchical feature Gaussians, ReasonGrounder achieves precise 3D localization and amodal perception in novel views.
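The two shallow mappers described above can be sketched as follows. This is a minimal NumPy illustration: all dimensions (a 32-D latent per Gaussian, a 512-D CLIP-aligned language feature, a 16-D instance feature), layer widths, and random initialization are assumptions for the example, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def shallow_mlp(x, w1, b1, w2, b2):
    """Two-layer MLP with ReLU, applied independently to each Gaussian."""
    h = np.maximum(x @ w1 + b1, 0.0)
    return h @ w2 + b2

# Hypothetical dimensions (assumed, not from the paper): 32-D latent per
# Gaussian, 512-D CLIP-aligned language feature, 16-D instance feature.
N, D_LAT, D_LANG, D_INST = 1000, 32, 512, 16
latents = rng.normal(size=(N, D_LAT)).astype(np.float32)

# Language mapper: latent -> language feature, supervised by CLIP embeddings.
w1l, b1l = rng.normal(size=(D_LAT, 64)) * 0.1, np.zeros(64)
w2l, b2l = rng.normal(size=(64, D_LANG)) * 0.1, np.zeros(D_LANG)
lang_feats = shallow_mlp(latents, w1l, b1l, w2l, b2l)

# Instance mapper: latent -> instance feature used for Gaussian grouping.
w1i, b1i = rng.normal(size=(D_LAT, 64)) * 0.1, np.zeros(64)
w2i, b2i = rng.normal(size=(64, D_INST)) * 0.1, np.zeros(D_INST)
inst_feats = shallow_mlp(latents, w1i, b1i, w2i, b2i)

print(lang_feats.shape, inst_feats.shape)  # (1000, 512) (1000, 16)
```

In training, the language branch would be supervised by multi-view CLIP embeddings and the instance branch by a contrastive loss over SAM masks; here both are left untrained purely to show the data flow.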

Main Contributions:

  • Enhanced 3D Visual Grounding: We address limitations in open-vocabulary 3D visual grounding, such as reliance on 3D annotations and limited semantic understanding, by using hierarchical 3D Gaussian fields with LVLMs for robust grounding and reasoning.
  • A Novel ReasonGrounder Framework: The proposed ReasonGrounder leverages hierarchical 3D feature Gaussian fields for adaptive Gaussian grouping with 3D scale, enabling effective open-vocabulary 3D visual grounding and reasoning. ReasonGrounder interprets implicit instructions using large vision-language models (LVLM) and accurately localizes occluded objects with hierarchical 3D feature Gaussian splatting.
  • Hierarchical Feature Splatting and Amodal Perception: ReasonGrounder equips 3D Gaussians with hierarchical features and selects Gaussian groups based on the target object's scale. The LVLM aids in interpreting complex instructions and locating objects even when they are partially or fully occluded.
  • Dataset Contributions: A new ReasoningGD dataset offers over 10K complex scenes with 2 million annotations, including point clouds, RGB-D images with detailed labels, camera poses, and 2D modal/amodal masks of views, enabling rigorous evaluation of 3D visual grounding with implicit instruction handling and occlusion robustness.

Results

Results Image
Results Image
  • ReasonGrounder uses an LVLM to process implicit queries requiring complex reasoning and to select the Gaussian group for 3D localization. The results confirm that ReasonGrounder excels in localizing target objects, benefiting from 3D consistency. Additionally, ReasonGrounder provides explanatory answers, showcasing its strength in implicit instruction reasoning, 3D understanding, and conversation. To test robustness, we selected five challenging scenes featuring objects occupying small image proportions, multi-hierarchical structures, and similar-looking objects, along with ten text queries per scene from the LERF and ReasoningGD datasets.
  • Existing open-vocabulary 3D visual grounding methods struggle with localizing complete objects in novel views with occlusion, limiting their real-world applicability. In contrast, ReasonGrounder uses hierarchical Gaussian grouping to effectively tackle this issue. The ReasoningGD dataset introduced here includes scenes with amodal binary mask annotations, accurately representing the full shape of occluded objects from different views. These results demonstrate that ReasonGrounder successfully achieves amodal perception, accurately localizing complete objects regardless of the occlusion level.
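Amodal-perception results like these are naturally scored by comparing a predicted mask against the amodal ground-truth mask rather than only the visible one. Below is a minimal sketch of such an IoU check; the toy masks and this particular scoring choice are assumptions for illustration, not the paper's evaluation protocol.

```python
import numpy as np

def mask_iou(pred, gt):
    """Intersection-over-union of two binary masks of equal shape."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter) / union if union else 1.0

# Toy 6x6 view: the amodal ground truth covers a 4x4 object; the modal
# (visible) mask loses the half hidden behind an occluder.
amodal_gt = np.zeros((6, 6), bool)
amodal_gt[1:5, 1:5] = True
modal = amodal_gt.copy()
modal[1:5, 3:5] = False          # occluded half is invisible
pred = amodal_gt.copy()          # a perfect amodal prediction

print(mask_iou(modal, amodal_gt))  # 0.5 — visible half only
print(mask_iou(pred, amodal_gt))   # 1.0 — full occluded shape recovered
```

The gap between the two scores is exactly what amodal annotations make measurable: a method that only segments visible pixels caps out at the modal IoU, while amodal perception can close the gap to 1.0.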

ReasoningGD dataset

Dataset Image

This paper introduces the ReasoningGD dataset, which encompasses a diverse range of occlusion scenarios and offers comprehensive, accurate annotations. These annotations cover both the visible and occluded parts of target objects from various perspectives, enabling more robust evaluation of reasoning capabilities in 3D visual grounding. The dataset comprises over 10K scenes, each featuring 10 to 15 objects. Each scene includes 100 viewing angles, with annotations provided for both the visible (modal) mask of each object at each angle and the full (amodal) mask, which includes occluded parts. In total, the dataset contains over 2 million detailed annotations.
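Under a hypothetical schema, a per-view annotation record of this kind might look like the following sketch, together with a helper that derives an object's occlusion rate from its modal and amodal masks. All field names and the record layout are illustrative assumptions, not the dataset's actual format.

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class ViewAnnotation:
    """Hypothetical per-view record mirroring the described annotations
    (camera pose, RGB-D, and modal/amodal masks per object); the field
    names are illustrative, not the dataset's actual schema."""
    camera_pose: np.ndarray                           # 4x4 camera transform
    rgb: np.ndarray                                   # HxWx3 color image
    depth: np.ndarray                                 # HxW depth map
    modal_masks: dict = field(default_factory=dict)   # obj_id -> visible mask
    amodal_masks: dict = field(default_factory=dict)  # obj_id -> full mask

def occlusion_rate(modal, amodal):
    """Fraction of the object hidden in this view (0 = fully visible)."""
    full = amodal.sum()
    return 1.0 - modal.sum() / full if full else 0.0

# Toy record: a 4x4 object with one 4-pixel column occluded (4 of 16 pixels).
amodal = np.zeros((6, 6), bool)
amodal[1:5, 1:5] = True
modal = amodal.copy()
modal[1:5, 4:5] = False
ann = ViewAnnotation(camera_pose=np.eye(4),
                     rgb=np.zeros((6, 6, 3), np.uint8),
                     depth=np.ones((6, 6)),
                     modal_masks={"apple": modal},
                     amodal_masks={"apple": amodal})
print(occlusion_rate(ann.modal_masks["apple"], ann.amodal_masks["apple"]))  # 0.25
```

Pairing modal and amodal masks per object per view is what lets a benchmark bin results by occlusion level, from fully visible to fully hidden.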