Open-vocabulary 3D visual grounding and reasoning aim to localize objects in a scene based on implicit language descriptions, even when the objects are occluded. This ability is crucial for tasks such as vision-language navigation and autonomous robotics. However, current methods struggle because they rely heavily on fine-tuning with 3D annotations and mask proposals, which limits their ability to handle the diverse semantics and common knowledge required for effective reasoning.
To address this, we propose ReasonGrounder, an LVLM-guided framework that uses hierarchical 3D feature Gaussian fields for adaptive grouping based on physical scale, enabling open-vocabulary 3D grounding and reasoning. ReasonGrounder interprets implicit instructions using large vision-language models (LVLMs) and localizes occluded objects through 3D Gaussian splatting. By incorporating 2D segmentation masks from the Segment Anything Model (SAM) and multi-view CLIP embeddings, ReasonGrounder selects Gaussian groups based on object scale, enabling accurate localization through both explicit and implicit language understanding, even in novel, occluded views.
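As a rough illustration of the scale-aware group selection described above, the sketch below scores candidate Gaussian groups by CLIP cosine similarity to the query and penalizes scale mismatch. The scoring rule, the `alpha` weight, and all function names are illustrative assumptions, not the paper's actual selection criterion.

```python
import numpy as np

def select_gaussian_group(group_embeddings, group_scales, text_embedding,
                          target_scale, alpha=0.5):
    """Pick the candidate Gaussian group best matching a language query.

    group_embeddings : (G, D) averaged multi-view CLIP features per group
    group_scales     : (G,) physical scale of each group
    text_embedding   : (D,) CLIP embedding of the instruction
    target_scale     : estimated scale of the intended object
    (Hypothetical scoring rule for illustration only.)
    """
    # Normalize for cosine similarity.
    g = group_embeddings / np.linalg.norm(group_embeddings, axis=1, keepdims=True)
    t = text_embedding / np.linalg.norm(text_embedding)
    sim = g @ t                                                  # semantic match
    scale_penalty = np.abs(np.log(group_scales / target_scale))  # scale mismatch
    return int(np.argmax(sim - alpha * scale_penalty))
```

In this toy scoring rule, a group that matches the text but sits at the wrong physical scale (e.g. a whole table versus a cup on it) is down-weighted, which mirrors the role scale plays in disambiguating Gaussian groups.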
We also contribute ReasoningGD, a new dataset containing over 10K scenes and 2 million annotations for evaluating open-vocabulary 3D grounding and amodal perception under occlusion. Experiments show that ReasonGrounder improves 3D grounding accuracy in real-world scenarios.
ReasonGrounder employs 3D Gaussian Splatting (3DGS), which represents scenes as collections of 3D Gaussians with tile-based splatting for efficient, high-resolution rendering. In particular, a standard 3DGS scene is constructed, and 2D segmentation masks from SAM~\cite{kirillov2023segment} are projected into a 3D field. For each mask, a 3D scale is computed from the depth rendered by the 3DGS. To enhance each Gaussian’s view-independent representation, ReasonGrounder appends to each Gaussian a latent feature vector that is mapped into hierarchical language and instance features via two shallow MLPs: a language mapper and an instance mapper. CLIP embeddings supervise the language features across views for multi-view consistency, while the instance features refine 2D mask candidates using a contrastive loss and the 3D scale, supporting feature-based Gaussian grouping. Further, to aid localization, an instruction-conditioned mechanism guided by the LVLM selects the reference view most aligned with the instruction. Together, the selected view and the instruction enable comprehension of the intended target object. Using the 3D scale and hierarchical feature Gaussians, ReasonGrounder achieves precise 3D localization and amodal perception in novel views.
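The hierarchical feature lifting above can be pictured as two shallow MLP heads over a per-Gaussian latent vector. The NumPy sketch below shows only the data flow with untrained random weights; the layer widths, feature dimensions, and mapper architectures are assumptions, and in the actual method the language mapper is supervised by multi-view CLIP embeddings while the instance mapper is trained with a contrastive loss.

```python
import numpy as np

rng = np.random.default_rng(0)

def shallow_mlp(dims):
    """Build a small MLP as (weights, biases) lists; a stand-in for the
    paper's shallow language/instance mappers (architecture assumed)."""
    Ws = [rng.standard_normal((dims[i], dims[i + 1])) * 0.1
          for i in range(len(dims) - 1)]
    bs = [np.zeros(dims[i + 1]) for i in range(len(dims) - 1)]
    return Ws, bs

def forward(mlp, x):
    Ws, bs = mlp
    for i, (W, b) in enumerate(zip(Ws, bs)):
        x = x @ W + b
        if i < len(Ws) - 1:
            x = np.maximum(x, 0.0)  # ReLU on hidden layers only
    return x

# Each Gaussian carries a compact latent feature; two mappers lift it to a
# CLIP-aligned language feature and a contrastive instance feature.
latent_dim, lang_dim, inst_dim = 16, 512, 32      # dimensions are assumptions
language_mapper = shallow_mlp([latent_dim, 64, lang_dim])
instance_mapper = shallow_mlp([latent_dim, 64, inst_dim])

latent = rng.standard_normal((1000, latent_dim))  # 1000 Gaussians in the scene
lang_feat = forward(language_mapper, latent)      # supervised by multi-view CLIP
inst_feat = forward(instance_mapper, latent)      # refined via contrastive loss
```

Keeping a single compact latent per Gaussian and decoding it with shallow heads keeps the per-Gaussian memory cost low, which matters when a scene contains hundreds of thousands of Gaussians.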
This paper introduces the ReasoningGD dataset, which encompasses a diverse range of occlusion scenarios and offers comprehensive, accurate annotations. These annotations include both the visible and occluded parts of target objects from various perspectives, enabling more robust evaluation of reasoning capabilities in 3D visual grounding. The dataset comprises over 10K scenes, each featuring 10 to 15 objects. Each scene includes 100 viewing angles, with annotations provided for both the visible mask of each object at each angle and the full mask, which includes occluded parts. In total, the dataset contains over 2 million detailed annotations.
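The stated totals can be sanity-checked with simple arithmetic. The counting convention in the snippet (one visible-mask and one full-mask annotation per scene per viewing angle) is an assumption; counting per object, as the text describes, would only yield a larger figure, still consistent with "over 2 million".

```python
# Sanity check on the ReasoningGD totals; the counting convention here
# is an assumption made for the arithmetic, not the dataset's definition.
scenes = 10_000        # "over 10K scenes"
views_per_scene = 100  # viewing angles per scene
masks_per_view = 2     # visible mask + full (amodal) mask
annotations = scenes * views_per_scene * masks_per_view
print(annotations)     # -> 2000000, consistent with "over 2 million"
```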