A Neural Representation Framework with LLM-Driven Spatial Reasoning for Open-Vocabulary 3D Visual Grounding

1Fudan University, 2Shanghai Innovation Institute, 3Tongji University, 4Zhejiang University, 5NeuHelium Co., Ltd


  • Empowering Language Field with Spatial Reasoning for 3D Visual Grounding: We overcome the limitation of language field-based open-vocabulary 3D visual grounding, which struggles to localize instances using spatial relations in language queries, by introducing a visual properties-enhanced hierarchical feature field for robust spatial reasoning and accurate grounding.
  • A Novel SpatialReasoner Framework: The proposed SpatialReasoner leverages an LLM for spatial relation decomposition, alongside a visual properties-enhanced hierarchical feature field for spatial reasoning, to "think carefully" and "look carefully", enabling accurate step-by-step localization of target instances through explicit spatial reasoning.
  • Outstanding Generality and Performance: Extensive experiments demonstrate that our method can be seamlessly integrated into diverse 3D neural representations, outperforming baseline models in 3D visual grounding and empowering their spatial reasoning capabilities.
Teaser Image

We propose SpatialReasoner for neural representations: prior language field methods localize instances directly from complex user queries but fail to capture spatial relations in both the language query and the environment (left). Our SpatialReasoner instead utilizes a large language model (LLM) and a hierarchical feature field to think and look "step by step" (right). Crucially, LLM reasoning, such as spatial relation decomposition, together with hierarchical language and instance fields, allows it to "think carefully" and "look carefully" before localizing the target instance.

Abstract

Open-vocabulary 3D visual grounding aims to localize target objects based on free-form language queries, which is crucial for embodied AI applications such as autonomous navigation, robotics, and augmented reality. Learning 3D language fields through neural representations enables accurate understanding of 3D scenes from limited viewpoints and facilitates the localization of target objects in complex environments. However, existing language field methods struggle to accurately localize instances using spatial relations in language queries, such as "the book on the chair." This limitation mainly arises from inadequate reasoning about spatial relations in both language queries and 3D scenes. In this work, we propose SpatialReasoner, a novel neural representation-based framework with large language model (LLM)-driven spatial reasoning that constructs a visual properties-enhanced hierarchical feature field for open-vocabulary 3D visual grounding. To enable spatial reasoning in language queries, SpatialReasoner fine-tunes an LLM to capture spatial relations and explicitly infer instructions for the target, anchor, and spatial relation. To enable spatial reasoning in 3D scenes, SpatialReasoner incorporates visual properties (opacity and color) to construct a hierarchical feature field. This field represents language and instance features using distilled CLIP features and masks extracted via the Segment Anything Model (SAM). The field is then queried using the inferred instructions in a hierarchical manner to localize the target 3D instance based on the spatial relation in the language query. Notably, SpatialReasoner is not limited to a specific 3D neural representation; it serves as a framework adaptable to various representations, such as Neural Radiance Fields (NeRF) or 3D Gaussian Splatting (3DGS). Extensive experiments show that our framework can be seamlessly integrated into different neural representations, outperforming baseline models in 3D visual grounding while empowering their spatial reasoning capability.
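To make the query-decomposition step concrete, below is a minimal Python sketch of how a fine-tuned LLM's output could be parsed into explicit target, anchor, and spatial-relation instructions. The prompt wording, the JSON schema, and the names GroundingInstruction and decompose_query are illustrative assumptions, not the paper's released interface.

# Hypothetical sketch: parsing an LLM's spatial-relation decomposition.
# The prompt, JSON schema, and names are illustrative assumptions.
import json
from dataclasses import dataclass

@dataclass
class GroundingInstruction:
    target: str    # e.g. "book"
    anchor: str    # e.g. "chair"
    relation: str  # e.g. "on"

DECOMPOSE_PROMPT = (
    "Decompose the query into JSON with keys "
    "'target', 'anchor', and 'relation'.\nQuery: {query}"
)

def decompose_query(llm, query: str) -> GroundingInstruction:
    """Ask a (fine-tuned) LLM to split a query such as
    'the book on the chair' into explicit grounding instructions."""
    raw = llm(DECOMPOSE_PROMPT.format(query=query))  # llm: callable returning text
    fields = json.loads(raw)
    return GroundingInstruction(
        target=fields["target"],
        anchor=fields["anchor"],
        relation=fields["relation"],
    )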

The framework of our SpatialReasoner

Introduction Image

The overall pipeline of the SpatialReasoner framework. SpatialReasoner fine-tunes an LLM to decompose language queries into targets, anchors, and spatial relations. It employs SAM to generate 2D masks for the diverse instances in the training dataset. Using a neural representation model (e.g., NeRF or 3DGS) trained on multi-view images, it obtains instance scales via depth deprojection. By integrating visual properties (opacity and color) from scene reconstruction, SpatialReasoner constructs a hierarchical feature field that combines CLIP-extracted language features and mask-extracted instance features. The reasoned instructions then query these fields to identify target and anchor candidates, as sketched below. By analyzing spatial relations within the query and the 3D scene, SpatialReasoner precisely localizes the referenced 3D instance.
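For intuition, the following sketch shows how the reasoned instructions might query a hierarchical feature field and filter target candidates by their spatial relation to anchor candidates. The field.query interface, the relevancy scores, and the coarse geometric relation test are simplifying assumptions for illustration, not the actual implementation.

# Hypothetical sketch of hierarchical querying and spatial filtering.
# `field.query(text)` is assumed to return a list of (3D center, relevancy
# score) pairs; the relation test is deliberately simplified.
import numpy as np

def satisfies_relation(target_center, anchor_center, relation, tol=0.3):
    """Very coarse geometric test between two 3D instance centers."""
    d = np.asarray(target_center) - np.asarray(anchor_center)
    if relation in ("on", "above"):
        return d[2] > 0 and np.linalg.norm(d[:2]) < tol
    if relation in ("under", "below"):
        return d[2] < 0 and np.linalg.norm(d[:2]) < tol
    if relation in ("next to", "beside", "near"):
        return np.linalg.norm(d) < tol
    return False

def localize(field, instruction):
    """Query the feature field with the decomposed instruction and
    return the best target candidate satisfying the spatial relation."""
    targets = field.query(instruction.target)   # [(center, score), ...]
    anchors = field.query(instruction.anchor)
    best, best_score = None, -np.inf
    for t_center, t_score in targets:
        for a_center, a_score in anchors:
            if satisfies_relation(t_center, a_center, instruction.relation):
                if t_score + a_score > best_score:
                    best, best_score = t_center, t_score + a_score
    return best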

Results

Result Image 1
Result Image 2
  • Qualitative comparisons of spatial reasoning capability. The results demonstrate that our SpatialReasoner performs spatial reasoning and localizes the target instance according to the spatial relation in the query.
  • Qualitative comparisons of 3D visual grounding capability. The results demonstrate that our SpatialReasoner achieves superior accuracy in open-vocabulary 3D localization compared to other state-of-the-art methods.