
SD-VLM-8B

🌐 Homepage | 🤗 Dataset | 📖 arXiv | GitHub

🎯 The SD-VLM architecture enhances a standard Vision-Language Model (VLM) with 3D spatial awareness through a minimal yet effective modification.

1. Base VLM: Utilizes the LLaVA-1.5-7B framework, consisting of a CLIP-ViT vision encoder, a Vicuna large language model (LLM), and a linear projector connecting them.

2. Depth Encoding Core (DPE): The central innovation is the Depth Positional Encoding (DPE) module. It processes an input depth map (from an external estimator such as Depth-Anything-V2) to generate depth-aware embeddings (E_depth). These embeddings are then added element-wise to the standard image features (E_image) from the vision encoder, i.e. the fused visual features are simply E_image + E_depth.

This simple addition injects explicit 3D spatial priors into the model without altering the backbone architecture (see the fusion sketch below this list).

3. Training Approach: The model is fine-tuned efficiently on the MSMU spatial dataset for one epoch using LoRA, with the vision encoder kept frozen. This lets the LLM and projector learn to interpret the depth-enhanced visual features for quantitative reasoning (a LoRA setup sketch also follows below).

In essence, SD-VLM's structure is defined by a streamlined integration: it upgrades a standard VLM to understand 3D space by fusing depth information into visual features through a parameter-free additive operation, all trained efficiently on targeted data.
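
Below is a minimal PyTorch sketch of the additive depth fusion described above. It is illustrative rather than the released implementation: the module name, the small MLP over per-patch mean depth, and the patch-grid pooling are assumptions; only the element-wise addition of E_depth to E_image follows the description in this card.

import torch
import torch.nn as nn
import torch.nn.functional as F


class DepthPositionalEncoding(nn.Module):
    """Toy DPE-style module: turns a depth map into per-patch depth embeddings
    (E_depth) and adds them to the vision encoder's patch features (E_image)."""

    def __init__(self, hidden_dim: int = 1024):
        super().__init__()
        # How E_depth is produced is an assumption here (a small MLP over the
        # mean depth of each patch); only the additive fusion below follows
        # the description in this card.
        self.depth_mlp = nn.Sequential(
            nn.Linear(1, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, hidden_dim),
        )

    def forward(self, image_feats: torch.Tensor, depth_map: torch.Tensor) -> torch.Tensor:
        # image_feats: (B, N, C) patch features from the frozen CLIP-ViT encoder
        # depth_map:   (B, 1, H, W) depth from an external estimator (e.g. Depth-Anything-V2)
        B, N, C = image_feats.shape
        grid = int(N ** 0.5)                                          # assume a square patch grid
        patch_depth = F.adaptive_avg_pool2d(depth_map, (grid, grid))  # (B, 1, g, g) mean depth per patch
        patch_depth = patch_depth.flatten(2).transpose(1, 2)          # (B, N, 1)
        e_depth = self.depth_mlp(patch_depth)                         # (B, N, C)
        # The fusion itself is just E_image + E_depth, so downstream shapes stay unchanged.
        return image_feats + e_depth


# Toy usage with random stand-ins for CLIP-ViT-L/336 features (24x24 patches).
dpe = DepthPositionalEncoding(hidden_dim=1024)
e_image = torch.randn(1, 576, 1024)
depth = torch.rand(1, 1, 336, 336)
e_fused = dpe(e_image, depth)   # same shape as e_image: (1, 576, 1024)

Because the fusion is a plain addition, the fused features keep the shape the projector and LLM already expect, which is why no backbone changes are needed.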
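
For the training approach in point 3, a rough sketch of a LoRA setup with a frozen vision encoder is shown below, using the Hugging Face peft library. The rank, alpha, dropout, and target modules are illustrative assumptions (the official repo defines the real values), and `model` stands for a LLaVA-style base model loaded as in the Quick Start further down.

from peft import LoraConfig, get_peft_model

# Assumed LoRA hyperparameters and target modules, for illustration only.
lora_config = LoraConfig(
    r=128,
    lora_alpha=256,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # LLM attention projections
    task_type="CAUSAL_LM",
)

# `model` is a LLaVA-style model (see the Quick Start below for loading).
# Keep the vision encoder frozen, as described above; get_vision_tower() is
# the LLaVA-style accessor for the CLIP-ViT backbone.
for p in model.get_vision_tower().parameters():
    p.requires_grad_(False)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapters receive gradients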

Model Framework

Quick Start!

import copy
import os

import torch
from PIL import Image

from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path, process_images, tokenizer_image_token
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN

model_path = "cpystan/SD-VLM-7B"

tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path=model_path,
    model_base=None,
    model_name=get_model_name_from_path(model_path)
)

# Build an image-grounded prompt; replace the question and the image path with your own.
prompt = DEFAULT_IMAGE_TOKEN + "\nHow far apart are the two chairs, in meters?"
image_folder, image_file = "path/to/images", "example.jpg"
temperature = 0.2

input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors='pt').unsqueeze(0).cuda()
image = Image.open(os.path.join(image_folder, image_file)).convert('RGB')
ori_img = copy.deepcopy(image)  # keep an unprocessed copy of the RGB image; it is passed separately via ori_imgs
image_tensor = process_images([image], image_processor, model.config)[0]

with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        images=image_tensor.unsqueeze(0).half().to(input_ids.device),
        image_sizes=[image.size],
        do_sample=True if temperature > 0 else False,
        temperature=temperature,
        top_p=None,
        num_beams=1,
        ori_imgs=[ori_img],
        max_new_tokens=1024,
        use_cache=True,
    )
response = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
print(response)
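
The DPE module consumes a depth map from an external estimator such as Depth-Anything-V2 (see the architecture notes above); the Quick Start itself only passes the raw RGB image via ori_imgs. If you need to produce a depth map yourself, the snippet below is a hedged illustration using the transformers depth-estimation pipeline; the checkpoint id depth-anything/Depth-Anything-V2-Small-hf is an assumption, and any Depth-Anything-V2 variant plays the same role.

from transformers import pipeline
from PIL import Image

# Illustrative only: obtain a depth map for an RGB image.
# The checkpoint id below is an assumption; pick any Depth-Anything-V2 variant.
depth_estimator = pipeline(task="depth-estimation",
                           model="depth-anything/Depth-Anything-V2-Small-hf")

image = Image.open("example.jpg").convert("RGB")
result = depth_estimator(image)

depth_pil = result["depth"]               # PIL image, useful for visualization
depth_tensor = result["predicted_depth"]  # raw depth tensor (1, H', W')
print(depth_tensor.shape)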

πŸ† Mini-Leaderboard

The mini-leaderboard below reports results for each sub-category together with the overall average.

Results on MSMU-Bench

| Model | Existence | Object Counting | Scale Est. | Grounding | Relative Position | Absolute Distance | Scale Comparison | Ref. Object Est. | Average |
|---|---|---|---|---|---|---|---|---|---|
| *Large Language Models (LLMs): Text only* | | | | | | | | | |
| GPT-4-Turbo | 12.76 | 5.21 | 13.51 | 12.64 | 24.84 | 7.50 | 36.79 | 12.04 | 15.66 |
| Qwen2.5 | 4.25 | 0.00 | 0.78 | 13.79 | 0.62 | 0.00 | 16.04 | 1.57 | 4.63 |
| DeepSeek-V3 | 0.00 | 5.24 | 1.54 | 6.90 | 10.56 | 0.00 | 25.47 | 5.24 | 7.39 |
| *Vision-Language Models (VLMs): Image + Text* | | | | | | | | | |
| GPT-4o | 44.68 | 41.67 | 3.86 | 27.59 | 67.08 | 20.00 | 54.72 | 2.09 | 32.28 |
| Gemini-2 | 38.30 | 43.75 | 23.94 | 19.54 | 54.66 | 12.50 | 69.81 | 18.85 | 35.17 |
| Qwen2.5-VL-72B | 59.57 | 35.42 | 1.54 | 13.79 | 57.76 | 2.50 | 66.04 | 9.95 | 30.82 |
| Qwen2.5-VL-32B | 29.79 | 41.67 | 10.81 | 18.39 | 60.25 | 2.50 | 46.23 | 10.99 | 27.59 |
| Qwen2.5-VL-7B | 12.76 | 4.17 | 0.00 | 1.15 | 1.24 | 0.00 | 5.66 | 0.52 | 3.19 |
| Intern-VL3-78B | 47.62 | 42.71 | 6.47 | 26.32 | 56.94 | 13.33 | 64.10 | 16.46 | 33.63 |
| Intern-VL3-8B | 36.17 | 41.67 | 4.63 | 18.39 | 60.25 | 2.50 | 49.06 | 8.38 | 28.54 |
| LLaVA-1.5-7B | 1.54 | 36.46 | 5.02 | 20.69 | 42.86 | 5.00 | 38.68 | 0.52 | 19.45 |
| *Depth-encoded VLMs: Image + Depth + Text* | | | | | | | | | |
| SpatialBot | 10.64 | 46.88 | 15.83 | 28.74 | 66.46 | 5.00 | 50.94 | 8.90 | 29.17 |
| SpatialRGPT | 10.64 | 36.46 | 20.08 | 17.24 | 60.25 | 15.00 | 62.26 | 9.95 | 28.98 |
| SD-VLM-8B | 87.23 | 47.92 | 51.35 | 42.53 | 75.16 | 40.00 | 55.66 | 46.07 | 56.31 |

Examples

Citation

BibTeX:

@inproceedings{chen2025sdvlm,
      title={SD-VLM: Spatial Measuring and Understanding with Depth-Encoded Vision-Language Models}, 
      author={Pingyi Chen and Yujing Lou and Shen Cao and Jinhui Guo and Lubin Fan and Yue Wu and Lin Yang and Lizhuang Ma and Jieping Ye},
      booktitle={NeurIPS},
      year={2025},
}