SD-VLM-8B
Homepage | Dataset | arXiv | GitHub
The SD-VLM architecture enhances a standard Vision-Language Model (VLM) with 3D spatial awareness through a minimal yet effective modification.
1. Base VLM: Utilizes the LLaVA-1.5-7B framework, consisting of a CLIP-ViT vision encoder, a Vicuna large language model (LLM), and a linear projector connecting them.
2. Depth Positional Encoding (DPE): The central innovation is the DPE module. It converts an input depth map (from an external estimator such as Depth-Anything-V2) into depth-aware embeddings (E_depth), which are added element-wise to the standard image features (E_image) from the vision encoder: E_fused = E_image + E_depth (see the sketch below).
This simple addition injects explicit 3D spatial priors into the model without altering the backbone architecture.
3. Training Approach: The model is efficiently fine-tuned on the MSMU spatial dataset for one epoch using LoRA, keeping the vision encoder frozen. This allows the LLM and projector to learn how to interpret the depth-enhanced visual features for quantitative reasoning.
In essence, SD-VLM's structure is defined by a streamlined integration: it upgrades a standard VLM to understand 3D space by fusing depth information into visual features through a parameter-free additive operation, all trained efficiently on targeted data.
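To make the additive fusion concrete, the short sketch below shows depth-aware embeddings being added to vision-encoder patch features. It is a minimal illustration under stated assumptions, not the released implementation: the module name DepthPositionalEncoding, the sinusoidal-plus-MLP design, the per-patch depth pooling, and all tensor shapes are placeholders chosen for the example.

import torch
import torch.nn as nn

class DepthPositionalEncoding(nn.Module):
    """Illustrative DPE-style module (names, shapes, and design are assumptions).

    Maps per-patch depth values to embeddings with the same width as the
    vision-encoder features so they can be added element-wise.
    """
    def __init__(self, hidden_dim: int, num_freqs: int = 64):
        super().__init__()
        # Fixed frequency bank over depth, followed by a small MLP that
        # projects to the vision feature width.
        self.freqs = nn.Parameter(torch.randn(num_freqs), requires_grad=False)
        self.proj = nn.Sequential(
            nn.Linear(2 * num_freqs, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, hidden_dim),
        )

    def forward(self, patch_depth: torch.Tensor) -> torch.Tensor:
        # patch_depth: (B, N) mean depth per image patch, e.g. pooled from a
        # Depth-Anything-V2 depth map down to the ViT patch grid.
        angles = patch_depth.unsqueeze(-1) * self.freqs          # (B, N, num_freqs)
        feats = torch.cat([angles.sin(), angles.cos()], dim=-1)  # (B, N, 2 * num_freqs)
        return self.proj(feats)                                  # (B, N, hidden_dim)

# Fusion is plain element-wise addition: E_fused = E_image + E_depth.
B, N, D = 1, 576, 1024                  # e.g. CLIP-ViT-L/14 at 336px -> 24x24 patches
e_image = torch.randn(B, N, D)          # patch features from the frozen vision encoder
patch_depth = torch.rand(B, N) * 10.0   # per-patch depth in meters (assumed units)

dpe = DepthPositionalEncoding(hidden_dim=D)
e_fused = e_image + dpe(patch_depth)    # depth-aware features passed on to the projector

Because the fusion is only an element-wise addition, the vision encoder and LLM interfaces stay untouched; during fine-tuning, the projector and the LoRA adapters on the LLM learn to interpret the depth-shifted features.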
Model Framework

Quick Start!
import copy
import os

import torch
from PIL import Image

from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path, tokenizer_image_token, process_images
from llava.constants import IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN
from llava.conversation import conv_templates

model_path = "cpystan/SD-VLM-7B"
tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path=model_path,
    model_base=None,
    model_name=get_model_name_from_path(model_path)
)

# Build the prompt with LLaVA's conversation template (the question is a placeholder).
question = "How far apart are the two chairs?"
conv = conv_templates["vicuna_v1"].copy()
conv.append_message(conv.roles[0], DEFAULT_IMAGE_TOKEN + "\n" + question)
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()
input_ids = tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors='pt').unsqueeze(0).cuda()

# Load the image and keep an unprocessed copy to pass via ori_imgs.
image_folder, image_file = "path/to/images", "example.jpg"  # set to your own image
image = Image.open(os.path.join(image_folder, image_file)).convert('RGB')
ori_img = copy.deepcopy(image)
image_tensor = process_images([image], image_processor, model.config)[0]

temperature = 0.2
with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        images=image_tensor.unsqueeze(0).half().to(input_ids.device),
        image_sizes=[image.size],
        do_sample=True if temperature > 0 else False,
        temperature=temperature,
        top_p=None,
        num_beams=1,
        ori_imgs=[ori_img],
        max_new_tokens=1024,
        use_cache=True,
    )
response = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
print(response)
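Compared with a vanilla LLaVA-1.5 call, the one non-standard argument is ori_imgs, which passes the unprocessed RGB image alongside the CLIP-preprocessed tensor; based on the architecture description above, this is presumably where the depth estimate feeding the DPE module is obtained.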
Mini-Leaderboard
The mini-leaderboard below reports results for each sub-category together with the overall average.
Results on MSMU-Bench
| Model | Existence | Object Counting | Scale Est. | Grounding | Relative Position | Absolute Distance | Scale Comparison | Ref. Object Est. | Average |
|---|---|---|---|---|---|---|---|---|---|
| Large Language Models (LLMs): Text only | |||||||||
| GPT-4-Turbo | 12.76 | 5.21 | 13.51 | 12.64 | 24.84 | 7.50 | 36.79 | 12.04 | 15.66 |
| Qwen2.5 | 4.25 | 0.00 | 0.78 | 13.79 | 0.62 | 0.00 | 16.04 | 1.57 | 4.63 |
| DeepSeek-V3 | 0.00 | 5.24 | 1.54 | 6.90 | 10.56 | 0.00 | 25.47 | 5.24 | 7.39 |
| Vision-Language Models (VLMs): Image + Text | |||||||||
| GPT-4o | 44.68 | 41.67 | 3.86 | 27.59 | 67.08 | 20.00 | 54.72 | 2.09 | 32.28 |
| Gemini-2 | 38.30 | 43.75 | 23.94 | 19.54 | 54.66 | 12.50 | 69.81 | 18.85 | 35.17 |
| Qwen2.5-VL-72B | 59.57 | 35.42 | 1.54 | 13.79 | 57.76 | 2.50 | 66.04 | 9.95 | 30.82 |
| Qwen2.5-VL-32B | 29.79 | 41.67 | 10.81 | 18.39 | 60.25 | 2.50 | 46.23 | 10.99 | 27.59 |
| Qwen2.5-VL-7B | 12.76 | 4.17 | 0.00 | 1.15 | 1.24 | 0.00 | 5.66 | 0.52 | 3.19 |
| Intern-VL3-78B | 47.62 | 42.71 | 6.47 | 26.32 | 56.94 | 13.33 | 64.10 | 16.46 | 33.63 |
| Intern-VL3-8B | 36.17 | 41.67 | 4.63 | 18.39 | 60.25 | 2.50 | 49.06 | 8.38 | 28.54 |
| LLaVA-1.5-7B | 1.54 | 36.46 | 5.02 | 20.69 | 42.86 | 5.00 | 38.68 | 0.52 | 19.45 |
| Depth-encoded VLMs: Image + Depth + Text | |||||||||
| SpatialBot | 10.64 | 46.88 | 15.83 | 28.74 | 66.46 | 5.00 | 50.94 | 8.90 | 29.17 |
| SpatialRGPT | 10.64 | 36.46 | 20.08 | 17.24 | 60.25 | 15.00 | 62.26 | 9.95 | 28.98 |
| SD-VLM-8B | 87.23 | 47.92 | 51.35 | 42.53 | 75.16 | 40.00 | 55.66 | 46.07 | 56.31 |
Examples

Citation
BibTeX:
@inproceedings{chen2025sdvlm,
title={SD-VLM: Spatial Measuring and Understanding with Depth-Encoded Vision-Language Models},
author={Pingyi Chen and Yujing Lou and Shen Cao and Jinhui Guo and Lubin Fan and Yue Wu and Lin Yang and Lizhuang Ma and Jieping Ye},
booktitle={NeurIPS},
year={2025},
}