Aww Man!

#1
by mtcl - opened

I thought it was already uploaded :)

QuantTrio org

These weight files are extremely large, and uploading them takes some time. Since the free private space is only 100 GB, I first created the repository and then started the upload process.

@JunHowie how big is the total repo once it finishes uploading? Looks like maybe 360 GB!? It's not the Lite version, right? Hmm, hoping it will fit in 4x96 GB RTX Blackwell.

I have 2x RTX 6000 Pro and 512 GB of DDR5 RAM. Would I be able to run this quant of the model on my hardware?

QuantTrio org
edited 6 days ago

@JunHowie how big is the total repo once it finishes uploading? Looks like maybe 360 GB!? It's not the Lite version, right? Hmm, hoping it will fit in 4x96 GB RTX Blackwell.

Worth a try. The total file size is about 335 GiB, so there is a chance that a 4x96 GB setup can handle it.

This repo is the "lite version". The model retains pretty good quality under a regular 4-bit quant (surprisingly), so we left it as is.
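For a rough sense of the headroom (back-of-the-envelope only, and GiB vs GB matters here):

```
# weights: ~335 GiB  ->  335 * 1.073741824 ≈ 359.7 GB
# VRAM:    4 x 96 GB =   384 GB
python3 -c "print('weights ~%.1f GB vs %d GB of VRAM' % (335*1.073741824, 4*96))"
```

That leaves roughly 24 GB total across the four cards for KV cache, activations and CUDA context, which is why it is only "worth a try".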

QuantTrio org

@mtcl @Fernanda24 Completed

@JunHowie downloading now :) Big thanks!! Will try to load it on the RTX Blackwells.

QuantTrio org

I have 2x RTX 6000 Pro and 512 GB of DDR5 RAM. Would I be able to run this quant of the model on my hardware?

I think that even if vLLM can offload correctly, the inference speed will still be quite slow.
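If anyone does want to try the offload route anyway, vLLM has a --cpu-offload-gb engine argument; a minimal sketch (the model path and offload size are placeholders, and I have not verified that offload works with this quant):

```
# placeholders throughout; offload support for this AWQ quant is not verified
vllm serve /path/to/QuantTrio/DeepSeek-V3.2-AWQ \
  --tensor-parallel-size 2 \
  --cpu-offload-gb 80 \
  --trust-remote-code
```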

@JunHowie I haven't even finished downloading this yet, but I'm still curious: are you onto Speciale next? 😄

QuantTrio org

@JunHowie I haven't even finished downloading this yet, but I'm still curious: are you onto Speciale next? 😄

Absolutely.

@JunHowie

I ran into some errors:

```
ValueError: No valid attention backend found for cuda with head_size: 576, dtype: torch.bfloat16, kv_cache_dtype: auto, block_size: 64, use_mla: True, has_sink: False, use_sparse: True. Reasons: {FLASH_ATTN_MLA: [sparse not supported, compute capability not supported, FlashAttention MLA not supported on this device], FLASHMLA: [sparse not supported, compute capability not supported, vllm._flashmla_C is not available, likely was not compiled due to insufficient nvcc version or a supported arch was not in the list of target arches to compile for.], FLASHINFER_MLA: [sparse not supported, compute capability not supported], TRITON_MLA: [sparse not supported], FLASHMLA_SPARSE: [compute capability not supported]}.
```

QuantTrio org

@JunHowie

I ran into some errors:

```
ValueError: No valid attention backend found for cuda with head_size: 576, dtype: torch.bfloat16, kv_cache_dtype: auto, block_size: 64, use_mla: True, has_sink: False, use_sparse: True. Reasons: {FLASH_ATTN_MLA: [sparse not supported, compute capability not supported, FlashAttention MLA not supported on this device], FLASHMLA: [sparse not supported, compute capability not supported, vllm._flashmla_C is not available, likely was not compiled due to insufficient nvcc version or a supported arch was not in the list of target arches to compile for.], FLASHINFER_MLA: [sparse not supported, compute capability not supported], TRITON_MLA: [sparse not supported], FLASHMLA_SPARSE: [compute capability not supported]}.
```

"vllm._flashmla_C is not available" how did you install this vllm env?

QuantTrio org

@Fernanda24 try launching vLLM with the following four variables and see if it works:

```
export VLLM_USE_DEEP_GEMM=0
export TORCH_ALLOW_TF32_CUBLAS_OVERRIDE=1
export VLLM_USE_FLASHINFER_MOE_FP16=1
export VLLM_USE_FLASHINFER_SAMPLER=0
```
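With those four exports set in the same shell, the launch itself is just the usual serve command; a sketch with placeholder model path and parallelism:

```
# the four exports above must be set in this shell; path and TP size are placeholders
vllm serve /path/to/QuantTrio/DeepSeek-V3.2-AWQ \
  --tensor-parallel-size 4 \
  --trust-remote-code
```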

@Fernanda24
Same settings might work here: "Support running FP4 DeepSeek on SM120", https://github.com/sgl-project/sglang/pull/11708

DeepGEMM itself doesn't support SM120, since they don't have RTX PRO 6000 cards (https://github.com/deepseek-ai/DeepGEMM/issues/236), so it needs a different backend.

So far, https://github.com/createthis/DeepGEMM/pull/1 is our only hope.
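A quick way to double-check what compute capability your stack actually reports (purely diagnostic, nothing model-specific):

```
# RTX PRO 6000 Blackwell should report 12.0 (SM120), which DeepGEMM does not target
nvidia-smi --query-gpu=name,compute_cap --format=csv
python3 -c "import torch; print(torch.cuda.get_device_capability())"
```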

@tclf90
I built vLLM from source using the existing-PyTorch install Python script. I had installed PyTorch 2.9.1 beforehand and I have CUDA 13.0; then I installed/checked my Triton and FlashInfer versions too. Let me see what I've got here:

vllm 0.11.2.dev422+g86e178f7c.d20251201.cu130
torch 2.9.1+cu130 (latest stable)
triton 3.5.1
flashinfer-python 0.5.2.dev20251203 (also tried 0.5.3, but they seem to have reverted the nightly dev back to 0.5.2, at least as of yesterday)

I installed from source because the regular "uv pip install -U vllm" that had worked for weeks broke yesterday and no longer works with CUDA 13, and pointing the install directly at the cu13 release on the vLLM GitHub also didn't work. In both cases vLLM would fail to start because it was looking for a CUDA 12 dependency.
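For reference, the build-against-existing-PyTorch flow looks roughly like this; file names move around between vLLM versions, so treat it as a sketch rather than the exact commands:

```
git clone https://github.com/vllm-project/vllm.git
cd vllm
python use_existing_torch.py            # keep the already-installed torch 2.9.1+cu130
pip install -r requirements/build.txt
pip install -e . --no-build-isolation   # compiles the CUDA kernels against the local CUDA 13.0 toolkit
```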

edit: getting closer with this sglang Docker image: lmsysorg/sglang:v0.5.6-cu129-amd64. I can get the model to load, but it crashes when I send a prompt... I'll do some more testing and see what kind of errors I get.

I can now load it in sglang, but I only get the first word back:

-H "Content-Type: application/json" \
-d '{
  "model": "/mnt/2king/models/QuantTrio/DeepSeek-V3.2-AWQ/",
  "messages": [
    {"role": "user", "content": "Hello, how are you?"}
  ],
  "temperature": 0.1,
  "max_tokens": 50
}' | jq .
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   667  100   486  100   181   1934    720 --:--:-- --:--:-- --:--:--  2657
{
  "id": "3dd9e6af4867419caf53304bc491bc85",
  "object": "chat.completion",
  "created": 1764823417,
  "model": "/mnt/2king/models/QuantTrio/DeepSeek-V3.2-AWQ/",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "I",
        "reasoning_content": null,
        "tool_calls": null
      },
      "logprobs": null,
      "finish_reason": "stop",
      "matched_stop": "NaN happened"
    }
  ],
  "usage": {
    "prompt_tokens": 12,
    "total_tokens": 14,
    "completion_tokens": 2,
    "prompt_tokens_details": null,
    "reasoning_tokens": 0
  },
  "metadata": {
    "weight_version": "default"
  }
}```
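For completeness, the full request had this shape; the host and port here are assumptions for a default sglang launch, only the body is from the actual run:

```
# assumed default sglang OpenAI-compatible endpoint; adjust host/port to your launch
curl -s http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/mnt/2king/models/QuantTrio/DeepSeek-V3.2-AWQ/",
    "messages": [{"role": "user", "content": "Hello, how are you?"}],
    "temperature": 0.1,
    "max_tokens": 50
  }' | jq .
```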

edit:

it's probably this stuff that's not RTX 6000 (SM120) ready:

```
[2025-12-04 05:58:35] WARNING server_args.py:951: DP attention is enabled for DeepSeek NSA.
[2025-12-04 05:58:35] WARNING server_args.py:968: NSA with TP mode is active, dp_size=1, tp_size=4, attn_tp_size=4, attention weights will be sharded across 4 ranks.
[2025-12-04 05:58:35] WARNING server_args.py:974: Setting page size to 64 for DeepSeek NSA.
[2025-12-04 05:58:35] WARNING server_args.py:982: Setting KV cache dtype to fp8_e4m3 for DeepSeek NSA.
[2025-12-04 05:58:35] WARNING server_args.py:996: Setting NSA backend to flashmla_auto for prefill and flashmla_kv for decode for FP8 KV Cache.
NSA_DUAL_STREAM=True NSA_FUSE_TOPK=True NSA_FLASHMLA_BACKEND_DECODE_COMPUTE_FP8=True NSA_QUANT_K_CACHE_FAST=True NSA_DEQUANT_K_CACHE_FAST=True
[2025-12-04 05:58:35] INFO awq.py:271: The model is convertible to awq_marlin during runtime. Using awq_marlin kernel.
```
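For context, the server inside that image was launched with something along these lines; the --tp 4 matches the log above, the rest of the flags are placeholders:

```
# inside lmsysorg/sglang:v0.5.6-cu129-amd64; flags other than --tp are placeholders
python3 -m sglang.launch_server \
  --model-path /mnt/2king/models/QuantTrio/DeepSeek-V3.2-AWQ/ \
  --tp 4 \
  --trust-remote-code \
  --port 30000
```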

@Fernanda24
int4 worked in the thread below: https://huggingface.co/Intel/DeepSeek-V3.1-Terminus-int4-mixed-AutoRound/discussions/1
```
[gptq_marlin.py:377] Using MarlinLinearKernel for GPTQMarlinLinearMethod
[cuda.py:411] Using TRITON_MLA attention backend out of potential backends: ['TRITON_MLA']
[layer.py:379] Enabled separate cuda stream for MoE shared_experts
```

maybe force this one to VLLM_ATTENTION_BACKEND=TRITON_MLA

@willfalco yes, on 0.12.0 I had to use VLLM_ATTENTION_BACKEND=TRITON_MLA 👍
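Concretely, forcing the backend just means setting the variable for the serving process; a minimal sketch (model path and TP size are placeholders):

```
# placeholders; only the env var is the point here
export VLLM_ATTENTION_BACKEND=TRITON_MLA
vllm serve /path/to/model --tensor-parallel-size 2 --trust-remote-code
```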

@willfalco @tclf90 @JunHowie

one step closer:

```
(Worker_TP3 pid=4471) INFO 12-05 03:09:26 [cuda.py:411] Using FLASHMLA_SPARSE attention backend out of potential backends: ['FLASHMLA_SPARSE']
(Worker_TP3 pid=4471) INFO 12-05 03:09:26 [la
```

using this AWQ with "a collection of hacks for flashmla sparse, deepgemm, and vllm to run deepseek v3.2 nvfp4 quant":
Docker: https://hub.docker.com/r/eous/vllm-sm120/tags
from https://huggingface.co/eousphoros/DeepSeek-V3.2-NVFP4/discussions/1
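If anyone else wants to try that image, the launch would be roughly as below; the tag, mount points and serve flags are all assumptions, so check the Docker Hub page for the real entrypoint:

```
# tag, mounts and serve flags are assumptions; see the Docker Hub page
docker run --rm --gpus all --ipc=host \
  -v /mnt/2king/models:/models \
  -p 8000:8000 \
  eous/vllm-sm120:latest \
  vllm serve /models/QuantTrio/DeepSeek-V3.2-AWQ --tensor-parallel-size 4 --trust-remote-code
```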

Loaded it with https://hub.docker.com/r/eous/vllm-sm120, but when I post some requests it raises "RuntimeError: Assertion error (...smxx_fp8_paged_mqa_logits.hpp:232): smem_size <= SM120ArchSpec::smem_capacity" and crashes.

Loaded it with https://hub.docker.com/r/eous/vllm-sm120, but when I post some requests it raises "RuntimeError: Assertion error (...smxx_fp8_paged_mqa_logits.hpp:232): smem_size <= SM120ArchSpec::smem_capacity" and crashes.

Yeah, he's working on it. I've been looking at it a lot too; there seem to be Python binding problems in the current iteration. A lot of moving parts.
