Are the F16 weights upcasted MXFP4? -- Why no `gpt-oss-20b-MXFP4.gguf`?

#34
by rtzurtz

Follow up question to https://huggingface.co/unsloth/gpt-oss-20b-GGUF/discussions/14 and https://huggingface.co/unsloth/gpt-oss-20b-GGUF/discussions/7:

Are the F16 weights maybe just upcasted MXFP4 ones? And if not, why does bartowski recommend using gpt-oss-20b-MXFP4.gguf (12.1 GB):

Use this one:
gpt-oss-20b-MXFP4.gguf
The reason is, the FFN (feed forward networks) of gpt-oss do not behave nicely when quantized to anything other than MXFP4, so they are kept at that level for everything.

Also, everyone in https://github.com/ggml-org/llama.cpp/discussions/15396 is testing only the gpt-oss-20b-MXFP4.gguf, and, as just one more example, lmstudio-community also only offers the gpt-oss-20b-MXFP4.gguf.
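
For reference, here is a minimal sketch of how one could check for oneself which tensors actually stay at MXFP4 in any of these files. It assumes llama.cpp's `gguf` Python package (`pip install gguf`) in a version recent enough to know about the MXFP4 tensor type, and the file name is just an example:

```python
# Minimal sketch: list the quantization type of every tensor in a GGUF file.
# Assumes the `gguf` package that ships with llama.cpp (pip install gguf);
# the file name below is just an example.
from collections import Counter
from gguf import GGUFReader

reader = GGUFReader("gpt-oss-20b-MXFP4.gguf")

# Count how many tensors use each quantization type (e.g. MXFP4, Q8_0, F16).
counts = Counter(t.tensor_type.name for t in reader.tensors)
for dtype, n in counts.most_common():
    print(f"{dtype:8s} {n:4d} tensors")

# Show the FFN expert tensors specifically, since those are the ones
# that are supposed to stay at MXFP4.
for t in reader.tensors:
    if "ffn" in t.name:
        print(t.name, t.tensor_type.name)
```

Running this against both the MXFP4 and the F16 file should make the per-tensor difference (or lack of one) between them visible.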

Yes, I think using only the MXFP4.gguf is the way to go with gpt-oss. Unsloth's GGUFs aren't really applicable to this model, AFAIK.
I think they made all their GGUFs for completeness' sake anyway. And perhaps the quantizations below Q4 also have value for people without enough VRAM. But if you can run Q4, it only makes sense to use the standard *-MXFP4.ggufs.

Unsloth AI org

The other MXFP4 GGUFs are actually quantized down to 8-bit, so they're not true 100% full precision. The F16 versions retain the model's full original precision. The difference shouldn't be much, but regardless, there is a difference between them.

Our MXFP4 versions (like the others) are actually the Q8 ones, while the true full precision is F16.
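
As a rough sanity check, a back-of-envelope bits-per-weight calculation from the 12.1 GB figure quoted above comes out consistent with "MXFP4 experts plus higher-precision everything else" (the ~21B total parameter count below is an assumption, not a value from the model card):

```python
# Back-of-envelope sketch: average bits per weight implied by the 12.1 GB
# MXFP4 GGUF quoted above. The ~21e9 total parameter count for gpt-oss-20b
# is an assumption here, and "GB" is taken as 10^9 bytes.
file_bytes = 12.1e9
total_params = 21e9  # assumed

bpw = file_bytes * 8 / total_params
print(f"~{bpw:.2f} bits per weight on average")
# Roughly 4.6 bpw: above MXFP4's ~4.25 bpw, which fits the picture of the FFN
# experts staying at MXFP4 while the remaining tensors sit at higher precision
# (Q8_0 or F16 depending on the file).
```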

@danielhanchen Thanks for the info. Maybe clarify this in the model card as well (not everyone is going to read the comments). (So this *-MXFP4.gguf that many quantize to is just the latest bit of hype, because quantizing to MXFP4 is not the same as QAT-ing to MXFP4; a bit of a wording game on my part, admittedly.)

Before creating my question I did some testing, and maybe the F16 was a bit better, but with such short testing I decided the result was inconclusive. Now that it's clarified, I may still revert to the full F16 weights.

Maybe this is partly the reason for my confusion:

The F32 quant is MXFP4 upcasted to BF16 for every single layer and is unquantized.

PS: Can't find the F32 quant (just a hint, not that I need one).
