Vocab size 32001 causes problems for quantisation

by TheBloke - opened Jun 29, 2023

Discussion

TheBloke

Jun 29, 2023 • edited Jun 30, 2023

Hi there

Thanks for the model, looks good.

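I'm doing quantisations which will be uploaded at:

https://huggingface.co/openbmb/TheBloke/UltraLM-13B-GGML

https://huggingface.co/openbmb/TheBloke/UltraLM-13B-GPTQ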

I just wanted to let you know that increasing the vocab size to 32001 breaks compatibility with the latest GGML quantisation methods, called k-quants.

This may be resolved some time in the future, but for now it means I can only release the older formats.
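To make the incompatibility concrete, here is a minimal sketch of the size check as I understand it from the linked llama.cpp issue. It assumes k-quants pack weights into super-blocks of 256 values and that both tensor dimensions must divide evenly. `fits_k_quant` is a hypothetical helper written for this illustration, not a llama.cpp function, and `5120` is the hidden size I'm assuming for a 13B Llama model.

```python
# Assumed k-quant super-block size (QK_K in llama.cpp at the time).
QK_K = 256


def fits_k_quant(n_embd: int, n_vocab: int, block: int = QK_K) -> bool:
    """Hypothetical check: True if both dimensions split evenly into super-blocks."""
    return n_embd % block == 0 and n_vocab % block == 0


# Compare the stock Llama vocab size with the one that has the extra PAD token.
for n_vocab in (32000, 32001):
    remainder = n_vocab % QK_K
    print(n_vocab, "remainder:", remainder, "ok:", fits_k_quant(5120, n_vocab))
```

With these assumptions, 32000 divides cleanly (125 super-blocks) while 32001 leaves a remainder of 1, which is why the single extra token is enough to fail the check.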

My understanding is that the extra 32001st token, PAD, was added as something of a hack very early in the history of open-source Llama models. I think one particular model creator used it only because they forgot to set up special_tokens_map.json correctly :) Since then it has stuck around, copied from model to model despite not being needed. Unfortunately WizardLM inherited it, for example, and a number of models have used their code since.

I'm starting a campaign to try and get it phased out, because it causes tons of problems for developers outside the sphere of Python inference.

Just thought I'd let you know for your next model - and also so I can point people to this post when they inevitably ask me why I've not released the latest GGML quantisation formats for your model :)

Thanks

PS. You can read about the issue with k-quants here: https://github.com/ggerganov/llama.cpp/issues/1919

stingning

OpenBMB org Jun 30, 2023

Hi there,

Thanks for the message! We will eliminate the token in the next models. Thanks again!

kirayz

Jul 3, 2023

Dumb question: Is there any possibility that we can manually eliminate the extra token (PAD)? If possible at all, what'd be some pointers that we can chase? Thank you! If not possible, what'd be the reasoning? Curious to know. Thanks for the great model!
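One direction this could take, sketched under assumptions that nobody in this thread has confirmed: if PAD is the last token id and never appears in real inputs, the token embedding and output matrices could each lose their final row, with the vocab size in the config and tokenizer files updated to match. The toy below uses plain Python lists in place of real weight tensors, and `drop_last_token` is a hypothetical helper, not part of any library.

```python
def drop_last_token(matrix, pad_id):
    """Return a copy of an [n_vocab][n_embd] matrix without the trailing PAD row.

    Refuses to trim unless PAD is the final row, since removing a row from the
    middle would shift every later token id.
    """
    if pad_id != len(matrix) - 1:
        raise ValueError("PAD must be the last token id for a simple trim")
    return [row[:] for row in matrix[:-1]]


# Toy stand-in for real weights: 5 ordinary tokens plus 1 PAD, embedding width 4.
n_vocab, n_embd = 6, 4
embed = [[float(i)] * n_embd for i in range(n_vocab)]

trimmed = drop_last_token(embed, pad_id=n_vocab - 1)
print(len(embed), "->", len(trimmed))
```

Whether a real checkpoint trimmed this way behaves identically would still need testing, for example by comparing outputs on prompts that contain no padding.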
