Vocab size 32001 causes problems for quantisation
Hi there
Thanks for the model, looks good.
I'm doing quantisations which will be uploaded at:
- https://huggingface.co/TheBloke/UltraLM-13B-GGML
- https://huggingface.co/TheBloke/UltraLM-13B-GPTQ
I just wanted to let you know that because you've increased the vocab size to 32001, this breaks compatibility with the latest GGML quantisation methods, called k-quant.
This may be resolved some time in the future, but for now it means I can only release the older formats.
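To make the incompatibility concrete, here is a minimal sketch. It assumes the problem is the one described in the llama.cpp issue linked in this post: k-quants pack weights into super-blocks of 256 values, so the tensor dimensions have to be multiples of 256. The constant and helper names are illustrative, not llama.cpp's API.

```python
# Assumption: k-quants in llama.cpp use super-blocks of 256 weights,
# so a tensor dimension must be a multiple of 256 to be quantised this way.
K_QUANT_BLOCK = 256


def k_quant_compatible(dim: int, block: int = K_QUANT_BLOCK) -> bool:
    """Return True if a tensor dimension splits evenly into k-quant super-blocks."""
    return dim % block == 0


for vocab_size in (32000, 32001):
    leftover = vocab_size % K_QUANT_BLOCK
    print(f"vocab={vocab_size} compatible={k_quant_compatible(vocab_size)} leftover={leftover}")
# vocab=32000 compatible=True leftover=0
# vocab=32001 compatible=False leftover=1
```

The original Llama vocab of 32000 is exactly 125 blocks of 256, which is why a single added token is enough to break things.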
My understanding is that the extra 32001st token, the PAD token, was added as something of a hack very early in the history of open source Llama models. I think one particular model creator used it only because they forgot to set up special_tokens_map.json correctly :) Since then it's stuck around, being copied from model to model, despite not being needed. Unfortunately WizardLM inherited it, for example, and a number of models have used their code since.
I'm starting a campaign to try and get it phased out, because it causes tons of problems for developers outside the sphere of Python inference.
Just thought I'd let you know for your next model - and also so I can point people to this post when they inevitably ask me why I've not released the latest GGML quantisation formats for your model :)
Thanks
PS. You can read about the issue with k-quants here: https://github.com/ggerganov/llama.cpp/issues/1919
Hi there,
Thanks for the message! We will eliminate the token in the next models. Thanks again!
Dumb question: Is there any possibility that we can manually eliminate the extra token (PAD)? If possible at all, what'd be some pointers that we can chase? Thank you! If not possible, what'd be the reasoning? Curious to know. Thanks for the great model!
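One possible approach, as a rough sketch rather than a tested recipe: trim the embedding matrices back to 32000 rows with Transformers' `resize_token_embeddings` and save the result. This assumes the only id at or above 32000 is the added PAD token and that nothing else depends on it. The function names and the `src`/`dst` arguments are hypothetical.

```python
def extra_token_ids(current_vocab: int, base_vocab: int = 32000) -> list[int]:
    """List the token ids that would be dropped when trimming to base_vocab rows."""
    return list(range(base_vocab, current_vocab))


def drop_extra_tokens(src: str, dst: str, base_vocab: int = 32000) -> list[int]:
    """Load a model, trim its embeddings to base_vocab rows, and save it to dst.

    Assumption: every id >= base_vocab is an added token (such as PAD)
    that the model does not need at inference time.
    """
    # Imported lazily so the helper above works without the heavy dependencies.
    from transformers import AutoModelForCausalLM

    model = AutoModelForCausalLM.from_pretrained(src, torch_dtype="auto")
    current_vocab = model.get_input_embeddings().weight.shape[0]
    dropped = extra_token_ids(current_vocab, base_vocab)
    if dropped:
        # Shrinks the input embeddings (and the output head, if untied)
        # and updates config.vocab_size to match.
        model.resize_token_embeddings(base_vocab)
    model.save_pretrained(dst)
    return dropped
```

Calling something like `drop_extra_tokens("openbmb/UltraLM-13b", "./UltraLM-13b-32000")` would only handle the weights. The tokenizer files would presumably need the PAD entry removed as well, with padding mapped to an existing token, and the trimmed model should be checked against the original before quantising.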