Instructions to use skyblanket/GLM-5-abliterated with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use skyblanket/GLM-5-abliterated with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="skyblanket/GLM-5-abliterated")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("skyblanket/GLM-5-abliterated")
model = AutoModelForCausalLM.from_pretrained("skyblanket/GLM-5-abliterated")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use skyblanket/GLM-5-abliterated with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "skyblanket/GLM-5-abliterated"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "skyblanket/GLM-5-abliterated",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
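
The same request can be sent from Python. This is a minimal sketch using only the standard library; `build_chat_request` and `send_chat_request` are hypothetical helper names, and it assumes the server started above is listening on `http://localhost:8000` and returns the usual OpenAI-style `choices` list.

```python
import json
import urllib.request


def build_chat_request(base_url, model, prompt):
    """Build a POST request for an OpenAI-compatible chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )


def send_chat_request(request, timeout=60):
    """Send the request and return the first reply text (assumes an OpenAI-style response)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.loads(response.read().decode("utf-8"))
    return body["choices"][0]["message"]["content"]


request = build_chat_request(
    "http://localhost:8000",
    "skyblanket/GLM-5-abliterated",
    "What is the capital of France?",
)
# With the server running, uncomment to send it:
# print(send_chat_request(request))
```

Building the request separately from sending it makes the payload easy to inspect before a server is available.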

Use Docker

docker model run hf.co/skyblanket/GLM-5-abliterated

SGLang

How to use skyblanket/GLM-5-abliterated with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "skyblanket/GLM-5-abliterated" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "skyblanket/GLM-5-abliterated",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "skyblanket/GLM-5-abliterated" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "skyblanket/GLM-5-abliterated",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use skyblanket/GLM-5-abliterated with Docker Model Runner:
```
docker model run hf.co/skyblanket/GLM-5-abliterated
```

Is this abliterated or derestricted?

by kabachuha - opened Feb 20

Discussion

kabachuha

Feb 20

Is this vanilla abliteration, or have you also applied norm preservation and the biprojection update?

The latter usually results in better quality.

skyblanket

Owner Feb 22

Vanilla, but it still has issues. Are you able to run inference? The ortho weight direction step is done.

kabachuha

Feb 22

My plan was to extract a LoRA from the difference of this model and the vanilla through SVD decomposition of the weight differences (example: mergekit LoRA extraction).
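
A minimal sketch of that extraction step for a single weight matrix, using NumPy rather than mergekit. The name `extract_lora` is hypothetical, and the toy "abliterated" weight assumes the edit projected one direction out of the output space, which is only an assumption about how this model was produced. Under that assumption the difference is rank 1, so a rank-1 LoRA reproduces it almost exactly.

```python
import numpy as np


def extract_lora(w_base, w_tuned, rank):
    """Approximate (w_tuned - w_base) as lora_b @ lora_a via truncated SVD."""
    delta = w_tuned - w_base
    u, s, vt = np.linalg.svd(delta, full_matrices=False)
    r = min(rank, s.shape[0])
    root = np.sqrt(s[:r])             # split singular values across both factors
    lora_b = u[:, :r] * root          # shape (out_features, r)
    lora_a = root[:, None] * vt[:r]   # shape (r, in_features)
    return lora_a, lora_b


rng = np.random.default_rng(0)
w_base = rng.standard_normal((64, 48))

# Toy stand-in for abliteration (assumption): remove one unit direction d
# from the output space, i.e. W' = W - d d^T W.
d = rng.standard_normal(64)
d /= np.linalg.norm(d)
w_tuned = w_base - np.outer(d, d) @ w_base

lora_a, lora_b = extract_lora(w_base, w_tuned, rank=1)
error = np.linalg.norm((w_tuned - w_base) - lora_b @ lora_a)
print(lora_a.shape, lora_b.shape, error)
```

Real LoRA files also carry a scaling factor (alpha); here the scale is simply folded into the two factors.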

This way it is possible to launch it coupled with Unsloth dynamic 2-bit quants in llama.cpp, as LoRAs can be converted to GGUF files. The problem is the huge disk space needed for the difference, and I cannot rent a server with that much disk space or delete half of my SSD.

Hmm, technically, if the weights in the shards perfectly correspond to the other shards, this extraction can be done in streaming fashion!

Download shard 1 -> download shard 1* -> subtract all shard 1 weights from the shard 1* weights -> extract a LoRA for each weight in the difference through SVD -> discard the downloaded shards -> repeat with the next shards until all are processed -> save the LoRA -> convert the LoRA to .gguf -> launch the 2-bit Unsloth quant with the LoRA -> test the model
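
The streaming loop above can be sketched as follows. This is only an illustration: `stream_extract` and `low_rank_factors` are hypothetical names, the `load_base` and `load_tuned` callbacks stand in for "download and open one shard", and the shards here are small in-memory dicts of NumPy arrays rather than real checkpoint files. It assumes, as the plan does, that shard N of both models holds the same tensor names.

```python
import numpy as np


def low_rank_factors(delta, rank):
    """Truncated SVD of a 2-D difference, returned as (A, B) with delta ~= B @ A."""
    u, s, vt = np.linalg.svd(delta, full_matrices=False)
    r = min(rank, s.shape[0])
    root = np.sqrt(s[:r])
    return root[:, None] * vt[:r], u[:, :r] * root


def stream_extract(shard_ids, load_base, load_tuned, rank):
    """Process one pair of shards at a time so only one pair is held at once."""
    lora = {}
    for shard_id in shard_ids:
        base = load_base(shard_id)      # stand-in for: download shard N
        tuned = load_tuned(shard_id)    # stand-in for: download shard N*
        if base.keys() != tuned.keys():
            raise ValueError(f"shard {shard_id}: tensor names do not match")
        for name in base:
            if base[name].ndim != 2:
                continue                # skip norms, biases and other non-matrices
            delta = tuned[name] - base[name]
            if not delta.any():
                continue                # weight was not modified
            lora[name] = low_rank_factors(delta, rank)
        del base, tuned                 # a real run would also delete the files here
    return lora


# Toy data: two shards, one direction d projected out of every 2-D weight.
rng = np.random.default_rng(0)
base_shards = {
    0: {"layer0.o_proj": rng.standard_normal((16, 8)), "layer0.norm": np.ones(16)},
    1: {"layer1.o_proj": rng.standard_normal((16, 8))},
}
d = rng.standard_normal(16)
d /= np.linalg.norm(d)
tuned_shards = {
    sid: {n: (w - np.outer(d, d) @ w) if w.ndim == 2 else w.copy() for n, w in shard.items()}
    for sid, shard in base_shards.items()
}

lora = stream_extract(base_shards, base_shards.__getitem__, tuned_shards.__getitem__, rank=1)
print(sorted(lora))
```

The name check up front is the cheap failsafe: if the two checkpoints are sharded differently, the loop stops instead of silently pairing the wrong tensors.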

Yeah, seems like a solid plan, though it may need some debugging and reliable failsafe coding 🥴
