Qwen3-4B-Instruct Uncensored
An uncensored version of Qwen3-4B-Instruct-2507 with most safety refusals removed via directional abliteration, while largely preserving the original model's capabilities (see Performance).
What is Abliteration?
Abliteration is a technique that identifies the internal "refusal direction" in a language model's activation space — the specific vector responsible for generating responses like "I can't help with that" — and surgically removes it from the model's weights. Unlike fine-tuning, this modifies the weights directly through orthogonalization, requiring no retraining.
The result is a model that answers most prompts the original would refuse, while retaining its core language capabilities.
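The orthogonalization step can be illustrated with a minimal sketch. It assumes a single weight matrix `W` whose outputs live in the same space as the refusal direction `r`, and computes `W' = W - r (r^T W)` for unit-length `r`. The toy matrix, the toy direction, and the helper name `orthogonalize` are illustrative assumptions, not values or code from this model.

```python
def orthogonalize(weight, direction):
    """Return W' = W - r (r^T W), where r is `direction` scaled to unit length.

    `weight` is a list of rows (d_out x d_in); `direction` has length d_out,
    i.e. it lives in the space the matrix writes into.
    """
    norm = sum(x * x for x in direction) ** 0.5
    r = [x / norm for x in direction]
    d_out, d_in = len(weight), len(weight[0])
    # r^T W: how strongly each input column writes along r.
    proj = [sum(r[i] * weight[i][j] for i in range(d_out)) for j in range(d_in)]
    # Subtract the rank-one component r (r^T W) from W.
    return [[weight[i][j] - r[i] * proj[j] for j in range(d_in)] for i in range(d_out)]


W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
refusal_direction = [0.0, 3.0, 4.0]  # toy vector, not a real refusal direction
W_ablated = orthogonalize(W, refusal_direction)

# After ablation, the matrix output has (numerically) no component along r.
residual = [
    sum(refusal_direction[i] * W_ablated[i][j] for i in range(3)) for j in range(2)
]
print(residual)
```

Because only the component along `r` is subtracted, rows orthogonal to the direction (here the first row) are left unchanged.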
Abliteration Parameters
| Parameter | Value |
|---|---|
| direction_index | 18.83 |
| attn.o_proj.max_weight | 1.42 |
| attn.o_proj.max_weight_position | 23.83 |
| attn.o_proj.min_weight | 1.38 |
| attn.o_proj.min_weight_distance | 17.62 |
| mlp.down_proj.max_weight | 1.18 |
| mlp.down_proj.max_weight_position | 27.92 |
| mlp.down_proj.min_weight | 0.58 |
| mlp.down_proj.min_weight_distance | 17.38 |
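The table does not say how these values are applied. One plausible reading, and it is only an assumption here, is that each component gets a per-layer ablation weight that peaks at `max_weight_position` and falls off linearly to `min_weight` at a distance of `min_weight_distance` layers, with layers beyond that distance left untouched. The function name `layer_weight` and the linear shape are hypothetical; the numbers are the `attn.o_proj` row values above and the 36-layer count from Model Details.

```python
def layer_weight(layer, max_weight, max_weight_position, min_weight, min_weight_distance):
    """Hypothetical piecewise-linear ablation weight for one layer index."""
    distance = abs(layer - max_weight_position)
    if distance > min_weight_distance:
        return 0.0  # assumed: layers outside the window are not modified
    t = distance / min_weight_distance  # 0 at the peak, 1 at the window edge
    return max_weight + t * (min_weight - max_weight)


NUM_LAYERS = 36
attn_profile = [layer_weight(i, 1.42, 23.83, 1.38, 17.62) for i in range(NUM_LAYERS)]

for i, w in enumerate(attn_profile):
    print(f"layer {i:2d}: weight {w:.3f}")
```

Under this reading, the fractional positions simply place the peak between two integer layers.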
Performance
| Metric | This Model | Original Model |
|---|---|---|
| KL Divergence | 0.0785 | 0 (by definition) |
| Refusals | 19/100 | 100/100 |
- A KL divergence of 0.0785 means the model's output distribution stays close to the original model's, which suggests little capability loss.
- 19/100 refusals means 81 of the 100 test prompts that the original model refused are now answered. Remaining refusals are typically on the most extreme edge cases.
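For readers unfamiliar with the metric, here is a small sketch of KL divergence between two next-token distributions, plus the refusal arithmetic from the table. The probabilities are toy values, and the direction `KL(original || modified)` is an assumption, since the card does not state how the figure was computed.

```python
import math


def kl_divergence(p, q):
    """KL(P || Q) in nats for two discrete distributions over the same tokens."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


original = [0.70, 0.20, 0.10]  # toy next-token probabilities
modified = [0.65, 0.23, 0.12]

# A distribution compared with itself gives exactly zero,
# which is why the original model's entry is 0 by definition.
print(kl_divergence(original, original))
print(kl_divergence(original, modified))

# Refusal counts from the table: 19 of 100 test prompts are still refused.
refused_after, total = 19, 100
answered_share = (total - refused_after) / total
print(f"{answered_share:.0%} of previously refused prompts are answered")
```

Small values mean the two distributions assign similar probabilities to each token; the measure is not symmetric, so swapping the arguments generally changes the result.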
Model Details
- Base Model: Qwen3-4B-Instruct-2507
- Parameters: 4.0B (3.6B non-embedding)
- Layers: 36
- Context Length: 262,144 tokens
- Architecture: Dense transformer with GQA (32 Q-heads, 8 KV-heads)
- Mode: Non-thinking only (no <think> blocks generated)
Quickstart
Using Transformers
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "n0ctyx/Qwen3-4B-Instruct-Uncensored"

# Load the tokenizer and model (device_map="auto" requires the accelerate package)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
)

# Build the chat-formatted prompt
messages = [
    {"role": "user", "content": "Your prompt here"}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Sampling must be enabled for temperature, top_p, and top_k to take effect
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=16384,
    do_sample=True,
    temperature=0.7,
    top_p=0.8,
    top_k=20,
)

# Drop the prompt tokens and decode only the newly generated ones
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()
content = tokenizer.decode(output_ids, skip_special_tokens=True)
print(content)
Using vLLM
vllm serve n0ctyx/Qwen3-4B-Instruct-Uncensored --max-model-len 32768
Then query the OpenAI-compatible API:
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "n0ctyx/Qwen3-4B-Instruct-Uncensored",
    "messages": [{"role": "user", "content": "Hello!"}],
    "temperature": 0.7,
    "top_p": 0.8
  }'
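The same request can be sent from Python using only the standard library. This is a sketch that assumes the vLLM server started above is listening on `localhost:8000`; the helper names `build_request` and `chat` are illustrative, and the `MODEL` string must match whatever name was passed to `vllm serve`.

```python
import json
import urllib.request

# Assumption: the vLLM server is running locally on its default port.
BASE_URL = "http://localhost:8000/v1/chat/completions"
MODEL = "n0ctyx/Qwen3-4B-Instruct-Uncensored"  # must match the served model name


def build_request(prompt, temperature=0.7, top_p=0.8):
    """Build a POST request carrying an OpenAI-style chat completion payload."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
        "top_p": top_p,
    }
    return urllib.request.Request(
        BASE_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )


def chat(prompt):
    """Send the prompt to the server and return the assistant's reply text."""
    with urllib.request.urlopen(build_request(prompt)) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(chat("Hello!"))` prints the model's reply.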
Using Ollama
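# Create a Modelfile. FROM must name a model Ollama can resolve
# (an Ollama registry name or a local GGUF file); adjust it if needed.
echo 'FROM n0ctyx/Qwen3-4B-Instruct-Uncensored' > Modelfile
ollama create qwen3-uncensored -f Modelfile
ollama run qwen3-uncensored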
Using llama.cpp
Download the GGUF version (if available) and run:
./llama-cli -m qwen3-4b-uncensored.gguf -p "Your prompt here" -n 512
Recommended Settings
| Parameter | Value |
|---|---|
| Temperature | 0.7 |
| Top-P | 0.8 |
| Top-K | 20 |
| Min-P | 0 |
| Max Output Tokens | 16,384 |
| Repetition Penalty | 1.0 – 1.05 |
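One way to keep these settings in a single place is a plain dictionary whose keys follow the keyword names of Transformers' `generate`. The sketch below is an illustration: `do_sample` is an added assumption (sampling has to be on for the other values to matter), and the helper `with_overrides` is hypothetical.

```python
# Values come from the table above; key names follow transformers' `generate`.
RECOMMENDED_SETTINGS = {
    "do_sample": True,          # assumption: needed so the sampling values apply
    "temperature": 0.7,
    "top_p": 0.8,
    "top_k": 20,
    "min_p": 0.0,
    "max_new_tokens": 16384,
    "repetition_penalty": 1.0,  # the table allows anything from 1.0 to 1.05
}


def with_overrides(**overrides):
    """Return a copy of the recommended settings with selected values replaced."""
    unknown = set(overrides) - set(RECOMMENDED_SETTINGS)
    if unknown:
        raise KeyError(f"unknown setting(s): {sorted(unknown)}")
    return {**RECOMMENDED_SETTINGS, **overrides}


short_reply = with_overrides(max_new_tokens=256, repetition_penalty=1.05)
print(short_reply)
```

The resulting dictionary can be unpacked into a generation call, for example `model.generate(**model_inputs, **short_reply)`.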
Use Cases
- Creative writing — fiction, roleplay, character dialogue without content restrictions
- Research — red-teaming, safety analysis, adversarial testing
- Dataset generation — generating synthetic training data for fine-tuning
- Unfiltered assistance — direct answers without hedging or refusals
Limitations
- Still refuses 19 of the 100 test prompts, typically the most extreme ones
- May occasionally produce inaccurate or hallucinated content (same as base model)
- 4B parameter model — for complex reasoning tasks, consider larger variants
- Uncensored does not mean infallible — use responsibly
Disclaimer
This model has had its safety alignment removed. It may generate harmful, offensive, or factually incorrect content. The creator is not responsible for any misuse. Use at your own risk and in compliance with applicable laws and regulations.
Acknowledgments
- Alibaba Qwen Team for the base Qwen3-4B-Instruct-2507 model
- Arditi et al. for the foundational research on refusal directions in LLMs
- Built using directional abliteration with parameter optimization based on TPE (Tree-structured Parzen Estimator)