Instructions to use nvidia/Nemotron-Cascade-14B-Thinking with libraries and local apps.

Libraries

How to use nvidia/Nemotron-Cascade-14B-Thinking with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="nvidia/Nemotron-Cascade-14B-Thinking")
messages = [
    {"role": "user", "content": "Who are you?"},
]
# Print the result so it is visible when run as a script, not only in a notebook
print(pipe(messages))

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("nvidia/Nemotron-Cascade-14B-Thinking")
model = AutoModelForCausalLM.from_pretrained("nvidia/Nemotron-Cascade-14B-Thinking")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
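
The slice in the final line keeps only the newly generated tokens: for a causal language model, `generate` returns the prompt tokens followed by the continuation, so everything up to the prompt length is dropped before decoding. A minimal sketch of that index arithmetic, using plain Python lists and made-up token IDs in place of tensors (the helper name `strip_prompt` is hypothetical, not part of Transformers):

```python
def strip_prompt(output_ids, prompt_length):
    """Return only the tokens generated after the prompt.

    Mirrors outputs[0][inputs["input_ids"].shape[-1]:] from the snippet above,
    with plain lists standing in for tensors.
    """
    return output_ids[prompt_length:]


# Made-up token IDs, for illustration only.
prompt_ids = [101, 7592, 2088]
generated_ids = prompt_ids + [2023, 2003, 102]  # prompt followed by new tokens

new_tokens = strip_prompt(generated_ids, len(prompt_ids))
print(new_tokens)  # [2023, 2003, 102]
```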

Local Apps

vLLM

How to use nvidia/Nemotron-Cascade-14B-Thinking with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "nvidia/Nemotron-Cascade-14B-Thinking"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "nvidia/Nemotron-Cascade-14B-Thinking",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
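
The same request can be sent from Python. The sketch below uses only the standard library and assumes the server started above is listening on `localhost:8000`; the helper names `build_chat_request` and `chat` and the default timeout value are illustrative choices, not part of vLLM. The response is read using the OpenAI chat-completions layout (`choices[0].message.content`).

```python
import json
import urllib.request


def build_chat_request(base_url, model, user_message):
    """Build a POST request for an OpenAI-compatible /v1/chat/completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url, model, user_message, timeout=600):
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(base_url, model, user_message)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Example call (needs the server to be running):
# print(chat("http://localhost:8000",
#            "nvidia/Nemotron-Cascade-14B-Thinking",
#            "What is the capital of France?"))
```

Passing the base URL as a parameter means the same helper should also work against any other OpenAI-compatible server by changing the host and port.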

Use Docker

docker model run hf.co/nvidia/Nemotron-Cascade-14B-Thinking

SGLang

How to use nvidia/Nemotron-Cascade-14B-Thinking with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "nvidia/Nemotron-Cascade-14B-Thinking" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "nvidia/Nemotron-Cascade-14B-Thinking",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "nvidia/Nemotron-Cascade-14B-Thinking" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "nvidia/Nemotron-Cascade-14B-Thinking",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use nvidia/Nemotron-Cascade-14B-Thinking with Docker Model Runner:
```
docker model run hf.co/nvidia/Nemotron-Cascade-14B-Thinking
```

Question about LiveCodeBench evaluation setup and code RL

by s2580 - opened Dec 26, 2025

Discussion

s2580

Dec 26, 2025

In the report, the LiveCodeBench score for IF-RL is listed as 65.9%, but when we run the evaluation ourselves we can only reproduce around 37.7%. Could you please share the exact evaluation configuration used for the reported number, such as the timeout (per test or per problem)?
In addition, could you share the prompt setup used for evaluation? If possible, could you also open-source the evaluation code (or provide a script, config, or command) so the results can be reproduced reliably?

Also, I noticed the code RL training data has not been released yet. Is there any plan to release (or partially release) the code RL training dataset?

zhuoliny

NVIDIA org Jan 6

Hi @s2580, I believe everything needed to reproduce our eval results can be found here: https://huggingface.co/nvidia/Nemotron-Cascade-14B-Thinking/blob/main/evaluation/README.md. Could you please check it carefully? Everything you asked about (prompt, eval config, scripts, command) is in this subfolder.

Regarding the release of the Code-RL dataset: it contains data that we purchased internally from official competitive programming (CP) platforms. We are working on releasing part of the data in the future.
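
For readers comparing their own numbers against the reported ones, the per-test versus per-problem timeout distinction raised in the question can be sketched as follows. This is not the evaluation harness from the linked README; it is a minimal illustration, with a hypothetical `run_tests` helper and a toy solution, of how the same program can pass under one timeout policy and fail under the other.

```python
import subprocess
import sys
import time


def run_tests(solution_code, tests, per_test_timeout=None, per_problem_timeout=None):
    """Return True if the solution passes every (stdin, expected_stdout) pair in time."""
    start = time.monotonic()
    for stdin_text, expected in tests:
        limit = per_test_timeout
        if per_problem_timeout is not None:
            remaining = per_problem_timeout - (time.monotonic() - start)
            if remaining <= 0:
                return False  # budget for the whole problem is exhausted
            limit = remaining if limit is None else min(limit, remaining)
        try:
            result = subprocess.run(
                [sys.executable, "-c", solution_code],
                input=stdin_text,
                capture_output=True,
                text=True,
                timeout=limit,
            )
        except subprocess.TimeoutExpired:
            return False
        if result.returncode != 0 or result.stdout.strip() != expected.strip():
            return False
    return True


# Toy solution: doubles its input after a short artificial delay.
solution = "import time; time.sleep(0.5); print(int(input()) * 2)"
tests = [("1\n", "2"), ("2\n", "4"), ("3\n", "6")]

# Each test gets its own 5 s limit.
passes_per_test = run_tests(solution, tests, per_test_timeout=5.0)
# All three tests share a single 1 s budget.
passes_per_problem = run_tests(solution, tests, per_problem_timeout=1.0)
print(passes_per_test, passes_per_problem)  # True False
```

Because the three runs take at least 1.5 s in total, the shared 1 s budget fails a solution that the per-test limit accepts, which is one way a timeout setting alone can move a measured pass rate.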
