Instructions to use adamjen/Devstral-Small-2-24B-Opus-Reasoning with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use adamjen/Devstral-Small-2-24B-Opus-Reasoning with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="adamjen/Devstral-Small-2-24B-Opus-Reasoning",
	filename="Devstral-Small-2-24B-Opus-Reasoning.Q4_K_M.gguf",
)

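response = llm.create_chat_completion(
	messages = [
		{
			"role": "user",
			"content": "What is the capital of France?"
		}
	]
)

# Print only the assistant's reply text:
print(response["choices"][0]["message"]["content"])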

Local Apps

llama.cpp

How to use adamjen/Devstral-Small-2-24B-Opus-Reasoning with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf adamjen/Devstral-Small-2-24B-Opus-Reasoning:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf adamjen/Devstral-Small-2-24B-Opus-Reasoning:Q4_K_M

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf adamjen/Devstral-Small-2-24B-Opus-Reasoning:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf adamjen/Devstral-Small-2-24B-Opus-Reasoning:Q4_K_M

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf adamjen/Devstral-Small-2-24B-Opus-Reasoning:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf adamjen/Devstral-Small-2-24B-Opus-Reasoning:Q4_K_M

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf adamjen/Devstral-Small-2-24B-Opus-Reasoning:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf adamjen/Devstral-Small-2-24B-Opus-Reasoning:Q4_K_M

Use Docker

docker model run hf.co/adamjen/Devstral-Small-2-24B-Opus-Reasoning:Q4_K_M

vLLM

How to use adamjen/Devstral-Small-2-24B-Opus-Reasoning with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "adamjen/Devstral-Small-2-24B-Opus-Reasoning"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "adamjen/Devstral-Small-2-24B-Opus-Reasoning",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/adamjen/Devstral-Small-2-24B-Opus-Reasoning:Q4_K_M

Ollama
How to use adamjen/Devstral-Small-2-24B-Opus-Reasoning with Ollama:
```
ollama run hf.co/adamjen/Devstral-Small-2-24B-Opus-Reasoning:Q4_K_M
```

Unsloth Studio

How to use adamjen/Devstral-Small-2-24B-Opus-Reasoning with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for adamjen/Devstral-Small-2-24B-Opus-Reasoning to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for adamjen/Devstral-Small-2-24B-Opus-Reasoning to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for adamjen/Devstral-Small-2-24B-Opus-Reasoning to start chatting

Pi

How to use adamjen/Devstral-Small-2-24B-Opus-Reasoning with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf adamjen/Devstral-Small-2-24B-Opus-Reasoning:Q4_K_M

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "adamjen/Devstral-Small-2-24B-Opus-Reasoning:Q4_K_M"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use adamjen/Devstral-Small-2-24B-Opus-Reasoning with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf adamjen/Devstral-Small-2-24B-Opus-Reasoning:Q4_K_M

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default adamjen/Devstral-Small-2-24B-Opus-Reasoning:Q4_K_M

Run Hermes

hermes

Docker Model Runner
How to use adamjen/Devstral-Small-2-24B-Opus-Reasoning with Docker Model Runner:
```
docker model run hf.co/adamjen/Devstral-Small-2-24B-Opus-Reasoning:Q4_K_M
```

Lemonade

How to use adamjen/Devstral-Small-2-24B-Opus-Reasoning with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull adamjen/Devstral-Small-2-24B-Opus-Reasoning:Q4_K_M

Run and chat with the model

lemonade run user.Devstral-Small-2-24B-Opus-Reasoning-Q4_K_M

List all available models

lemonade list

How to use it with llama-server?

by mancub - opened Mar 24

Discussion

mancub

Mar 24

I'm not having much luck running your quant through llama-server in VS Code. I keep getting this back from the model:

Mistral 7B is a large language model created by Mistral AI.\n\nI'm trained to be secure, harmless and honest.\n\nMistral AI is a cutting-edge AI lab that trains models with outstanding performance on various benchmarks. I am their first published model.\n\nYou can learn more about me on my website: https://mistral.ai.\n\nNow, how can I help you?\n\n

Adding --chat-template mistral, or using your provided template with --chat-template-file chat_template.jinja, makes no difference either.

Any thoughts?
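
One way to narrow this down is to query the server directly, outside the editor, and see whether the same canned reply comes back. The sketch below uses only the Python standard library. It assumes llama-server is listening on port 8080 (the port used in the Pi and Hermes configs on this page); the `model` value "local" is a placeholder.

```python
import json
import urllib.request

# Assumption: llama-server is reachable here; change the port if you passed --port.
BASE_URL = "http://localhost:8080/v1"


def build_chat_request(prompt, base_url=BASE_URL, model="local"):
    """Build a POST request for the OpenAI-compatible chat endpoint."""
    payload = {
        "model": model,  # placeholder name, not taken from the thread
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt):
    """Send the prompt to the running server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Usage, with the server running:
# print(ask("What is the capital of France?"))
```

If the direct call returns a sensible answer, the problem is more likely in what the editor extension sends than in the quant or the template.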

adamjen

Owner Mar 25

I'm using llama-swap (I guess it's the same thing):

devstral-opus-q5:
  cmd: >
    /media/adam/ubuntu_d/Apps/llama.cpp/build/bin/llama-server
    -m "/media/adam/ubuntu_d/unsloth/Devstral-Small-2-24B-textonly_gguf/Devstral-Small-2-24B-Opus-Reasoning.Q5_K_M.gguf"
    --alias devstral-opus-q5
    --host 0.0.0.0 --port ${PORT}
    --ctx-size 64000
    --slot-save-path /media/adam/ubuntu_d/Apps/llama-swap/kv_cache/
    -ngl 99
    -fa on
    --parallel 1
    --batch-size 4096
    --ubatch-size 2048
    -ctk q4_0 -ctv q4_0
    --temp 0.6 --top-p 0.95 --top-k 40 --min-p 0.05 --repeat-penalty 1.15
    --defrag-thold 0.1
    --cache-reuse 256
    --chat-template-file /media/adam/ubuntu_d/unsloth/Devstral-Small-2-24B-textonly_gguf/chat_template.jinja
  proxy: http://127.0.0.1:${PORT}

adamjen

Owner Mar 25

This is the Jinja template. If it's not working, ask an AI to help you adapt it to your use case.

{#- Default system message if no system prompt is passed. #}
{%- set default_system_message = 'Think carefully step by step inside tags before giving your answer.' %}

{#- Begin of sequence token. #}
{{- bos_token }}

{#- Handle system prompt if it exists. #}
{%- if messages[0]['role'] == 'system' %}
    {{- '[SYSTEM_PROMPT]' -}}
    {%- if messages[0]['content'] is string %}
        {{- messages[0]['content'] + '\n' + default_system_message -}}
    {%- else %}
        {%- for block in messages[0]['content'] %}
            {%- if block['type'] == 'text' %}
                {{- block['text'] }}
            {%- endif %}
        {%- endfor %}
        {{- '\n' + default_system_message -}}
    {%- endif %}
    {{- '[/SYSTEM_PROMPT]' -}}
    {%- set loop_messages = messages[1:] %}
{%- else %}
    {%- set loop_messages = messages %}
    {%- if default_system_message != '' %}
        {{- '[SYSTEM_PROMPT]' + default_system_message + '[/SYSTEM_PROMPT]' }}
    {%- endif %}
{%- endif %}

{#- Tools definition #}
{%- set tools_definition = '' %}
{%- set has_tools = false %}
{%- if tools is defined and tools is not none and tools|length > 0 %}
    {%- set has_tools = true %}
    {%- set tools_definition = '[AVAILABLE_TOOLS]' + (tools| tojson) + '[/AVAILABLE_TOOLS]' %}
    {{- tools_definition }}
{%- endif %}
{%- endif %}

{#- [MODIFIED] Validation block removed to prevent 500 errors -#}

{#- Handle conversation messages. #}
{%- for message in loop_messages %}

{#- User messages supports text content or text and image chunks. #}
{%- if message['role'] == 'user' %}
    {%- if message['content'] is string %}
        {{- '[INST]' + message['content'] + '[/INST]' }}
    {%- elif message['content'] | length > 0 %}
        {{- '[INST]' }}
        {%- if message['content'] | length == 2 %}
            {%- set blocks = message['content'] | sort(attribute='type') %}
        {%- else %}
            {%- set blocks = message['content'] %}
        {%- endif %}
        {%- for block in blocks %}
            {%- if block['type'] == 'text' %}
                {{- block['text'] }}
            {%- elif block['type'] in ['image', 'image_url'] %}
                {{- '[IMG]' }}
            {%- endif %}
        {%- endfor %}
        {{- '[/INST]' }}
    {%- endif %}

{#- Assistant messages supports text content or text and image chunks. #}
{%- elif message['role'] == 'assistant' %}
    {%- if message['content'] is string %}
        {{- message['content'] }}
    {%- elif message['content'] | length > 0 %}
        {%- for block in message['content'] %}
            {%- if block['type'] == 'text' %}
                {{- block['text'] }}
            {%- endif %}
        {%- endfor %}
    {%- endif %}

    {%- if message['tool_calls'] is defined and message['tool_calls'] is not none and message['tool_calls']|length > 0 %}
        {%- for tool in message['tool_calls'] %}
            {%- set arguments = tool['function']['arguments'] %}
            {%- if arguments is not string %}
                {%- set arguments = arguments|tojson|safe %}
            {%- elif arguments == '' %}
                {%- set arguments = '{}' %}
            {%- endif %}
            {{- '[TOOL_CALLS]' + tool['function']['name'] + '[ARGS]' + arguments }}
        {%- endfor %}
    {%- endif %}

    {#- End of sequence token for each assistant messages. #}
    {{- eos_token }}

{#- Tool messages only supports text content. #}
{%- elif message['role'] == 'tool' %}
    {{- '[TOOL_RESULTS]' + message['content']|string + '[/TOOL_RESULTS]' }}

{%- endif %}

{%- endfor %}
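
To see what prompt string this template should produce, the following Python sketch mirrors its logic for the simple case where every message's content is a plain string and no tools are passed. It is an approximation, not the template itself: the `bos_token` and `eos_token` defaults (`<s>` and `</s>`) are assumptions, and list-style content blocks, images, and tool calls are left out.

```python
DEFAULT_SYSTEM_MESSAGE = (
    "Think carefully step by step inside tags before giving your answer."
)


def render_prompt(messages, bos_token="<s>", eos_token="</s>"):
    """Approximate the template's output for plain-string messages only."""
    parts = [bos_token]

    # A leading system message is merged with the default system message.
    if messages and messages[0]["role"] == "system":
        parts.append(
            "[SYSTEM_PROMPT]" + messages[0]["content"] + "\n"
            + DEFAULT_SYSTEM_MESSAGE + "[/SYSTEM_PROMPT]"
        )
        messages = messages[1:]
    else:
        parts.append(
            "[SYSTEM_PROMPT]" + DEFAULT_SYSTEM_MESSAGE + "[/SYSTEM_PROMPT]"
        )

    for message in messages:
        if message["role"] == "user":
            parts.append("[INST]" + message["content"] + "[/INST]")
        elif message["role"] == "assistant":
            # Each assistant turn is closed with the end-of-sequence token.
            parts.append(message["content"] + eos_token)
        elif message["role"] == "tool":
            parts.append(
                "[TOOL_RESULTS]" + str(message["content"]) + "[/TOOL_RESULTS]"
            )
    return "".join(parts)


print(render_prompt([{"role": "user", "content": "What is the capital of France?"}]))
```

Comparing this string with the prompt the server actually builds (for example from verbose server logs) is one way to check whether the template file is being picked up.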

mancub

Mar 25

Thanks for posting your chat template and the startup command line!

I asked Qwen3.5-35B to fix up the template a bit. I'm not sure whether it really made it better 😄:

{#- Default system message if no system prompt is passed. #}
{%- set default_system_message = 'Think carefully step by step inside tags before giving your answer.' %}

{#- Begin of sequence token. #}
{{- bos_token }}

{#- Handle system prompt if it exists. #}
{%- if messages[0]['role'] == 'system' %}
    {{- '[SYSTEM_PROMPT]' }}
    {%- if messages[0]['content'] is string %}
        {{- messages[0]['content'] + '\n' + default_system_message -}}
    {%- else %}
        {%- for block in messages[0]['content'] %}
            {%- if block['type'] == 'text' %}
                {{- block['text'] }}
            {%- endif %}
        {%- endfor %}
        {{- '\n' + default_system_message -}}
    {%- endif %}
    {{- '[/SYSTEM_PROMPT]' }}
    {%- set loop_messages = messages[1:] %}
{%- else %}
    {%- set loop_messages = messages %}
    {%- if default_system_message != '' %}
        {{- '[SYSTEM_PROMPT]' + default_system_message + '[/SYSTEM_PROMPT]' }}
    {%- endif %}
{%- endif %}

{#- Tools definition #}
{%- set tools_definition = '' %}
{%- if tools is defined and tools is not none and tools|length > 0 %}
    {{- '[AVAILABLE_TOOLS]' }}
    {{- (tools|tojson)|string }}
    {{- '[/AVAILABLE_TOOLS]' }}
{%- endif %}

{#- Handle conversation messages. #}
{%- for message in loop_messages %}
    {#- User messages supports text content or text and image chunks. #}
    {%- if message['role'] == 'user' %}
        {{- '[INST]' }}
        {%- if message['content'] is string %}
            {{- message['content'] }}
        {%- elif message['content'] is iterable and message['content']|length > 0 %}
            {%- for block in message['content'] %}
                {%- if block['type'] == 'text' %}
                    {{- block['text'] }}
                {%- elif block['type'] in ['image', 'image_url', 'image_data'] %}
                    {{- '[IMG]' }}
                {%- endif %}
            {%- endfor %}
        {%- endif %}
        {{- '[/INST]' }}

    {#- Assistant messages supports text content or text and image chunks. #}
    {%- elif message['role'] == 'assistant' %}
        {%- if message['content'] is string %}
            {{- message['content'] }}
        {%- elif message['content'] is iterable and message['content']|length > 0 %}
            {%- for block in message['content'] %}
                {%- if block['type'] == 'text' %}
                    {{- block['text'] }}
                {%- endif %}
            {%- endfor %}
        {%- endif %}

        {#- Handle Tool Calls #}
        {%- if message.get('tool_calls') %}
            {%- for tool in message['tool_calls'] %}
                {%- set function_name = tool['function']['name'] %}
                {%- set arguments = tool['function']['arguments'] %}
                
                {#- Ensure arguments are a valid string for JSON #}
                {%- if arguments is not string %}
                    {%- set arguments = arguments|tojson %}
                {%- elif arguments == '' %}
                    {%- set arguments = '{}' %}
                {%- endif %}
                
                {{- '[TOOL_CALLS]' + function_name + '[ARGS]' + arguments }}
            {%- endfor %}
        {%- endif %}

        {#- End of sequence token for each assistant message. #}
        {{- eos_token }}

    {#- Tool messages only supports text content. #}
    {%- elif message['role'] == 'tool' %}
        {{- '[TOOL_RESULTS]' + message['content']|string + '[/TOOL_RESULTS]' }}
    {%- endif %}
{%- endfor %}
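
The tool-call branch is the part of the template most likely to trip up clients, so here is a small Python sketch of the same rule: arguments that are not already a string are serialized to JSON, an empty string becomes `{}`, and the result is appended after `[TOOL_CALLS]name[ARGS]`. The function name `get_weather` is hypothetical, and `json.dumps` may differ from Jinja's `tojson` filter in key order and escaping.

```python
import json


def format_tool_call(name, arguments):
    """Mirror the template's handling of one assistant tool call."""
    if not isinstance(arguments, str):
        # Dicts and lists are serialized, like `arguments|tojson` in the template.
        arguments = json.dumps(arguments)
    elif arguments == "":
        # An empty string is replaced by an empty JSON object.
        arguments = "{}"
    return "[TOOL_CALLS]" + name + "[ARGS]" + arguments


# Hypothetical tool call, for illustration only.
print(format_tool_call("get_weather", {"city": "Paris"}))
print(format_tool_call("get_weather", ""))
```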

I'm curious, though: you are quantizing the KV cache to q4_0 (-ctk q4_0 -ctv q4_0) even though you are already loading a Q5 quant. Is this to work around the limits of your GPU, or something else? And if so, why not use your other Q4 quant instead?
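
For a sense of how much memory the cache type can change, here is a back-of-the-envelope Python sketch. The block sizes are those of ggml's f16 (2 bytes per element) and q4_0 (18 bytes per 32 elements) types. The architecture numbers (40 layers, 8 KV heads, head dimension 128) are assumptions for illustration and are not confirmed anywhere in this thread.

```python
# (bytes per block, elements per block) for two ggml cache types.
CACHE_TYPES = {"f16": (2, 1), "q4_0": (18, 32)}


def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, k_type, v_type):
    """Estimate KV cache size in bytes for a single sequence."""
    # Number of elements stored for K (and, separately, for V).
    elements = n_ctx * n_layers * n_kv_heads * head_dim
    total = 0
    for cache_type in (k_type, v_type):
        block_bytes, block_size = CACHE_TYPES[cache_type]
        total += elements * block_bytes // block_size
    return total


# Assumed architecture values, for illustration only.
SHAPE = dict(n_layers=40, n_kv_heads=8, head_dim=128)

for cache_type in ("f16", "q4_0"):
    size = kv_cache_bytes(64000, k_type=cache_type, v_type=cache_type, **SHAPE)
    print(f"{cache_type}: {size / 2**30:.2f} GiB")
```

Under these assumed values, a 64000-token context comes to roughly 9.8 GiB at f16 versus about 2.7 GiB at q4_0, which would be a plausible reason to quantize the cache on a memory-limited GPU.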

Another question concerns the attention K and V tensors. In both the Q4 and Q5 versions, V is left at Q6 while K is quantized further down. I've read in a lot of places that K is more sensitive to quantization than V and should be kept at higher precision if possible. Maybe swapping them, so that K is at Q6 and V is at Q4 or Q5, would yield better results (there should be no change in overall model size)?

Also, why such a high temperature (0.6)? Do you need the model to be more creative? The suggested temperature is 0.15.
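
As a rough illustration of what the two temperature settings do, the Python sketch below applies a temperature-scaled softmax to a made-up set of logits. The logit values are invented for the example; real samplers also apply top-k, top-p, and min-p filtering, which this sketch ignores.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits for three candidate tokens.
logits = [2.0, 1.0, 0.5]

for temperature in (0.15, 0.6):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

At the lower temperature nearly all probability mass sits on the top token, while at 0.6 the alternatives keep a noticeable share, which is why a higher value gives more varied output.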

Here is how I'm testing it using ik_llama.cpp:

# Flush the page cache before loading the model:
sync && echo 3 | sudo tee /proc/sys/vm/drop_caches
free -h
export CUDA_VISIBLE_DEVICES=0,1
export GGML_CUDA_GRAPH_OPT=1
./build/bin/llama-server -ngl 99 -t 1 -c 131072 -sm graph -muge -ger -smf32 --max-gpu 2 --main-gpu 0 \
  --model "models/adamjen_Devstral-Small-2-24B-Opus-Reasoning/Devstral-Small-2-24B-Opus-Reasoning.Q5_K_M.gguf" \
  --jinja -np 1 --host 0.0.0.0 --port 8081 --api-key 12345 --alias "devstral-small-2" \
  --temp 0.15 --top-p 0.95 --top-k 40 --min-p 0.01 \
  --flash-attn on -cuda fa-offset=0 --seed 3407 \
  --batch-size 4096 --ubatch-size 2048 --no-mmap \
  --reasoning-tokens none --chat-template-kwargs '{"enable_thinking": false}' \
  --chat-template-file "models/adamjen_Devstral-Small-2-24B-Opus-Reasoning/chat_template.jinja"
