How to use it with llama-server?
I'm not having much luck running your quant through llama-server into VS Code. I keep getting this back from the model:
Mistral 7B is a large language model created by Mistral AI.\n\nI'm trained to be secure, harmless and honest.\n\nMistral AI is a cutting-edge AI lab that trains models with outstanding performance on various benchmarks. I am their first published model.\n\nYou can learn more about me on my website: https://mistral.ai.\n\nNow, how can I help you?\n\n
Adding --chat-template mistral, or using your provided template via --chat-template-file chat_template.jinja, does not make any difference either.
Any thoughts?
I'm using llama-swap (I guess it's the same thing):
devstral-opus-q5:
  cmd: >
    /media/adam/ubuntu_d/Apps/llama.cpp/build/bin/llama-server
    -m "/media/adam/ubuntu_d/unsloth/Devstral-Small-2-24B-textonly_gguf/Devstral-Small-2-24B-Opus-Reasoning.Q5_K_M.gguf"
    --alias devstral-opus-q5
    --host 0.0.0.0 --port ${PORT}
    --ctx-size 64000
    --slot-save-path /media/adam/ubuntu_d/Apps/llama-swap/kv_cache/
    -ngl 99
    -fa on
    --parallel 1
    --batch-size 4096
    --ubatch-size 2048
    -ctk q4_0 -ctv q4_0
    --temp 0.6 --top-p 0.95 --top-k 40 --min-p 0.05 --repeat-penalty 1.15
    --defrag-thold 0.1
    --cache-reuse 256
    --chat-template-file /media/adam/ubuntu_d/unsloth/Devstral-Small-2-24B-textonly_gguf/chat_template.jinja
  proxy: http://127.0.0.1:${PORT}
This is the Jinja template. If it's not working, ask an AI to help you adapt it for your use case.
{#- Default system message if no system prompt is passed. #}
{%- set default_system_message = 'Think carefully step by step inside tags before giving your answer.' %}
{#- Begin of sequence token. #}
{{- bos_token }}
{#- Handle system prompt if it exists. #}
{%- if messages[0]['role'] == 'system' %}
    {{- '[SYSTEM_PROMPT]' -}}
    {%- if messages[0]['content'] is string %}
        {{- messages[0]['content'] + '\n' + default_system_message -}}
    {%- else %}
        {%- for block in messages[0]['content'] %}
            {%- if block['type'] == 'text' %}
                {{- block['text'] }}
            {%- endif %}
        {%- endfor %}
        {{- '\n' + default_system_message -}}
    {%- endif %}
    {{- '[/SYSTEM_PROMPT]' -}}
    {%- set loop_messages = messages[1:] %}
{%- else %}
    {%- set loop_messages = messages %}
    {%- if default_system_message != '' %}
        {{- '[SYSTEM_PROMPT]' + default_system_message + '[/SYSTEM_PROMPT]' }}
    {%- endif %}
{%- endif %}
{#- Tools definition #}
{%- set tools_definition = '' %}
{%- set has_tools = false %}
{%- if tools is defined and tools is not none and tools|length > 0 %}
    {%- set has_tools = true %}
    {%- set tools_definition = '[AVAILABLE_TOOLS]' + (tools| tojson) + '[/AVAILABLE_TOOLS]' %}
    {{- tools_definition }}
{%- endif %}
{#- [MODIFIED] Validation block removed to prevent 500 errors -#}
{#- Handle conversation messages. #}
{%- for message in loop_messages %}
    {#- User messages supports text content or text and image chunks. #}
    {%- if message['role'] == 'user' %}
        {%- if message['content'] is string %}
            {{- '[INST]' + message['content'] + '[/INST]' }}
        {%- elif message['content'] | length > 0 %}
            {{- '[INST]' }}
            {%- if message['content'] | length == 2 %}
                {%- set blocks = message['content'] | sort(attribute='type') %}
            {%- else %}
                {%- set blocks = message['content'] %}
            {%- endif %}
            {%- for block in blocks %}
                {%- if block['type'] == 'text' %}
                    {{- block['text'] }}
                {%- elif block['type'] in ['image', 'image_url'] %}
                    {{- '[IMG]' }}
                {%- endif %}
            {%- endfor %}
            {{- '[/INST]' }}
        {%- endif %}
    {#- Assistant messages supports text content or text and image chunks. #}
    {%- elif message['role'] == 'assistant' %}
        {%- if message['content'] is string %}
            {{- message['content'] }}
        {%- elif message['content'] | length > 0 %}
            {%- for block in message['content'] %}
                {%- if block['type'] == 'text' %}
                    {{- block['text'] }}
                {%- endif %}
            {%- endfor %}
        {%- endif %}
        {%- if message['tool_calls'] is defined and message['tool_calls'] is not none and message['tool_calls']|length > 0 %}
            {%- for tool in message['tool_calls'] %}
                {%- set arguments = tool['function']['arguments'] %}
                {%- if arguments is not string %}
                    {%- set arguments = arguments|tojson|safe %}
                {%- elif arguments == '' %}
                    {%- set arguments = '{}' %}
                {%- endif %}
                {{- '[TOOL_CALLS]' + tool['function']['name'] + '[ARGS]' + arguments }}
            {%- endfor %}
        {%- endif %}
        {#- End of sequence token for each assistant messages. #}
        {{- eos_token }}
    {#- Tool messages only supports text content. #}
    {%- elif message['role'] == 'tool' %}
        {{- '[TOOL_RESULTS]' + message['content']|string + '[/TOOL_RESULTS]' }}
    {%- endif %}
{%- endfor %}
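For anyone debugging what prompt this template should produce, here is a rough plain-Python sketch that mirrors only its string-content branches (no tools, no image blocks). The `render_prompt` helper is hypothetical, and the `<s>` / `</s>` defaults for `bos_token` and `eos_token` are assumptions; the real values come from the GGUF metadata.

```python
# Sketch only: mirrors the string-content paths of the template above.
# The "<s>" / "</s>" defaults are assumed, not read from the model.
DEFAULT_SYSTEM = "Think carefully step by step inside tags before giving your answer."


def render_prompt(messages, bos_token="<s>", eos_token="</s>",
                  default_system=DEFAULT_SYSTEM):
    """Build the prompt string for a list of {'role', 'content'} dicts."""
    parts = [bos_token]
    if messages and messages[0]["role"] == "system":
        # A user-supplied system prompt gets the default message appended.
        parts.append("[SYSTEM_PROMPT]" + messages[0]["content"] + "\n"
                     + default_system + "[/SYSTEM_PROMPT]")
        rest = messages[1:]
    else:
        rest = messages
        if default_system:
            parts.append("[SYSTEM_PROMPT]" + default_system + "[/SYSTEM_PROMPT]")
    for message in rest:
        role = message["role"]
        if role == "user":
            parts.append("[INST]" + message["content"] + "[/INST]")
        elif role == "assistant":
            # Each assistant turn is closed with the end-of-sequence token.
            parts.append(message["content"] + eos_token)
        elif role == "tool":
            parts.append("[TOOL_RESULTS]" + str(message["content"])
                         + "[/TOOL_RESULTS]")
    return "".join(parts)


print(render_prompt([{"role": "user", "content": "Who are you?"}]))
```

If the prompt the server actually builds looks different from this (for example, no [SYSTEM_PROMPT] block at all), the template file is probably not being picked up.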
Thanks for posting your chat template and the startup command line!
I asked Qwen3.5-35B to fix up the template a bit. I'm not sure if it really made it better or not:
{#- Default system message if no system prompt is passed. #}
{%- set default_system_message = 'Think carefully step by step inside tags before giving your answer.' %}
{#- Begin of sequence token. #}
{{- bos_token }}
{#- Handle system prompt if it exists. #}
{%- if messages[0]['role'] == 'system' %}
    {{- '[SYSTEM_PROMPT]' }}
    {%- if messages[0]['content'] is string %}
        {{- messages[0]['content'] + '\n' + default_system_message -}}
    {%- else %}
        {%- for block in messages[0]['content'] %}
            {%- if block['type'] == 'text' %}
                {{- block['text'] }}
            {%- endif %}
        {%- endfor %}
        {{- '\n' + default_system_message -}}
    {%- endif %}
    {{- '[/SYSTEM_PROMPT]' }}
    {%- set loop_messages = messages[1:] %}
{%- else %}
    {%- set loop_messages = messages %}
    {%- if default_system_message != '' %}
        {{- '[SYSTEM_PROMPT]' + default_system_message + '[/SYSTEM_PROMPT]' }}
    {%- endif %}
{%- endif %}
{#- Tools definition #}
{%- set tools_definition = '' %}
{%- if tools is defined and tools is not none and tools|length > 0 %}
    {{- '[AVAILABLE_TOOLS]' }}
    {{- (tools|tojson)|string }}
    {{- '[/AVAILABLE_TOOLS]' }}
{%- endif %}
{#- Handle conversation messages. #}
{%- for message in loop_messages %}
    {#- User messages supports text content or text and image chunks. #}
    {%- if message['role'] == 'user' %}
        {{- '[INST]' }}
        {%- if message['content'] is string %}
            {{- message['content'] }}
        {%- elif message['content'] is iterable and message['content']|length > 0 %}
            {%- for block in message['content'] %}
                {%- if block['type'] == 'text' %}
                    {{- block['text'] }}
                {%- elif block['type'] in ['image', 'image_url', 'image_data'] %}
                    {{- '[IMG]' }}
                {%- endif %}
            {%- endfor %}
        {%- endif %}
        {{- '[/INST]' }}
    {#- Assistant messages supports text content or text and image chunks. #}
    {%- elif message['role'] == 'assistant' %}
        {%- if message['content'] is string %}
            {{- message['content'] }}
        {%- elif message['content'] is iterable and message['content']|length > 0 %}
            {%- for block in message['content'] %}
                {%- if block['type'] == 'text' %}
                    {{- block['text'] }}
                {%- endif %}
            {%- endfor %}
        {%- endif %}
        {#- Handle Tool Calls #}
        {%- if message.get('tool_calls') %}
            {%- for tool in message['tool_calls'] %}
                {%- set function_name = tool['function']['name'] %}
                {%- set arguments = tool['function']['arguments'] %}
                {#- Ensure arguments are a valid string for JSON #}
                {%- if arguments is not string %}
                    {%- set arguments = arguments|tojson %}
                {%- elif arguments == '' %}
                    {%- set arguments = '{}' %}
                {%- endif %}
                {{- '[TOOL_CALLS]' + function_name + '[ARGS]' + arguments }}
            {%- endfor %}
        {%- endif %}
        {#- End of sequence token for each assistant message. #}
        {{- eos_token }}
    {#- Tool messages only supports text content. #}
    {%- elif message['role'] == 'tool' %}
        {{- '[TOOL_RESULTS]' + message['content']|string + '[/TOOL_RESULTS]' }}
    {%- endif %}
{%- endfor %}
I'm curious though: you are quantizing your KV cache even further (-ctk q4_0 -ctv q4_0) despite already loading a Q5 quant. Is this to stay within the limits of your GPU, or something else? And why not use your other Q4 quant then?
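To put rough numbers on the memory side of that question, here is a back-of-the-envelope sketch of KV cache size at the 64000-token context from the config above. The layer count, KV head count, and head dimension below are placeholder assumptions, not values read from this model; substitute the real ones from the GGUF metadata. The q4_0 cost of 18 bytes per block of 32 values (16 bytes of 4-bit quants plus a 2-byte scale) is the standard ggml layout.

```python
# Rough KV cache size estimate. Architecture numbers are ASSUMED placeholders.
F16_BYTES = 2.0       # bytes per element for an f16 cache
Q4_0_BYTES = 18 / 32  # q4_0: 18 bytes per block of 32 values


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_element):
    """Bytes needed to hold K and V for every layer at full context."""
    elements = 2 * n_layers * n_kv_heads * head_dim * n_ctx  # 2 = K and V
    return elements * bytes_per_element


# Hypothetical architecture: 40 layers, 8 KV heads, head dimension 128.
for name, width in (("f16", F16_BYTES), ("q4_0", Q4_0_BYTES)):
    size = kv_cache_bytes(40, 8, 128, 64000, width)
    print(f"{name}: {size / 2**30:.2f} GiB")
```

Under those assumed dimensions the q4_0 cache is about 28% of the f16 size, which is the kind of saving that can decide whether a long context fits on the GPU.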
Another question concerns the K and V layers: in both the Q4 and Q5 versions, V is left at Q6 while K is quantized down. I've read in a lot of places that K is more sensitive to quantization than V and should be kept at higher precision if possible. Maybe swapping them, so that K is at Q6 and V is at Q4 or Q5, would yield better results (there should be no change in overall model size)?
Also, why such a high temperature of 0.6? Do you need it to be more creative? (The suggested temperature is 0.15.)
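For a feel of what the gap between 0.15 and 0.6 means, here is a small sketch of temperature-scaled softmax. The three logits are made up for illustration and do not come from the model.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
for temperature in (0.15, 0.6):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

At 0.15 nearly all of the probability lands on the top token, so sampling is close to greedy; at 0.6 the runner-up tokens keep a noticeable share. The other samplers in the command line (top-p, top-k, min-p) then trim that distribution further.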
Here is how I'm testing it using ik_llama.cpp:
sync && echo 3 | sudo tee /proc/sys/vm/drop_caches
free -h
export CUDA_VISIBLE_DEVICES=0,1
export GGML_CUDA_GRAPH_OPT=1
./build/bin/llama-server \
  -ngl 99 \
  -t 1 \
  -c 131072 \
  -sm graph \
  -muge \
  -ger \
  -smf32 \
  --max-gpu 2 \
  --main-gpu 0 \
  --model "models/adamjen_Devstral-Small-2-24B-Opus-Reasoning/Devstral-Small-2-24B-Opus-Reasoning.Q5_K_M.gguf" \
  --jinja \
  -np 1 \
  --host 0.0.0.0 \
  --port 8081 \
  --api-key 12345 \
  --alias "devstral-small-2" \
  --temp 0.15 --top-p 0.95 --top-k 40 --min-p 0.01 \
  --flash-attn on \
  -cuda fa-offset=0 \
  --seed 3407 \
  --batch-size 4096 \
  --ubatch-size 2048 \
  --no-mmap \
  --reasoning-tokens none \
  --chat-template-kwargs '{"enable_thinking": false}' \
  --chat-template-file "models/adamjen_Devstral-Small-2-24B-Opus-Reasoning/chat_template.jinja"
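To take VS Code out of the loop, a quick check straight against the server can help. Below is a minimal sketch using only the Python standard library. It assumes the server from the command above is reachable on 127.0.0.1:8081 with API key 12345 and alias devstral-small-2, and that it serves the OpenAI-compatible /v1/chat/completions route; `build_request` and `ask` are hypothetical helper names.

```python
import json
import urllib.request


def build_request(base_url, api_key, model, user_text):
    """Prepare a POST to the OpenAI-compatible chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
    }
    return urllib.request.Request(
        base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer " + api_key,
        },
        method="POST",
    )


def ask(request, timeout=120):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Host, port, key, and alias are taken from the command line above.
request = build_request("http://127.0.0.1:8081", "12345",
                        "devstral-small-2", "Who are you?")
# print(ask(request))  # uncomment once the server is running
```

If the reply here is still the generic "Mistral 7B" text, the problem is on the server or template side rather than in the editor integration.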