Instructions to use goasty/Qwen2.5-1.5B-Instruct_GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use goasty/Qwen2.5-1.5B-Instruct_GGUF with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="goasty/Qwen2.5-1.5B-Instruct_GGUF",
	filename="Qwen2.5-1.5B-Instruct_F16.gguf",
)

llm.create_chat_completion(
	messages = "No input example has been defined for this model task."
)

Notebooks
Google Colab
Kaggle
Local Apps Settings

llama.cpp

How to use goasty/Qwen2.5-1.5B-Instruct_GGUF with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf goasty/Qwen2.5-1.5B-Instruct_GGUF:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf goasty/Qwen2.5-1.5B-Instruct_GGUF:Q4_K_M

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf goasty/Qwen2.5-1.5B-Instruct_GGUF:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf goasty/Qwen2.5-1.5B-Instruct_GGUF:Q4_K_M

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf goasty/Qwen2.5-1.5B-Instruct_GGUF:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf goasty/Qwen2.5-1.5B-Instruct_GGUF:Q4_K_M

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf goasty/Qwen2.5-1.5B-Instruct_GGUF:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf goasty/Qwen2.5-1.5B-Instruct_GGUF:Q4_K_M

Use Docker

docker model run hf.co/goasty/Qwen2.5-1.5B-Instruct_GGUF:Q4_K_M

LM Studio
Jan
Ollama
How to use goasty/Qwen2.5-1.5B-Instruct_GGUF with Ollama:
```
ollama run hf.co/goasty/Qwen2.5-1.5B-Instruct_GGUF:Q4_K_M
```

Unsloth Studio

How to use goasty/Qwen2.5-1.5B-Instruct_GGUF with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for goasty/Qwen2.5-1.5B-Instruct_GGUF to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for goasty/Qwen2.5-1.5B-Instruct_GGUF to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for goasty/Qwen2.5-1.5B-Instruct_GGUF to start chatting

How to use goasty/Qwen2.5-1.5B-Instruct_GGUF with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf goasty/Qwen2.5-1.5B-Instruct_GGUF:Q4_K_M

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "goasty/Qwen2.5-1.5B-Instruct_GGUF:Q4_K_M"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent new

How to use goasty/Qwen2.5-1.5B-Instruct_GGUF with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf goasty/Qwen2.5-1.5B-Instruct_GGUF:Q4_K_M

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default goasty/Qwen2.5-1.5B-Instruct_GGUF:Q4_K_M

Run Hermes

hermes

Docker Model Runner
How to use goasty/Qwen2.5-1.5B-Instruct_GGUF with Docker Model Runner:
```
docker model run hf.co/goasty/Qwen2.5-1.5B-Instruct_GGUF:Q4_K_M
```

Lemonade

How to use goasty/Qwen2.5-1.5B-Instruct_GGUF with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull goasty/Qwen2.5-1.5B-Instruct_GGUF:Q4_K_M

Run and chat with the model

lemonade run user.Qwen2.5-1.5B-Instruct_GGUF-Q4_K_M

List all available models

lemonade list

Qwen2.5-1.5B-Instruct — Quantized GGUF Models

This repository provides GGUF-quantized variants of Qwen2.5-1.5B-Instruct, optimized for efficient inference across a wide range of hardware — from modern GPUs to low-memory CPUs and edge devices.

The goal of these quantizations is to significantly reduce memory and compute requirements while retaining the strong instruction-following and reasoning behavior of the base model.

Model Summary

Base Model: Qwen2.5-1.5B-Instruct
Architecture: Decoder-only Transformer
Parameter Count: ~1.5B
Modalities: Text
Context Length: Up to 32K tokens (backend dependent)
Developer: Qwen Team (Alibaba Cloud)
License: Apache-2.0
Languages: Multilingual (English, Chinese, others)

Available Quantizations

Multiple GGUF quantization levels are provided to support different performance, memory, and accuracy requirements.

Q2_K (2-bit)

Extremely small memory footprint
Enables inference on very constrained devices
Suitable for experimentation or ultra-low-resource environments
Significant quality degradation compared to higher bit-rates

Q3_K_M (3-bit)

Slightly improved quality over Q2_K
Still very lightweight and fast
Reasoning and instruction accuracy noticeably reduced
Best for basic conversational or lightweight tasks

Q4_K_M (4-bit)

Strong efficiency-to-quality ratio
Works well on CPUs and low-VRAM GPUs
Suitable for general chat and instruction tasks
Moderate quality loss in complex reasoning

Q5_K_M (5-bit)

Good balance between size and output quality
Retains most instruction-following capabilities
Recommended default for local usage

Q6_K (6-bit)

Higher fidelity responses
Increased memory usage compared to 5-bit
Better suited for reasoning-heavy prompts

Q8_0 (8-bit)

Near FP16-level quality
Largest quantized variant
Best choice when memory allows and accuracy is critical

Actual performance depends on inference backend, context length, sampling parameters, and prompt complexity.

Why Use Quantized Qwen2.5?

Efficient instruction-following with low latency
Capable reasoning even at reduced precision
Runs entirely offline
Scales from laptops to edge devices
Flexible deployment via GGUF-compatible runtimes

These models are ideal for local assistants, offline chat applications, research, and resource-constrained environments.

Usage Example

llama.cpp (GGUF)

./llama-cli \
  -m qwen2.5-1.5b-instruct-q5_k_m.gguf \
  -p "Explain the difference between supervised and unsupervised learning." \
  -n 256 \
  -c 8192

Recommended Settings

Prefer Q5_K_M or higher for reasoning tasks
Use lower bit-rates (Q2_K, Q3_K_M) only when memory is extremely limited
Temperature range: 0.6 – 0.8 for balanced outputs

Training Data (Base Model)

The original Qwen2.5-1.5B-Instruct model was trained and fine-tuned on a diverse mixture of:

Instruction-following datasets
Multilingual general-knowledge corpora
Reasoning-focused synthetic data
Conversational and task-oriented examples

Quantization applies numerical compression only and does not alter training data or model behavior intentionally.

Recommended Applications

Offline AI assistants
Local chat and analysis tools
Educational experimentation
CPU-only or low-VRAM environments
Embedded and edge deployments

Known Limitations

Lower bit-rate models may hallucinate more frequently
Q2_K and Q3_K_M are not suitable for complex reasoning
Not intended for safety-critical or high-risk decision making

Always validate performance on your specific workload.

Acknowledgements

Qwen Team for releasing the Qwen2.5 model family
The llama.cpp community for GGUF tooling and inference support
Open-source contributors enabling efficient local LLM deployment

Contact

For issues related to quantization files or deployment guidance, please open an issue in this repository.

Downloads last month: 34

GGUF

Model size

2B params

Architecture

qwen2

Hardware compatibility

2-bit

3-bit

4-bit

5-bit

6-bit

8-bit

16-bit

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for goasty/Qwen2.5-1.5B-Instruct_GGUF

Base model

Qwen/Qwen2.5-1.5B

Finetuned

Qwen/Qwen2.5-1.5B-Instruct

Quantized

(198)

this model