Qwen3-Embedding-0.6B-Int8
This version of Qwen3-Embedding-0.6B has been converted to run on the Axera NPU using w8a16 quantization.
Compatible with Pulsar2 version: 4.1
Conversion tool links:
If you are interested in model conversion, you can try exporting the axmodel from the original repo: https://huggingface.co/Qwen/Qwen3-Embedding-0.6B
Pulsar2 Link: How to Convert LLM from Huggingface to axmodel
Support Platform
- AX650
- AX650N DEMO Board
- M4N-Dock(爱芯派Pro)
- M.2 Accelerator card
Per-subgraph latency:
- g1: 5.561 ms
- g2: 9.140 ms
- g3: 12.757 ms
- g4: 16.446 ms
- g5: 21.392 ms
- g6: 23.712 ms
- g7: 27.174 ms
- g8: 30.897 ms
- g9: 34.829 ms
- Shortest forward latency: 5.561 ms
- Longest forward latency: 181.908 ms
- LayerNum: 28
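The shortest and longest figures follow directly from the per-subgraph numbers above: the shortest forward pass runs only g1, and the longest appears to be all nine subgraphs back to back. A quick sanity check:

```python
# Per-subgraph latencies (ms), copied from the list above.
subgraph_ms = [5.561, 9.140, 12.757, 16.446, 21.392, 23.712, 27.174, 30.897, 34.829]

shortest = min(subgraph_ms)            # a single-subgraph forward pass
longest = round(sum(subgraph_ms), 3)   # all nine subgraphs run back to back

print(shortest)  # 5.561
print(longest)   # 181.908
```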
| Chip | TTFT | w8a16 throughput |
|---|---|---|
| AX650 | 155.708 ms (128 tokens, shortest) | 0.82 tokens/sec |
| AX650 | 5093.42 ms (1024 tokens, longest) | 0.20 tokens/sec |
How to use
Install axllm
Option 1: clone the repository and run the install script:
git clone -b axllm https://github.com/AXERA-TECH/ax-llm.git
cd ax-llm
./install.sh
Option 2: one-line install (default branch: axllm):
curl -fsSL https://raw.githubusercontent.com/AXERA-TECH/ax-llm/axllm/install.sh | bash
Option 3: download the executable built by GitHub Actions CI (for users without a build environment):
Go to
https://github.com/AXERA-TECH/ax-llm/actions?query=branch%3Aaxllm
and download the latest CI-built executable (axllm), then:
chmod +x axllm
sudo mv axllm /usr/bin/axllm
Model download (Hugging Face)
Create the model directory, enter it, then download the model into it:
mkdir -p AXERA-TECH/Qwen3-Embedding-0.6B
cd AXERA-TECH/Qwen3-Embedding-0.6B
hf download AXERA-TECH/Qwen3-Embedding-0.6B --local-dir .
# structure of the downloaded files
tree -L 1
.
├── README.md
├── config.json
├── model.embed_tokens.weight.bfloat16.bin
├── qwen3_p128_l0_together.axmodel
...
├── qwen3_p128_l27_together.axmodel
├── qwen3_post.axmodel
├── tokenizer.txt
└── python_backup/
1 directory, 33 files
Embedding (/v1/embeddings) usage
axllm can load embedding models in serve mode and exposes an OpenAI-compatible /v1/embeddings endpoint (embedding models do not support the interactive run mode).
1. The embedding flag in config.json
This model's config.json already contains the embedding flag:
{
"is_embedding": true
}
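To check whether a downloaded model directory carries this flag before serving it, a minimal sketch (the helper name and directory layout are assumptions based on the file structure shown above):

```python
import json
from pathlib import Path

def is_embedding_model(model_dir: str) -> bool:
    """Return True if the model's config.json carries the is_embedding flag."""
    config = json.loads((Path(model_dir) / "config.json").read_text())
    return config.get("is_embedding", False)

# Inline example matching the config snippet above:
sample = json.loads('{"is_embedding": true}')
print(sample.get("is_embedding", False))  # True
```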
2. Start the server
axllm serve AXERA-TECH/Qwen3-Embedding-0.6B/ --port 8000
After startup, the terminal prints the full reachable API URLs (including 127.0.0.1 and the local NIC IPs).
3. Example calls
# health check
curl -s http://127.0.0.1:8000/health
# list registered models
curl -s http://127.0.0.1:8000/v1/models
# generate embeddings (input accepts a string or an array of strings)
curl -s http://127.0.0.1:8000/v1/embeddings \
-H 'Content-Type: application/json' \
-d '{"model":"AXERA-TECH/Qwen3-Embedding-0.6B","input":["hello","world"]}'
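A typical downstream use of the returned vectors is cosine similarity between inputs. The sketch below assumes the standard OpenAI-compatible response shape (one entry per input under "data"); the sample vectors are made-up stand-ins, not real model output:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# An OpenAI-compatible /v1/embeddings response carries one entry per input
# under "data"; the short vectors below are illustrative placeholders.
response = {
    "data": [
        {"index": 0, "embedding": [0.1, 0.2, 0.3]},
        {"index": 1, "embedding": [0.1, 0.2, 0.25]},
    ]
}
vectors = [item["embedding"] for item in response["data"]]
score = cosine_similarity(vectors[0], vectors[1])
print(round(score, 3))  # 0.996
```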
Python inference (backup)
For inference debugging with Python scripts, see the documentation in the python_backup/ directory.