Qwen3-Embedding-0.6B-Int8
This version of Qwen3-Embedding-0.6B has been converted to run on the Axera NPU using w8a16 quantization.
Compatible with Pulsar2 version: 4.1
Conversion tool links:
If you are interested in model conversion, you can try exporting the axmodel from the original repo: https://huggingface.co/Qwen/Qwen3-Embedding-0.6B
Pulsar2 Link: How to Convert LLM from Huggingface to axmodel
Support Platform
- AX650
- AX650N DEMO Board
- M4N-Dock(爱芯派Pro)
- M.2 Accelerator card
Per-subgraph latency:
- g1: 5.561 ms
- g2: 9.140 ms
- g3: 12.757 ms
- g4: 16.446 ms
- g5: 21.392 ms
- g6: 23.712 ms
- g7: 27.174 ms
- g8: 30.897 ms
- g9: 34.829 ms
- Shortest forward latency: 5.561 ms
- Longest forward latency: 181.908 ms
- LayerNum: 28
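The shortest and longest figures follow directly from the per-subgraph numbers above: the shortest forward pass runs only g1, and the longest appears to be all nine subgraphs back to back. A quick sanity check:

```python
# Per-subgraph latencies (ms), copied from the list above.
subgraph_ms = [5.561, 9.140, 12.757, 16.446, 21.392, 23.712, 27.174, 30.897, 34.829]

shortest = min(subgraph_ms)            # a single-subgraph forward pass
longest = round(sum(subgraph_ms), 3)   # all nine subgraphs run back to back

print(shortest)  # 5.561
print(longest)   # 181.908
```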
| Chip | TTFT | w8a16 throughput |
|---|---|---|
| AX650 | 155.708 ms (128 tokens, shortest) | 0.82 tokens/sec |
| AX650 | 5093.42 ms (1024 tokens, longest) | 0.20 tokens/sec |
How to use
Install axllm
Option 1: clone the repository and run the install script:
git clone -b axllm https://github.com/AXERA-TECH/ax-llm.git
cd ax-llm
./install.sh
Option 2: one-line install (default branch: axllm):
curl -fsSL https://raw.githubusercontent.com/AXERA-TECH/ax-llm/axllm/install.sh | bash
Option 3: download the executable built by GitHub Actions CI (for users without a build environment):
Go to
https://github.com/AXERA-TECH/ax-llm/actions?query=branch%3Aaxllm
and download the latest CI-built executable (axllm), then:
chmod +x axllm
sudo mv axllm /usr/bin/axllm
Model download (Hugging Face)
Create the model directory, enter it, then download the model into it:
mkdir -p AXERA-TECH/Qwen3-Embedding-0.6B
cd AXERA-TECH/Qwen3-Embedding-0.6B
hf download AXERA-TECH/Qwen3-Embedding-0.6B --local-dir .
# structure of the downloaded files
tree -L 1
.
├── README.md
├── config.json
├── model.embed_tokens.weight.bfloat16.bin
├── qwen3_p128_l0_together.axmodel
...
├── qwen3_p128_l27_together.axmodel
├── qwen3_post.axmodel
├── tokenizer.txt
└── python_backup/
1 directory, 33 files
Embedding (/v1/embeddings) usage
axllm can load embedding models in serve mode and exposes an OpenAI-compatible /v1/embeddings endpoint (embedding models do not support the interactive run mode).
1. The embedding flag in config.json
This model's config.json already contains the embedding flag:
{
"is_embedding": true
}
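To check whether a downloaded model directory carries this flag before serving it, a minimal sketch (the helper name and directory layout are assumptions based on the file structure shown above):

```python
import json
from pathlib import Path

def is_embedding_model(model_dir: str) -> bool:
    """Return True if the model's config.json carries the is_embedding flag."""
    config = json.loads((Path(model_dir) / "config.json").read_text())
    return config.get("is_embedding", False)

# Inline example matching the config snippet above:
sample = json.loads('{"is_embedding": true}')
print(sample.get("is_embedding", False))  # True
```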
2. Start the server
axllm serve AXERA-TECH/Qwen3-Embedding-0.6B/ --port 8000
After startup, the terminal prints the full reachable API URLs (including 127.0.0.1 and the local NIC IPs).
3. Example calls
# health check
curl -s http://127.0.0.1:8000/health
# list registered models
curl -s http://127.0.0.1:8000/v1/models
# generate embeddings (input accepts a string or an array of strings)
curl -s http://127.0.0.1:8000/v1/embeddings \
-H 'Content-Type: application/json' \
-d '{"model":"AXERA-TECH/Qwen3-Embedding-0.6B","input":["hello","world"]}'
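A typical downstream use of the returned vectors is cosine similarity between inputs. The sketch below assumes the standard OpenAI-compatible response shape (one entry per input under "data"); the sample vectors are made-up stand-ins, not real model output:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# An OpenAI-compatible /v1/embeddings response carries one entry per input
# under "data"; the short vectors below are illustrative placeholders.
response = {
    "data": [
        {"index": 0, "embedding": [0.1, 0.2, 0.3]},
        {"index": 1, "embedding": [0.1, 0.2, 0.25]},
    ]
}
vectors = [item["embedding"] for item in response["data"]]
score = cosine_similarity(vectors[0], vectors[1])
print(round(score, 3))  # 0.996
```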
Python inference (backup)
For inference debugging with Python scripts, see the documentation in the python_backup/ directory.