Qwen3-Embedding-0.6B-Int8

This version of Qwen3-Embedding-0.6B has been converted to run on the Axera NPU using w8a16 quantization.

Compatible with Pulsar2 version: 4.1

Convert tools links:

For anyone interested in model conversion, you can try exporting an axmodel from the original repo: https://huggingface.co/Qwen/Qwen3-Embedding-0.6B

Pulsar2 documentation: How to Convert LLM from Huggingface to axmodel

AXera NPU LLM Runtime

Supported Platforms

Per-subgraph time consumption

g1: 5.561 ms
g2: 9.140 ms
g3: 12.757 ms
g4: 16.446 ms
g5: 21.392 ms
g6: 23.712 ms
g7: 27.174 ms
g8: 30.897 ms
g9: 34.829 ms
  • Shortest forward time: 5.561 ms
  • Longest forward time: 181.908 ms
  • LayerNum: 28
Chips   TTFT (w8a16)                        Throughput (w8a16)
AX650   155.708 ms (128 tokens, shortest)   0.82 tokens/sec
AX650   5093.42 ms (1024 tokens, longest)   0.20 tokens/sec
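The TTFT figures are consistent with multiplying the per-subgraph forward time by the layer count (this relationship is inferred from the published numbers, not stated by the vendor). A quick arithmetic check:

```python
# Sanity-check: TTFT ≈ per-layer forward time × LayerNum (28 decoder layers).
LAYER_NUM = 28

shortest_forward_ms = 5.561    # fastest forward (128-token prefill)
longest_forward_ms = 181.908   # slowest forward (1024-token prefill)

ttft_128 = shortest_forward_ms * LAYER_NUM
ttft_1024 = longest_forward_ms * LAYER_NUM

print(f"{ttft_128:.3f} ms")   # matches the 155.708 ms figure above
print(f"{ttft_1024:.2f} ms")  # matches the 5093.42 ms figure above
```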

How to use

Install axllm

Option 1: clone the repository and run the install script:

git clone -b axllm https://github.com/AXERA-TECH/ax-llm.git
cd ax-llm
./install.sh

Option 2: one-line install (default branch axllm):

curl -fsSL https://raw.githubusercontent.com/AXERA-TECH/ax-llm/axllm/install.sh | bash

Option 3: download the executable built by GitHub Actions CI (for users without a build environment):

Download the latest CI-built executable (axllm) from https://github.com/AXERA-TECH/ax-llm/actions?query=branch%3Aaxllm, then:

chmod +x axllm
sudo mv axllm /usr/bin/axllm

Model Download (Hugging Face)

Create the model directory, enter it, and download into it:

mkdir -p AXERA-TECH/Qwen3-Embedding-0.6B
cd AXERA-TECH/Qwen3-Embedding-0.6B
hf download AXERA-TECH/Qwen3-Embedding-0.6B --local-dir .

# structure of the downloaded files
tree -L 1
.
├── README.md
├── config.json
├── model.embed_tokens.weight.bfloat16.bin
├── qwen3_p128_l0_together.axmodel
...
├── qwen3_p128_l27_together.axmodel
├── qwen3_post.axmodel
├── tokenizer.txt
└── python_backup/

1 directory, 33 files

Embedding (/v1/embeddings) Usage

In serve mode, axllm can load embedding models and exposes an OpenAI-compatible /v1/embeddings endpoint (embedding models do not support the interactive run mode).

1. The embedding flag in config.json

This model's config.json already contains the embedding flag:

{
  "is_embedding": true
}
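For illustration, a launcher or client could read this flag before deciding whether to use the embeddings endpoint. The helper below is a hypothetical sketch, not part of axllm:

```python
import json

def is_embedding_model(config_path: str) -> bool:
    """Return True when config.json carries the "is_embedding" flag shown above."""
    with open(config_path) as f:
        config = json.load(f)
    return bool(config.get("is_embedding", False))
```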

2. Start the server

axllm serve AXERA-TECH/Qwen3-Embedding-0.6B/ --port 8000

On startup, the full reachable API URLs (both 127.0.0.1 and the host's network-interface IPs) are printed in the terminal.

3. Example calls

# health check
curl -s http://127.0.0.1:8000/health

# list registered models
curl -s http://127.0.0.1:8000/v1/models

# generate embeddings (input accepts a string or an array of strings)
curl -s http://127.0.0.1:8000/v1/embeddings \
    -H 'Content-Type: application/json' \
    -d '{"model":"AXERA-TECH/Qwen3-Embedding-0.6B","input":["hello","world"]}'
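The same endpoint can be called from the Python standard library. The sketch below assumes the request/response field names follow the OpenAI embeddings schema used in the curl example; the helper names are our own, and the cosine-similarity helper is just a common way to compare the returned vectors:

```python
import json
import math
import urllib.request

def get_embeddings(texts, base_url="http://127.0.0.1:8000",
                   model="AXERA-TECH/Qwen3-Embedding-0.6B"):
    """POST to /v1/embeddings and return one vector per input string."""
    req = urllib.request.Request(
        f"{base_url}/v1/embeddings",
        data=json.dumps({"model": model, "input": texts}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    # Each item in "data" carries one embedding, in input order.
    return [item["embedding"] for item in body["data"]]

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

Typical usage: `e1, e2 = get_embeddings(["hello", "world"])`, then rank candidates by `cosine_similarity(e1, e2)`.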

Python Inference (Backup)

For inference debugging with Python scripts, see the documentation in the python_backup/ directory.
