8 tps on NVIDIA H200

#17
by svilen333

Hi, I am testing the model on 1 x NVIDIA H200 with the latest vLLM. Is it normal to get 8 tps using 128K context, or am I doing something wrong?
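For reference, the setup is roughly this (a minimal sketch with the offline vLLM Python API; the model name and flags are illustrative, not the exact command I ran):

```python
# Minimal sketch: load the model with vLLM at 128K context on a single H200.
# Model name, dtype, and memory settings are assumptions; adjust to your setup.
from vllm import LLM

llm = LLM(
    model="IQuestLab/IQuest-Coder-V1-40B-Instruct",  # BF16 weights
    dtype="bfloat16",
    max_model_len=131072,        # 128K context window
    tensor_parallel_size=1,      # single GPU
    gpu_memory_utilization=0.95,
)
```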

Hi
That is definitely not normal. How many concurrent requests are you running?

Only one request. Using the BF16 version.

Yeah, then something is wrong; the auto calibrator might not have picked up the top_k and top_p parameters. What are your input and output lengths in the test?

Input length is 15 tokens, output is over 1,000. I just gave it a simple task to code something in HTML + JS.
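The test is essentially the sketch below (illustrative: the prompt, sampling values, and timing code are assumptions, reusing the llm object from the sketch above):

```python
import time
from vllm import SamplingParams

# Set sampling parameters explicitly in case the defaults (top_k / top_p)
# were not picked up from the model's generation config.
params = SamplingParams(temperature=0.7, top_p=0.95, top_k=20, max_tokens=1024)

prompt = "Write a simple HTML + JavaScript page with a button that shows an alert."

start = time.time()
outputs = llm.generate([prompt], params)
elapsed = time.time() - start

n_tokens = len(outputs[0].outputs[0].token_ids)
print(f"{n_tokens} tokens in {elapsed:.1f}s -> {n_tokens / elapsed:.1f} tok/s")
```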

Question: is there a big difference between IQuestLab/IQuest-Coder-V1-40B-Loop-Instruct and IQuestLab/IQuest-Coder-V1-40B-Instruct in terms of output?

I’m part of an early-stage custom inference stack called DeployPad (https://www.geodd.io). We are able to run models about 50% faster than vanilla serving; our team ran IQuest-Coder-V1-40B-Instruct on a single H200 at 80 tps per user. Would you like to try it out? This is still an early beta and we’re running more tests, but the results so far look promising. We want to extend our support to models from IQuestLab, like the Loop Instruct.
