8 tps on NVIDIA H200

#17
by svilen333

Hi, I am testing the model on 1 x NVIDIA H200 with the latest vLLM. Is it normal to get 8 tps using 128K context, or am I doing something wrong?
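For reference, the setup is roughly this (a minimal sketch with the offline vLLM Python API; the model name and flags are illustrative, not the exact command I ran):

```python
# Minimal sketch: load the model with vLLM at 128K context on a single H200.
# Model name, dtype, and memory settings are assumptions; adjust to your setup.
from vllm import LLM

llm = LLM(
    model="IQuestLab/IQuest-Coder-V1-40B-Instruct",  # BF16 weights
    dtype="bfloat16",
    max_model_len=131072,        # 128K context window
    tensor_parallel_size=1,      # single GPU
    gpu_memory_utilization=0.95,
)
```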

Hi
That is definitely not normal. How many concurrent requests are you running?

Only one request. Using the BF16 version.

Yeah, then something is wrong; the auto calibrator might not have picked up the top_k and top_p parameters. What are your input and output lengths in the test?

Input length is 15 tokens, output is over 1,000. I just gave it a simple task to code something in HTML + JS.
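The test is essentially the sketch below (illustrative: the prompt, sampling values, and timing code are assumptions, reusing the llm object from the sketch above):

```python
import time
from vllm import SamplingParams

# Set sampling parameters explicitly in case the defaults (top_k / top_p)
# were not picked up from the model's generation config.
params = SamplingParams(temperature=0.7, top_p=0.95, top_k=20, max_tokens=1024)

prompt = "Write a simple HTML + JavaScript page with a button that shows an alert."

start = time.time()
outputs = llm.generate([prompt], params)
elapsed = time.time() - start

n_tokens = len(outputs[0].outputs[0].token_ids)
print(f"{n_tokens} tokens in {elapsed:.1f}s -> {n_tokens / elapsed:.1f} tok/s")
```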

Question: is there a big difference between IQuestLab/IQuest-Coder-V1-40B-Loop-Instruct and IQuestLab/IQuest-Coder-V1-40B-Instruct in terms of output?

I’m part of an early-stage custom inference stack called DeployPad (https://www.geodd.io). We are able to run models about 50% faster than vanilla serving; our team ran IQuest-Coder-V1-40B-Instruct on a single H200 at 80 tps per user. Would you like to try it out? This is still an early beta and we’re running more tests, but the results so far look promising. We want to extend our support to models from IQuestLab, like the Loop Instruct.
