Once again, thanks. Here is my review for the 8x RTX 5090 setup.
On the RTX 5090 setup I observed slower TPS with MTP enabled:
--speculative-config.method mtp
--speculative-config.num_speculative_tokens 1
Avg 25 TPS
With the limited VRAM of 256 GB, I think RTX 5090 users should turn it off for more KV cache and better performance?
Once I disabled MTP I attained 35-41 TPS (Ubuntu with GDM running; I'd expect a little more TPS on a headless setup).
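For context, the full launch was roughly along these lines (a sketch, not my exact command; the model path and tensor-parallel size are placeholders, the speculative flags are the ones above):

```bash
# Hypothetical launch sketch, not my exact command; the model path and
# tensor-parallel size are placeholders.
MODEL=/path/to/the/quantized/checkpoint

# With MTP speculative decoding (the slower case on my 5090s):
vllm serve "$MODEL" \
  --tensor-parallel-size 8 \
  --speculative-config.method mtp \
  --speculative-config.num_speculative_tokens 1

# Without MTP (the faster case for me): same command, minus the two speculative flags.
```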
---- Without MTP enabled -----
(APIServer pid=21036) INFO: 127.0.0.1:50526 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=21036) INFO 12-25 02:01:12 [loggers.py:257] Engine 000: Avg prompt throughput: 1383.7 tokens/s, Avg generation throughput: 13.8 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 9.1%, Prefix cache hit rate: 0.0%
(APIServer pid=21036) INFO 12-25 02:01:22 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 40.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 9.3%, Prefix cache hit rate: 0.0%
(APIServer pid=21036) INFO 12-25 02:01:32 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 41.9 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 9.6%, Prefix cache hit rate: 0.0%
---with MTP enabled ------
(APIServer pid=19990) INFO: Application startup complete.
(APIServer pid=19990) INFO: 127.0.0.1:35818 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=19990) INFO 12-25 01:51:58 [loggers.py:257] Engine 000: Avg prompt throughput: 1260.1 tokens/s, Avg generation throughput: 13.2 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 10.3%, Prefix cache hit rate: 0.0%
(APIServer pid=19990) INFO 12-25 01:51:58 [metrics.py:100] SpecDecoding metrics: Mean acceptance length: 1.89, Accepted throughput: 0.77 tokens/s, Drafted throughput: 0.86 tokens/s, Accepted: 62 tokens, Drafted: 70 tokens, Per-position acceptance rate: 0.886, Avg Draft acceptance rate: 88.6%
(APIServer pid=19990) INFO 12-25 01:52:08 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 27.2 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 10.6%, Prefix cache hit rate: 0.0%
(APIServer pid=19990) INFO 12-25 01:52:08 [metrics.py:100] SpecDecoding metrics: Mean acceptance length: 1.78, Accepted throughput: 11.90 tokens/s, Drafted throughput: 15.30 tokens/s, Accepted: 119 tokens, Drafted: 153 tokens, Per-position acceptance rate: 0.778, Avg Draft acceptance rate: 77.8%
(APIServer pid=19990) INFO 12-25 01:52:18 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 27.1 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 10.8%, Prefix cache hit rate: 0.0%
(APIServer pid=19990) INFO 12-25 01:52:18 [metrics.py:100] SpecDecoding metrics: Mean acceptance length: 1.78, Accepted throughput: 11.90 tokens/s, Drafted throughput: 15.20 tokens/s, Accepted: 119 tokens, Drafted: 152 tokens, Per-position acceptance rate: 0.783, Avg Draft acceptance rate: 78.3%
(APIServer pid=19990) INFO 12-25 01:52:28 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 28.5 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 11.0%, Prefix cache hit rate: 0.0%
(APIServer pid=19990) INFO 12-25 01:52:28 [metrics.py:100] SpecDecoding metrics: Mean acceptance length: 1.79, Accepted throughput: 12.60 tokens/s, Drafted throughput: 15.90 tokens/s, Accepted: 126 tokens, Drafted: 159 tokens, Per-position acceptance rate: 0.792, Avg Draft acceptance rate: 79.2%
(APIServer pid=19990) INFO 12-25 01:52:38 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.6 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=19990) INFO 12-25 01:52:38 [metrics.py:100] SpecDecoding metrics: Mean acceptance length: 2.00, Accepted throughput: 0.30 tokens/s, Drafted throughput: 0.30 tokens/s, Accepted: 3 tokens, Drafted: 3 tokens, Per-position acceptance rate: 1.000, Avg Draft acceptance rate: 100.0%
(APIServer pid=19990) INFO 12-25 01:52:48 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
Am I doing anything wrong? Just asking in case.
Emmm, I guess you would have to let go of "export VLLM_USE_DEEP_GEMM=0" and give it a try. But I'm not sure.
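Something like this, I mean (a minimal sketch; assuming the variable was exported in the same shell or startup script that launches vLLM):

```bash
# Make sure the earlier export is no longer in effect in the shell that launches vLLM,
# then start the server again and compare TPS with and without MTP.
unset VLLM_USE_DEEP_GEMM
vllm serve "$MODEL" --tensor-parallel-size 8   # MODEL: placeholder for the checkpoint path
```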
I ran a test on my 8x4090(48GB) rig:
---- w/o MTP enabled -----
(APIServer pid=143220) INFO 12-25 15:24:40 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 53.9 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.2%, Prefix cache hit rate: 0.0%
(APIServer pid=143220) INFO 12-25 15:24:50 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 54.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.3%, Prefix cache hit rate: 0.0%
(APIServer pid=143220) INFO 12-25 15:25:00 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 53.9 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.4%, Prefix cache hit rate: 0.0%
---with MTP enabled ------
(APIServer pid=136891) INFO 12-25 15:17:58 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 73.7 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.2%, Prefix cache hit rate: 0.0%
(APIServer pid=136891) INFO 12-25 15:17:58 [metrics.py:100] SpecDecoding metrics: Mean acceptance length: 1.66, Accepted throughput: 29.30 tokens/s, Drafted throughput: 44.39 tokens/s, Accepted: 293 tokens, Drafted: 444 tokens, Per-position acceptance rate: 0.660, Avg Draft acceptance rate: 66.0%
(APIServer pid=136891) INFO 12-25 15:18:08 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 66.6 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.4%, Prefix cache hit rate: 0.0%
(APIServer pid=136891) INFO 12-25 15:18:08 [metrics.py:100] SpecDecoding metrics: Mean acceptance length: 1.63, Accepted throughput: 25.80 tokens/s, Drafted throughput: 40.80 tokens/s, Accepted: 258 tokens, Drafted: 408 tokens, Per-position acceptance rate: 0.632, Avg Draft acceptance rate: 63.2%
(APIServer pid=136891) INFO 12-25 15:18:18 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 62.7 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.5%, Prefix cache hit rate: 0.0%
(APIServer pid=136891) INFO 12-25 15:18:18 [metrics.py:100] SpecDecoding metrics: Mean acceptance length: 1.65, Accepted throughput: 24.60 tokens/s, Drafted throughput: 38.10 tokens/s, Accepted: 246 tokens, Drafted: 381 tokens, Per-position acceptance rate: 0.646, Avg Draft acceptance rate: 64.6%
(APIServer pid=136891) INFO 12-25 15:18:28 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 58.6 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.7%, Prefix cache hit rate: 0.0%
My results look reasonable, for reference.
Thanks, tried it and it's still worse. I will disable MTP for my setup. Not sure what causes the slowdown.
> Thanks, tried it and it's still worse. I will disable MTP for my setup. Not sure what causes the slowdown.
I had the same issue with speculative decoding on GLM-4.5V-FP8 (official quant) with an RTX Pro 6000 (https://github.com/vllm-project/vllm/issues/26838#issuecomment-3563172299), so I'm not sure if it's the quant itself that's problematic.
Edit: Ah, but in this specific unofficial-quant case, the MTP layer would predict what the unquantized or official FP8 model would output; it would need to be retrained for the quant, and obviously that is quite complex. So for any quant, the MTP layer should probably be stripped, or mispredictions would lead to more work.
> Thanks, tried it and it's still worse. I will disable MTP for my setup. Not sure what causes the slowdown.
In your case, the most prominent issue is that your 8x 5090 rig runs slower than my 8x 4090, even without MTP enabled...
My rig runs at 53.9 tokens/s, but yours at 40.0 tokens/s.
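If it were my rig I would also double-check that every card negotiated its full PCIe link (just a guess on my side, nothing in your logs shows this), e.g.:

```bash
# Show the PCIe generation and lane width each GPU actually negotiated;
# a card stuck at Gen1 or x1/x4 on a riser will drag tensor-parallel decode down.
nvidia-smi --query-gpu=index,name,pcie.link.gen.current,pcie.link.width.current --format=csv
```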
> Thanks, tried it and it's still worse. I will disable MTP for my setup. Not sure what causes the slowdown.
> I had the same issue with speculative decoding on GLM-4.5V-FP8 (official quant) with an RTX Pro 6000 (https://github.com/vllm-project/vllm/issues/26838#issuecomment-3563172299), so I'm not sure if it's the quant itself that's problematic.
> Edit: Ah, but in this specific unofficial-quant case, the MTP layer would predict what the unquantized or official FP8 model would output; it would need to be retrained for the quant, and obviously that is quite complex. So for any quant, the MTP layer should probably be stripped, or mispredictions would lead to more work.
"mispredictions would lead to more work." at the least, this is not the case, as you can see from Avg Draft acceptance rate, which is comparable to that of the bf16 version.
"mispredictions would lead to more work." at the least, this is not the case, as you can see from Avg Draft acceptance rate, which is comparable to that of the bf16 version.
Ah good catch
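For what it's worth, a rough back-of-the-envelope from the numbers above (my own reasoning, not something vLLM reports directly): with num_speculative_tokens = 1 and a per-position acceptance rate of ~0.78, each draft-plus-verify step yields about 1 + 1 x 0.78 ≈ 1.78 tokens, which matches the logged mean acceptance length. So MTP only pays off if that combined step costs less than roughly 1.78x a plain decode step; if the MTP head's forward pass, or a kernel path it forces, is slow on a given card, you can see ~78% acceptance and still end up slower overall, which would fit the 25 vs. 40 TPS numbers on the 5090s.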
> I ran a test on my 8x4090(48GB) rig:
Will I be able to run it on a 2x RTX Pro 6000 rig (192 GB total VRAM)?
How do I quantize it to fit within that VRAM?
> Thanks, tried it and it's still worse. I will disable MTP for my setup. Not sure what causes the slowdown.
> In your case, the most prominent issue is that your 8x 5090 rig runs slower than my 8x 4090, even without MTP enabled...
> My rig runs at 53.9 tokens/s, but yours at 40.0 tokens/s.
I re-set up my rig with a proper chassis, all 8 PCIe links properly connected,
and updated to the latest vLLM nightly.
Now it seems to hit 40-50 TPS. I think vLLM is not yet fully optimized for the RTX 5090, but the speed and quality seem good.
Speculative decoding also works for me now, hitting up to 60-70 TPS. Not as good as yours, but happy enough.
Tried tp=8, going to
Just like to ask:
When you trained this model, did you expose the thinking together with the outputs? It would be nice to be able to conceal the thinking, though.
Latest result:
(APIServer pid=22840) INFO 01-03 17:19:18 [loggers.py:257] Engine 000: Avg prompt throughput: 264.8 tokens/s, Avg generation throughput: 37.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 1.3%, Prefix cache hit rate: 0.0%
(APIServer pid=22840) INFO 01-03 17:19:18 [metrics.py:100] SpecDecoding metrics: Mean acceptance length: 1.71, Accepted throughput: 3.02 tokens/s, Drafted throughput: 4.24 tokens/s, Accepted: 154 tokens, Drafted: 216 tokens, Per-position acceptance rate: 0.713, Avg Draft acceptance rate: 71.3%
(APIServer pid=22840) INFO 01-03 17:19:28 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 65.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 1.5%, Prefix cache hit rate: 0.0%
(APIServer pid=22840) INFO 01-03 17:19:28 [metrics.py:100] SpecDecoding metrics: Mean acceptance length: 1.67, Accepted throughput: 26.10 tokens/s, Drafted throughput: 38.90 tokens/s, Accepted: 261 tokens, Drafted: 389 tokens, Per-position acceptance rate: 0.671, Avg Draft acceptance rate: 67.1%
(APIServer pid=22840) INFO 01-03 17:19:38 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 63.8 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 1.8%, Prefix cache hit rate: 0.0%
(APIServer pid=22840) INFO 01-03 17:19:38 [metrics.py:100] SpecDecoding metrics: Mean acceptance length: 1.64, Accepted throughput: 24.90 tokens/s, Drafted throughput: 38.90 tokens/s, Accepted: 249 tokens, Drafted: 389 tokens, Per-position acceptance rate: 0.640, Avg Draft acceptance rate: 64.0%
(APIServer pid=22840) INFO 01-03 17:19:48 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 64.2 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 2.1%, Prefix cache hit rate: 0.0%
(APIServer pid=22840) INFO 01-03 17:19:48 [metrics.py:100] SpecDecoding metrics: Mean acceptance length: 1.65, Accepted throughput: 25.40 tokens/s, Drafted throughput: 38.80 tokens/s, Accepted: 254 tokens, Drafted: 388 tokens, Per-position acceptance rate: 0.655, Avg Draft acceptance rate: 65.5%
(APIServer pid=22840) INFO 01-03 17:19:58 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 63.8 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 2.3%, Prefix cache hit rate: 0.0%
(APIServer pid=22840) INFO 01-03 17:19:58 [metrics.py:100] SpecDecoding metrics: Mean acceptance length: 1.64, Accepted throughput: 24.90 tokens/s, Drafted throughput: 38.90 tokens/s, Accepted: 249 tokens, Drafted: 389 tokens, Per-position acceptance rate: 0.640, Avg Draft acceptance rate: 64.0%
(APIServer pid=22840) INFO 01-03 17:20:08 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 64.2 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 2.6%, Prefix cache hit rate: 0.0%
(APIServer pid=22840) INFO 01-03 17:20:08 [metrics.py:100] SpecDecoding metrics: Mean acceptance length: 1.65, Accepted throughput: 25.40 tokens/s, Drafted throughput: 38.80 tokens/s, Accepted: 254 tokens, Drafted: 388 tokens, Per-position acceptance rate: 0.655, Avg Draft acceptance rate: 65.5%
(APIServer pid=22840) INFO: 172.17.0.2:60184 - "GET /v1/models HTTP/1.1" 200 OK
(APIServer pid=22840) INFO 01-03 17:20:18 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 51.5 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.8%, Prefix cache hit rate: 0.0%
(APIServer pid=22840) INFO 01-03 17:20:18 [metrics.py:100] SpecDecoding metrics: Mean acceptance length: 1.65, Accepted throughput: 20.30 tokens/s, Drafted throughput: 31.20 tokens/s, Accepted: 203 tokens, Drafted: 312 tokens, Per-position acceptance rate: 0.651, Avg Draft acceptance rate: 65.1%
(APIServer pid=22840) INFO 01-03 17:20:28 [loggers.py:257] Engine 000: Avg prompt throughput: 695.1 tokens/s, Avg generation throughput: 46.3 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 3.1%, Prefix cache hit rate: 0.0%
(APIServer pid=22840) INFO 01-03 17:20:28 [metrics.py:100] SpecDecoding metrics: Mean acceptance length: 1.65, Accepted throughput: 18.30 tokens/s, Drafted throughput: 28.00 tokens/s, Accepted: 183 tokens, Drafted: 280 tokens, Per-position acceptance rate: 0.654, Avg Draft acceptance rate: 65.4%
(APIServer pid=22840) INFO 01-03 17:20:38 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 67.3 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 3.4%, Prefix cache hit rate: 0.0%
(APIServer pid=22840) INFO 01-03 17:20:38 [metrics.py:100] SpecDecoding metrics: Mean acceptance length: 1.74, Accepted throughput: 28.70 tokens/s, Drafted throughput: 38.60 tokens/s, Accepted: 287 tokens, Drafted: 386 tokens, Per-position acceptance rate: 0.744, Avg Draft acceptance rate: 74.4%
(APIServer pid=22840) INFO: 172.17.0.2:60198 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=22840) INFO 01-03 17:20:48 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 4.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=22840) INFO 01-03 17:20:48 [metrics.py:100] SpecDecoding metrics: Mean acceptance length: 1.82, Accepted throughput: 1.80 tokens/s, Drafted throughput: 2.20 tokens/s, Accepted: 18 tokens, Drafted: 22 tokens, Per-position acceptance rate: 0.818, Avg Draft acceptance rate: 81.8%
(APIServer pid=22840) INFO 01-03 17:20:58 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=22840) INFO: 172.17.0.2:56364 - "GET /v1/models HTTP/1.1" 200 OK
(APIServer pid=22840) INFO: 172.17.0.2:56372 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=22840) INFO 01-03 17:21:58 [loggers.py:257] Engine 000: Avg prompt throughput: 2.8 tokens/s, Avg generation throughput: 62.9 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.3%, Prefix cache hit rate: 0.0%
(APIServer pid=22840) INFO 01-03 17:21:58 [metrics.py:100] SpecDecoding metrics: Mean acceptance length: 1.77, Accepted throughput: 3.90 tokens/s, Drafted throughput: 5.07 tokens/s, Accepted: 273 tokens, Drafted: 355 tokens, Per-position acceptance rate: 0.769, Avg Draft acceptance rate: 76.9%
(APIServer pid=22840) INFO 01-03 17:22:08 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 68.4 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.6%, Prefix cache hit rate: 0.0%
(APIServer pid=22840) INFO 01-03 17:22:08 [metrics.py:100] SpecDecoding metrics: Mean acceptance length: 1.76, Accepted throughput: 29.50 tokens/s, Drafted throughput: 38.90 tokens/s, Accepted: 295 tokens, Drafted: 389 tokens, Per-position acceptance rate: 0.758, Avg Draft acceptance rate: 75.8%
(APIServer pid=22840) INFO 01-03 17:22:18 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 66.7 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.8%, Prefix cache hit rate: 0.0%
(APIServer pid=22840) INFO 01-03 17:22:18 [metrics.py:100] SpecDecoding metrics: Mean acceptance length: 1.71, Accepted throughput: 27.70 tokens/s, Drafted throughput: 39.00 tokens/s, Accepted: 277 tokens, Drafted: 390 tokens, Per-position acceptance rate: 0.710, Avg Draft acceptance rate: 71.0%
(APIServer pid=22840) INFO 01-03 17:22:28 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 70.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 1.1%, Prefix cache hit rate: 0.0%
(APIServer pid=22840) INFO 01-03 17:22:28 [metrics.py:100] SpecDecoding metrics: Mean acceptance length: 1.80, Accepted throughput: 31.20 tokens/s, Drafted throughput: 38.79 tokens/s, Accepted: 312 tokens, Drafted: 388 tokens, Per-position acceptance rate: 0.804, Avg Draft acceptance rate: 80.4%
> I ran a test on my 8x4090(48GB) rig:
> Will I be able to run it on a 2x RTX Pro 6000 rig (192 GB total VRAM)?
> How do I quantize it to fit within that VRAM?
I don't think it's possible with vLLM: 192 GB isn't enough to cover the weights, let alone leave KV cache for context. GLM 4.6V is your best bet, IMHO.
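Rough arithmetic behind that (and I'm assuming this model is in the 200B+ parameter class, so treat it as a guess): FP8 weights take about 1 byte per parameter, so 200B+ parameters is already 200+ GB of weights before any KV cache or activation buffers, and vLLM additionally reserves a fraction of each GPU via --gpu-memory-utilization. Even a ~4-bit quant (~0.5 bytes per parameter plus overhead) would leave very little room for context on 192 GB.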
> I ran a test on my 8x4090(48GB) rig:
> Will I be able to run it on a 2x RTX Pro 6000 rig (192 GB total VRAM)?
> How do I quantize it to fit within that VRAM?
> I don't think it's possible with vLLM: 192 GB isn't enough to cover the weights, let alone leave KV cache for context. GLM 4.6V is your best bet, IMHO.
Edit: sorry! Thought it was m2.1 :)
Haha, I was about to ask how you managed to do that, as I am struggling a little with 256 GB of VRAM. Get one more Pro and run it with PP (pipeline parallelism), hehehe. It is a damn good model.