Once again, thanks. Here is my review for the 8x RTX 5090 setup.
On the RTX 5090 setup I observed slower TPS with MTP enabled:
--speculative-config.method mtp
--speculative-config.num_speculative_tokens 1
Avg 25 TPS
With the limited VRAM of 256 GB, I think RTX 5090 users should turn it off for more KV cache and better performance?
Once I disabled MTP I attained 35-41 TPS (Ubuntu with GDM running; I'd expect a little more TPS on a headless setup).
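For context, the full launch was roughly along these lines (a sketch, not my exact command; the model path and tensor-parallel size are placeholders, the speculative flags are the ones above):

```bash
# Hypothetical launch sketch, not my exact command; the model path and
# tensor-parallel size are placeholders.
MODEL=/path/to/the/quantized/checkpoint

# With MTP speculative decoding (the slower case on my 5090s):
vllm serve "$MODEL" \
  --tensor-parallel-size 8 \
  --speculative-config.method mtp \
  --speculative-config.num_speculative_tokens 1

# Without MTP (the faster case for me): same command, minus the two speculative flags.
```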
---- Without MTP enabled -----
(APIServer pid=21036) INFO: 127.0.0.1:50526 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=21036) INFO 12-25 02:01:12 [loggers.py:257] Engine 000: Avg prompt throughput: 1383.7 tokens/s, Avg generation throughput: 13.8 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 9.1%, Prefix cache hit rate: 0.0%
(APIServer pid=21036) INFO 12-25 02:01:22 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 40.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 9.3%, Prefix cache hit rate: 0.0%
(APIServer pid=21036) INFO 12-25 02:01:32 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 41.9 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 9.6%, Prefix cache hit rate: 0.0%
---with MTP enabled ------
(APIServer pid=19990) INFO: Application startup complete.
(APIServer pid=19990) INFO: 127.0.0.1:35818 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=19990) INFO 12-25 01:51:58 [loggers.py:257] Engine 000: Avg prompt throughput: 1260.1 tokens/s, Avg generation throughput: 13.2 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 10.3%, Prefix cache hit rate: 0.0%
(APIServer pid=19990) INFO 12-25 01:51:58 [metrics.py:100] SpecDecoding metrics: Mean acceptance length: 1.89, Accepted throughput: 0.77 tokens/s, Drafted throughput: 0.86 tokens/s, Accepted: 62 tokens, Drafted: 70 tokens, Per-position acceptance rate: 0.886, Avg Draft acceptance rate: 88.6%
(APIServer pid=19990) INFO 12-25 01:52:08 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 27.2 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 10.6%, Prefix cache hit rate: 0.0%
(APIServer pid=19990) INFO 12-25 01:52:08 [metrics.py:100] SpecDecoding metrics: Mean acceptance length: 1.78, Accepted throughput: 11.90 tokens/s, Drafted throughput: 15.30 tokens/s, Accepted: 119 tokens, Drafted: 153 tokens, Per-position acceptance rate: 0.778, Avg Draft acceptance rate: 77.8%
(APIServer pid=19990) INFO 12-25 01:52:18 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 27.1 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 10.8%, Prefix cache hit rate: 0.0%
(APIServer pid=19990) INFO 12-25 01:52:18 [metrics.py:100] SpecDecoding metrics: Mean acceptance length: 1.78, Accepted throughput: 11.90 tokens/s, Drafted throughput: 15.20 tokens/s, Accepted: 119 tokens, Drafted: 152 tokens, Per-position acceptance rate: 0.783, Avg Draft acceptance rate: 78.3%
(APIServer pid=19990) INFO 12-25 01:52:28 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 28.5 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 11.0%, Prefix cache hit rate: 0.0%
(APIServer pid=19990) INFO 12-25 01:52:28 [metrics.py:100] SpecDecoding metrics: Mean acceptance length: 1.79, Accepted throughput: 12.60 tokens/s, Drafted throughput: 15.90 tokens/s, Accepted: 126 tokens, Drafted: 159 tokens, Per-position acceptance rate: 0.792, Avg Draft acceptance rate: 79.2%
(APIServer pid=19990) INFO 12-25 01:52:38 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.6 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=19990) INFO 12-25 01:52:38 [metrics.py:100] SpecDecoding metrics: Mean acceptance length: 2.00, Accepted throughput: 0.30 tokens/s, Drafted throughput: 0.30 tokens/s, Accepted: 3 tokens, Drafted: 3 tokens, Per-position acceptance rate: 1.000, Avg Draft acceptance rate: 100.0%
(APIServer pid=19990) INFO 12-25 01:52:48 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
Am I doing anything wrong? Just asking in case.
Emmm, I guess you would have to let go of "export VLLM_USE_DEEP_GEMM=0" and give it a try. But I'm not sure.
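Something like this, I mean (a minimal sketch; assuming the variable was exported in the same shell or startup script that launches vLLM):

```bash
# Make sure the earlier export is no longer in effect in the shell that launches vLLM,
# then start the server again and compare TPS with and without MTP.
unset VLLM_USE_DEEP_GEMM
vllm serve "$MODEL" --tensor-parallel-size 8   # MODEL: placeholder for the checkpoint path
```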
I ran a test on my 8x4090(48GB) rig:
---- w/o MTP enabled -----
(APIServer pid=143220) INFO 12-25 15:24:40 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 53.9 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.2%, Prefix cache hit rate: 0.0%
(APIServer pid=143220) INFO 12-25 15:24:50 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 54.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.3%, Prefix cache hit rate: 0.0%
(APIServer pid=143220) INFO 12-25 15:25:00 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 53.9 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.4%, Prefix cache hit rate: 0.0%
---with MTP enabled ------
(APIServer pid=136891) INFO 12-25 15:17:58 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 73.7 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.2%, Prefix cache hit rate: 0.0%
(APIServer pid=136891) INFO 12-25 15:17:58 [metrics.py:100] SpecDecoding metrics: Mean acceptance length: 1.66, Accepted throughput: 29.30 tokens/s, Drafted throughput: 44.39 tokens/s, Accepted: 293 tokens, Drafted: 444 tokens, Per-position acceptance rate: 0.660, Avg Draft acceptance rate: 66.0%
(APIServer pid=136891) INFO 12-25 15:18:08 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 66.6 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.4%, Prefix cache hit rate: 0.0%
(APIServer pid=136891) INFO 12-25 15:18:08 [metrics.py:100] SpecDecoding metrics: Mean acceptance length: 1.63, Accepted throughput: 25.80 tokens/s, Drafted throughput: 40.80 tokens/s, Accepted: 258 tokens, Drafted: 408 tokens, Per-position acceptance rate: 0.632, Avg Draft acceptance rate: 63.2%
(APIServer pid=136891) INFO 12-25 15:18:18 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 62.7 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.5%, Prefix cache hit rate: 0.0%
(APIServer pid=136891) INFO 12-25 15:18:18 [metrics.py:100] SpecDecoding metrics: Mean acceptance length: 1.65, Accepted throughput: 24.60 tokens/s, Drafted throughput: 38.10 tokens/s, Accepted: 246 tokens, Drafted: 381 tokens, Per-position acceptance rate: 0.646, Avg Draft acceptance rate: 64.6%
(APIServer pid=136891) INFO 12-25 15:18:28 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 58.6 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.7%, Prefix cache hit rate: 0.0%
My results look reasonable, for reference.
Thanks, tried it and it's still worse. I will disable MTP for my setup. Not sure what causes the slowdown.
> Thanks, tried it and it's still worse. I will disable MTP for my setup. Not sure what causes the slowdown.
I had the same issue with speculative decoding on GLM-4.5V-FP8 (official quant) with an RTX Pro 6000 (https://github.com/vllm-project/vllm/issues/26838#issuecomment-3563172299), so I'm not sure if it's the quant itself that's problematic.
Edit: Ah, but in this specific unofficial-quant case, the MTP layer would predict what the unquantized or official FP8 model would output; it would need to be retrained for the quant, and obviously that is quite complex. So for any quant, the MTP layer should probably be stripped, or mispredictions would lead to more work.
> Thanks, tried it and it's still worse. I will disable MTP for my setup. Not sure what causes the slowdown.
In your case, the most prominent issue is that your 8x 5090 rig runs slower than my 8x 4090, even without MTP enabled...
My rig runs at 53.9 tokens/s, but yours at 40.0 tokens/s.
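If it were my rig I would also double-check that every card negotiated its full PCIe link (just a guess on my side, nothing in your logs shows this), e.g.:

```bash
# Show the PCIe generation and lane width each GPU actually negotiated;
# a card stuck at Gen1 or x1/x4 on a riser will drag tensor-parallel decode down.
nvidia-smi --query-gpu=index,name,pcie.link.gen.current,pcie.link.width.current --format=csv
```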
> Thanks, tried it and it's still worse. I will disable MTP for my setup. Not sure what causes the slowdown.
> I had the same issue with speculative decoding on GLM-4.5V-FP8 (official quant) with an RTX Pro 6000 (https://github.com/vllm-project/vllm/issues/26838#issuecomment-3563172299), so I'm not sure if it's the quant itself that's problematic.
> Edit: Ah, but in this specific unofficial-quant case, the MTP layer would predict what the unquantized or official FP8 model would output; it would need to be retrained for the quant, and obviously that is quite complex. So for any quant, the MTP layer should probably be stripped, or mispredictions would lead to more work.
"mispredictions would lead to more work." at the least, this is not the case, as you can see from Avg Draft acceptance rate, which is comparable to that of the bf16 version.
"mispredictions would lead to more work." at the least, this is not the case, as you can see from Avg Draft acceptance rate, which is comparable to that of the bf16 version.
Ah good catch
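For what it's worth, a rough back-of-the-envelope from the numbers above (my own reasoning, not something vLLM reports directly): with num_speculative_tokens = 1 and a per-position acceptance rate of ~0.78, each draft-plus-verify step yields about 1 + 1 x 0.78 ≈ 1.78 tokens, which matches the logged mean acceptance length. So MTP only pays off if that combined step costs less than roughly 1.78x a plain decode step; if the MTP head's forward pass, or a kernel path it forces, is slow on a given card, you can see ~78% acceptance and still end up slower overall, which would fit the 25 vs. 40 TPS numbers on the 5090s.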
> I ran a test on my 8x4090(48GB) rig:
Will I be able to run it on a 2x RTX Pro 6000 rig (192 GB total VRAM)?
How do I quantize it to fit within that VRAM?
> Thanks, tried it and it's still worse. I will disable MTP for my setup. Not sure what causes the slowdown.
> In your case, the most prominent issue is that your 8x 5090 rig runs slower than my 8x 4090, even without MTP enabled...
> My rig runs at 53.9 tokens/s, but yours at 40.0 tokens/s.
I re-set up my rig with a proper chassis, all 8 PCIe links properly connected,
and updated to the latest vLLM nightly.
Now it seems to hit 40-50 TPS. I think vLLM is not yet fully optimized for the RTX 5090, but the speed and quality seem good.
Speculative decoding also works for me now, hitting up to 60-70 TPS. Not as good as yours, but happy enough.
Tried tp=8, going to
Just like to ask:
When you trained this model, did you expose the thinking together with the outputs? It would be nice to be able to conceal the thinking, though.
Latest result:
(APIServer pid=22840) INFO 01-03 17:19:18 [loggers.py:257] Engine 000: Avg prompt throughput: 264.8 tokens/s, Avg generation throughput: 37.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 1.3%, Prefix cache hit rate: 0.0%
(APIServer pid=22840) INFO 01-03 17:19:18 [metrics.py:100] SpecDecoding metrics: Mean acceptance length: 1.71, Accepted throughput: 3.02 tokens/s, Drafted throughput: 4.24 tokens/s, Accepted: 154 tokens, Drafted: 216 tokens, Per-position acceptance rate: 0.713, Avg Draft acceptance rate: 71.3%
(APIServer pid=22840) INFO 01-03 17:19:28 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 65.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 1.5%, Prefix cache hit rate: 0.0%
(APIServer pid=22840) INFO 01-03 17:19:28 [metrics.py:100] SpecDecoding metrics: Mean acceptance length: 1.67, Accepted throughput: 26.10 tokens/s, Drafted throughput: 38.90 tokens/s, Accepted: 261 tokens, Drafted: 389 tokens, Per-position acceptance rate: 0.671, Avg Draft acceptance rate: 67.1%
(APIServer pid=22840) INFO 01-03 17:19:38 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 63.8 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 1.8%, Prefix cache hit rate: 0.0%
(APIServer pid=22840) INFO 01-03 17:19:38 [metrics.py:100] SpecDecoding metrics: Mean acceptance length: 1.64, Accepted throughput: 24.90 tokens/s, Drafted throughput: 38.90 tokens/s, Accepted: 249 tokens, Drafted: 389 tokens, Per-position acceptance rate: 0.640, Avg Draft acceptance rate: 64.0%
(APIServer pid=22840) INFO 01-03 17:19:48 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 64.2 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 2.1%, Prefix cache hit rate: 0.0%
(APIServer pid=22840) INFO 01-03 17:19:48 [metrics.py:100] SpecDecoding metrics: Mean acceptance length: 1.65, Accepted throughput: 25.40 tokens/s, Drafted throughput: 38.80 tokens/s, Accepted: 254 tokens, Drafted: 388 tokens, Per-position acceptance rate: 0.655, Avg Draft acceptance rate: 65.5%
(APIServer pid=22840) INFO 01-03 17:19:58 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 63.8 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 2.3%, Prefix cache hit rate: 0.0%
(APIServer pid=22840) INFO 01-03 17:19:58 [metrics.py:100] SpecDecoding metrics: Mean acceptance length: 1.64, Accepted throughput: 24.90 tokens/s, Drafted throughput: 38.90 tokens/s, Accepted: 249 tokens, Drafted: 389 tokens, Per-position acceptance rate: 0.640, Avg Draft acceptance rate: 64.0%
(APIServer pid=22840) INFO 01-03 17:20:08 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 64.2 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 2.6%, Prefix cache hit rate: 0.0%
(APIServer pid=22840) INFO 01-03 17:20:08 [metrics.py:100] SpecDecoding metrics: Mean acceptance length: 1.65, Accepted throughput: 25.40 tokens/s, Drafted throughput: 38.80 tokens/s, Accepted: 254 tokens, Drafted: 388 tokens, Per-position acceptance rate: 0.655, Avg Draft acceptance rate: 65.5%
(APIServer pid=22840) INFO: 172.17.0.2:60184 - "GET /v1/models HTTP/1.1" 200 OK
(APIServer pid=22840) INFO 01-03 17:20:18 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 51.5 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.8%, Prefix cache hit rate: 0.0%
(APIServer pid=22840) INFO 01-03 17:20:18 [metrics.py:100] SpecDecoding metrics: Mean acceptance length: 1.65, Accepted throughput: 20.30 tokens/s, Drafted throughput: 31.20 tokens/s, Accepted: 203 tokens, Drafted: 312 tokens, Per-position acceptance rate: 0.651, Avg Draft acceptance rate: 65.1%
(APIServer pid=22840) INFO 01-03 17:20:28 [loggers.py:257] Engine 000: Avg prompt throughput: 695.1 tokens/s, Avg generation throughput: 46.3 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 3.1%, Prefix cache hit rate: 0.0%
(APIServer pid=22840) INFO 01-03 17:20:28 [metrics.py:100] SpecDecoding metrics: Mean acceptance length: 1.65, Accepted throughput: 18.30 tokens/s, Drafted throughput: 28.00 tokens/s, Accepted: 183 tokens, Drafted: 280 tokens, Per-position acceptance rate: 0.654, Avg Draft acceptance rate: 65.4%
(APIServer pid=22840) INFO 01-03 17:20:38 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 67.3 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 3.4%, Prefix cache hit rate: 0.0%
(APIServer pid=22840) INFO 01-03 17:20:38 [metrics.py:100] SpecDecoding metrics: Mean acceptance length: 1.74, Accepted throughput: 28.70 tokens/s, Drafted throughput: 38.60 tokens/s, Accepted: 287 tokens, Drafted: 386 tokens, Per-position acceptance rate: 0.744, Avg Draft acceptance rate: 74.4%
(APIServer pid=22840) INFO: 172.17.0.2:60198 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=22840) INFO 01-03 17:20:48 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 4.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=22840) INFO 01-03 17:20:48 [metrics.py:100] SpecDecoding metrics: Mean acceptance length: 1.82, Accepted throughput: 1.80 tokens/s, Drafted throughput: 2.20 tokens/s, Accepted: 18 tokens, Drafted: 22 tokens, Per-position acceptance rate: 0.818, Avg Draft acceptance rate: 81.8%
(APIServer pid=22840) INFO 01-03 17:20:58 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=22840) INFO: 172.17.0.2:56364 - "GET /v1/models HTTP/1.1" 200 OK
(APIServer pid=22840) INFO: 172.17.0.2:56372 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=22840) INFO 01-03 17:21:58 [loggers.py:257] Engine 000: Avg prompt throughput: 2.8 tokens/s, Avg generation throughput: 62.9 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.3%, Prefix cache hit rate: 0.0%
(APIServer pid=22840) INFO 01-03 17:21:58 [metrics.py:100] SpecDecoding metrics: Mean acceptance length: 1.77, Accepted throughput: 3.90 tokens/s, Drafted throughput: 5.07 tokens/s, Accepted: 273 tokens, Drafted: 355 tokens, Per-position acceptance rate: 0.769, Avg Draft acceptance rate: 76.9%
(APIServer pid=22840) INFO 01-03 17:22:08 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 68.4 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.6%, Prefix cache hit rate: 0.0%
(APIServer pid=22840) INFO 01-03 17:22:08 [metrics.py:100] SpecDecoding metrics: Mean acceptance length: 1.76, Accepted throughput: 29.50 tokens/s, Drafted throughput: 38.90 tokens/s, Accepted: 295 tokens, Drafted: 389 tokens, Per-position acceptance rate: 0.758, Avg Draft acceptance rate: 75.8%
(APIServer pid=22840) INFO 01-03 17:22:18 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 66.7 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.8%, Prefix cache hit rate: 0.0%
(APIServer pid=22840) INFO 01-03 17:22:18 [metrics.py:100] SpecDecoding metrics: Mean acceptance length: 1.71, Accepted throughput: 27.70 tokens/s, Drafted throughput: 39.00 tokens/s, Accepted: 277 tokens, Drafted: 390 tokens, Per-position acceptance rate: 0.710, Avg Draft acceptance rate: 71.0%
(APIServer pid=22840) INFO 01-03 17:22:28 [loggers.py:257] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 70.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 1.1%, Prefix cache hit rate: 0.0%
(APIServer pid=22840) INFO 01-03 17:22:28 [metrics.py:100] SpecDecoding metrics: Mean acceptance length: 1.80, Accepted throughput: 31.20 tokens/s, Drafted throughput: 38.79 tokens/s, Accepted: 312 tokens, Drafted: 388 tokens, Per-position acceptance rate: 0.804, Avg Draft acceptance rate: 80.4%
> I ran a test on my 8x4090(48GB) rig:
> Will I be able to run it on a 2x RTX Pro 6000 rig (192 GB total VRAM)?
> How do I quantize it to fit within that VRAM?
I don't think it's possible with vLLM: 192 GB isn't enough to cover the weights, let alone leave KV cache for context. GLM 4.6V is your best bet, IMHO.
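Rough arithmetic behind that (and I'm assuming this model is in the 200B+ parameter class, so treat it as a guess): FP8 weights take about 1 byte per parameter, so 200B+ parameters is already 200+ GB of weights before any KV cache or activation buffers, and vLLM additionally reserves a fraction of each GPU via --gpu-memory-utilization. Even a ~4-bit quant (~0.5 bytes per parameter plus overhead) would leave very little room for context on 192 GB.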
> I ran a test on my 8x4090(48GB) rig:
> Will I be able to run it on a 2x RTX Pro 6000 rig (192 GB total VRAM)?
> How do I quantize it to fit within that VRAM?
> I don't think it's possible with vLLM: 192 GB isn't enough to cover the weights, let alone leave KV cache for context. GLM 4.6V is your best bet, IMHO.
Edit: sorry! Thought it was m2.1 :)
Haha, I was about to ask how you managed to do that, as I am struggling a little with 256 GB of VRAM. Get one more Pro and run it with PP (pipeline parallelism), hehehe. It is a damn good model.