40% faster!


Wow!
100 TPS: ernie-4.5-21b-a3b-pt-mls by mlx-community
140 TPS: nightmedia/ERNIE-4.5-21B-A3B-Thinking-mxfp4-mlx by Mr Nightmedia

It does seem to default to Chinese, easily fixed with "answer in English" or adjusting the Jinja template.
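If you'd rather pin it in code than edit the template, here is a minimal sketch with mlx-lm's Python API. The system-message wording is just my choice, and I'm assuming the Thinking repo linked above (and that its chat template accepts a system role):

```python
# Sketch: steer ERNIE to English with a system message instead of
# editing the Jinja template. Assumes `pip install mlx-lm`.
from mlx_lm import load, generate

model, tokenizer = load("nightmedia/ERNIE-4.5-21B-A3B-Thinking-mxfp4-mlx")

messages = [
    {"role": "system", "content": "Answer in English."},
    {"role": "user", "content": "Summarize what MXFP4 quantization does."},
]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
print(generate(model, tokenizer, prompt=prompt, max_tokens=256))
```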

Thanks for this; it does not happen on my end, so it must be context-specific.

I am uploading a qx64-hi version of Ernie for testing, to see if it still has the language issue. I suspect there is some loss with mxfp4 that can't be avoided. The qx64-hi is only 10% slower on my machine. That's a 4-bit model with 6-bit attention paths and an 8-bit head, all with group size 32.
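For the curious, the recipe is roughly this kind of mixed predicate with mlx-lm's convert. The layer-name matches below are illustrative guesses, not the exact formula I use, and the source repo id is an assumption:

```python
# Sketch of a qx64-hi-style mix: 4-bit base, 6-bit attention paths,
# 8-bit head, group size 32. Layer-name matches are guesses.
from mlx_lm import convert

def qx64_hi(path, module, config):
    if "lm_head" in path:
        return {"group_size": 32, "bits": 8}   # 8-bit head
    if any(p in path for p in ("q_proj", "k_proj", "v_proj", "o_proj")):
        return {"group_size": 32, "bits": 6}   # 6-bit attention paths
    return {"group_size": 32, "bits": 4}       # 4-bit everywhere else

convert(
    "baidu/ERNIE-4.5-21B-A3B-PT",  # assumed source repo
    mlx_path="ernie-4.5-21b-a3b-pt-qx64-hi",
    quantize=True,
    quant_predicate=qx64_hi,
)
```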

I have to check out and test the Unsloth version of Ernie; they usually fix stuff, and that might be the issue.

I will compare the Jinja templates, and maybe that can be fixed in place.

edit: nope, that wasn't it. Weird. I am looking at the traces; Ernie is not that bad at qx86-hi, and that clocks at 86 tok/sec on my box. Let me know if you want to try the high-res version and I will upload the qx86-hi as well.

I did not know whether the qx quants would work on Ernie at all, since it has a different geometry than the Qwens, but it seems they do.

One weird thing I noticed about Ernie: it struggles to stay under 8k tokens of output, as if its life depended on it, and that includes the mxfp4. This affects the output, but the qx64-hi doesn't have this restraint and codes away twice the amount of output.

Yes, Ernie Thinking in MXFP4 is very fast, but the speed gains are lost on thinking.
A little dummy, overthinking without enough world knowledge.

Your excellent quant of qwen3-30b-a3b-yoyo-v2 in MXFP4 is a little slower but worth the wait.

My favorite is still: unsloth-glm-4.5-air-qx64-mlx
You got some magic in there.

Thank you very much, I am not the only one who sees it :)

I was considering writing a paper on this, but everybody does that nowadays and it usually turns out to be a dud, so why bother.

The magic here was to "copy" a photo lens design, the Nikon Noct 58mm f/0.95, and apply it to an LLM (the Deckard formula = qx).

Once the idea showed significant, stable, and repeatable results in the metrics, it was refined and adapted to fit MoE and dense models.

I was quite surprised to see it work on GLM Air

I am now applying it, slightly modified, to the Qwen3-80B series; the 80B has a different taste for bits.

Wow, a very creative way to look at the attention problem, using a classic camera's optics for inspiration: focus.

Qwen3-80B doesn't understand it :-D
Unsloth-glm-4.5-air-qx64-mlx... gets it and explains it well.

FYI, Ernie is useful for quick answers because it is less censored than Qwen.

I just did a quant series of the YOYO V3, and it has an "auto-think mode" that works fairly well.

Not as strong as the V2, but it seems excellent for RP, as it "falls in character" incredibly well.

The 80B qx3 does have a sense of humour though. Here is what my qx formula does:

I asked:

"What if Qwen would release an 80B sized MoE model, what do you think the acceptance would be for this size? Would it be appropriately called LizzoCoder because it doesn't fit in one seat (64 GB Mac)?"

✅ Verdict: Yes, “LizzoCoder” would absolutely go viral as a meme, and if Qwen officially used it as a codename? They’d be geniuses.

🎤 “It’s not a 64GB Mac — it’s a LizzoCoder. And honey, she slays.”

🏁 Final Answer
An 80B MoE from Qwen would be a landmark release — widely adopted, technically impressive, and culturally iconic.

And yes — “LizzoCoder” is not just a joke… it’s the perfect nickname.
Let’s hope Qwen names it that — or at least lets us call it that. 🎶💻

(P.S. If they do, I’ll start a fan club.)
Generation: 28.499 tokens-per-sec
Peak memory: 45.166 GB

At least the qx3 does fit on a 64GB Mac, barely.

qwen3-next-80b-MXFP4 is 42GB, should fit with lots of room for context.

Have you tried adjusting the Mac memory limit?
sudo sysctl iogpu.wired_limit_mb=500000

The Unsloth GLM 4.5 IQ2_XXS is 116GB and runs well on a Mac M4 with 128GB, at about 20 tokens per second. Setting the K & V cache to Q8 helps with memory.
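In case it helps anyone reproduce this, here is a minimal llama-cpp-python sketch of the Q8 cache setup. The filename is a placeholder, and note that a quantized V cache needs flash attention turned on:

```python
# Sketch: Q8_0 K/V cache to cut memory on a big GGUF.
# Model path is a placeholder; assumes `pip install llama-cpp-python`.
from llama_cpp import Llama, GGML_TYPE_Q8_0

llm = Llama(
    model_path="GLM-4.5-Air-IQ2_XXS.gguf",  # placeholder path
    n_gpu_layers=-1,         # offload all layers to Metal
    n_ctx=16384,
    flash_attn=True,         # required for a quantized V cache
    type_k=GGML_TYPE_Q8_0,   # 8-bit K cache
    type_v=GGML_TYPE_Q8_0,   # 8-bit V cache
)
print(llm("Hello", max_tokens=32)["choices"][0]["text"])
```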

Yeah, I usually do, but that's really on the edge, and you have enough context for "one good question".

I am now carefully experimenting with different layer combinations for the 80B. It's slow, but now and then you hit on something very nice. Not especially smart, but nice; an entertaining quant.

So far I've noticed that the qx formula carries over pretty well to the 80B, with probably a few adjustments needed on key layers. I waited for the tools to stabilize before experimenting, but in the long run I wanted to get to a formula similar to the gpt-oss, where the data is mxfp4 and everything else is customizable with fixed bits.

It turns out that works really well in the recent quant I made of the unsloth-gpt-oss-120b, in qx86-mxfp4. This preserved the mxfp4 layers as they were and only quantized the attention paths. So, essentially, Deckard is now alive in GPT-OSS-120B; it needs to be tested and the metrics validated.
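Conceptually (this is just the shape of the idea, not my exact pipeline), the predicate opts the expert layers out so their mxfp4 weights pass through untouched, and only the attention paths and head get the fixed-bit treatment; it plugs into convert as quant_predicate, like the earlier sketch:

```python
# Sketch: leave the mxfp4 expert layers as-is, re-quantize only the
# attention paths and head. Names and bit split are illustrative guesses.
def qx86_keep_mxfp4(path, module, config):
    if "experts" in path:
        return False                            # keep existing mxfp4 weights
    if "lm_head" in path:
        return {"group_size": 32, "bits": 8}    # 8-bit head
    if any(p in path for p in ("q_proj", "k_proj", "v_proj", "o_proj")):
        return {"group_size": 32, "bits": 8}    # high-bit attention paths
    return {"group_size": 32, "bits": 6}        # 6-bit for the rest
```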

Once I have those results, I will be happy. Either way.

I am not sure about Qwen 80B in any quant; I did notice the MXFP4 and QX64 run at about the same TPS, yet the QX64 is 10GB bigger. Curious.
Mostly I prefer your delicious Unsloth-glm-4.5-air-qx64-mlx for serious stuff; it's the gold standard. I hope more Big Mac users find your best work.

FYI, this model seems useful and runs at over 250 TPS:
https://huggingface.co/mlx-community/Ling-mini-2.0-4bit

Ha, I quanted it, but it never made it into the upload queue :)

So many new things. Now the Ling-flash-2.0 came out too, but their layer definition is messed up and I can't quant it. I'll wait till they fix it.

The mix of qx64-mxfp4 in the gpt-oss turned it into a poet. Another one.

The 80Bs are definitely "tuned to be social". All quants are very friendly; there is a lot of secondary inference that just makes sure the user is happy. Some people don't like that much sugar, and that's probably why. On coding, I agree with you: Air is supreme.

update: I checked the Ring/Ling again; the MLX tools have a beef with their layer definitions. The quants I thought I had were broken. Too tired to look at it today; I will check it out tomorrow.
