Doing inference with a Mac Mini to save money is more or less holding it wrong. ...

mythz · 2026-02-11T16:02:39 1770825759

> Doing inference with a Mac Mini to save money is more or less holding it wrong.

No one's running these large models on a Mac Mini.

> Of course if you buy some overpriced Apple hardware it’s going to take years to break even.

Great, where can I find cheaper hardware that can run GLM 5's 745B or Kimi K2.5 1T models? Currently it requires 2x M3 Ultras (1TB VRAM) to run Kimi K2.5 at 24 tok/s [1] What are the better value alternatives?

[1] https://x.com/alexocheema/status/2016404573917683754

Gracana · 2026-02-11T20:35:43 1770842143

Six months ago I'd have said EPYC Turin. You could do a heck of a build with 12Ch DDR5-6400 and a GPU or two for the dense model parts. 20k would have been a huge budget for a homelab CPU/GPU inference rig at the time. Now 20k won't buy you the memory.

mythz · 2026-02-12T01:03:53 1770858233

Not VRAM? What performance are people getting running GLM or Kimi on DDR5?

Gracana · 2026-02-12T18:06:49 1770919609

It's important to have enough VRAM to get the kv cache and shared trunk of the model on GPU, but beyond that it's really hard to make a dent in the pool of 100s of gigabytes of experts.

I wish I had better numbers to compare with the 2x M3 Ultra setup. My system is a few RTX A4000s on a Xeon with 190GB/s actual read bandwidth, and I get ~8 tok/s with experts quantized to INT4 (for large models with around 30B active parameters like Kimi K2.) Moving to 1x RTX Pro 6000 Blackwell and tripling my read bandwidth with EPYC Turin might make it competitive with the the macs, but I dunno!

There's also some interesting tech with ktransformers + sglang where the most frequently-used experts are loaded on GPU. Pretty neat stuff and it's all moving fast.

Gracana · 2026-02-14T19:41:09 1771098069

There's a reddit comment here https://www.reddit.com/r/LocalLLaMA/comments/1r4m4it/comment... that says:

my system is running GLM-5 MXFP4 at about 17 tok/s. That’s with a single RTX Pro 6000 on an EPYC 9455P with 12 channels of DDR5-6400. Only 16k context though, since it’s too slow to use for programming anyway and that’s the only application where I need big context.