> The 1.8-bit (UD-TQ1_0) quant will run on a single 24GB GPU if you offload all MoE layers to system RAM (or a fast SSD). With ~256GB RAM, expect ~10 tokens/s. The full Kimi K2.5 model is 630GB and typically requires at least 4× H200 GPUs.
If the model fits, you will get >40 tokens/s when using a B200.
To run the model in near full precision, use the 4-bit or 5-bit quants. Any higher quant also works if you want extra headroom.
For strong performance, aim for >240GB of unified memory (or combined RAM+VRAM) to reach 10+ tokens/s. Below that it still works (llama.cpp can run via mmap/disk offload), but speed may fall from ~10 tokens/s to <2 tokens/s.
We recommend UD-Q2_K_XL (375GB) as a good size/quality balance. Best rule of thumb: RAM+VRAM ≈ the quant size; otherwise it’ll still work, just slower due to offloading.
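That rule of thumb can be sketched in a few lines of Python (a toy estimate, not an official sizing tool; sizes are the ones quoted in this thread, and real overhead like KV cache and context is ignored):

```python
# Toy check of the "RAM + VRAM ≈ quant size" rule of thumb.
def runs_without_disk_offload(quant_gb: float, vram_gb: float, ram_gb: float) -> bool:
    """True if the quant fits in combined memory, so llama.cpp
    shouldn't need to stream weights from disk via mmap."""
    return quant_gb <= vram_gb + ram_gb

# UD-Q2_K_XL (375GB) on a 24GB GPU + 256GB RAM box:
print(runs_without_disk_offload(375, 24, 256))  # False -> expect well under 10 tok/s
# Same quant with 512GB of system RAM:
print(runs_without_disk_offload(375, 24, 512))  # True
```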
I'm running the Q4_K_M quant on a Xeon with 7x A4000s and I'm getting about 8 tok/s with small context (16k). I need to do more tuning, I think I can get more out of it, but it's never gonna be fast on this suboptimal machine.
You can add one more GPU so you can take advantage of tensor parallel. I get the same speed with five 3090s with most of the model on 2400MHz DDR4 RAM, 8.5 tok/s almost constant. I don't really do agents, just chat, and it holds up to 64k.
That is a very good point and I would love to do it, but I built this machine in a desktop case and the motherboard has seven slots. I did a custom water cooling manifold just to make it work with all the cards.
I'm trying to figure out how to add another card on a riser hanging off a SlimSAS port, or maybe I could turn the bottom slot into two vertical slots. The case (Fractal Meshify 2 XL) has room for a vertically mounted card that wouldn't interfere with the others, but I'd need to make a custom riser with two slots on it to make it work. I dunno, it's possible!
I also have an RTX Pro 6000 Blackwell and an RTX 5000 Ada. I'd be better off pulling all the A4000s and throwing both of those cards in this machine, but then I wouldn't have anything for my desktop. Decisions, decisions!
Both making and removing regulation boosts my business, as my clients care about changes. That said, I assure you that one regulation getting made out of millions has no effect on my bottom line.
The drone thing is a personal opinion. If the US ends up in a war (whether it’s one I agree with or not, likely not), I don’t want millions of drones to be remote controllable by the folks we’re fighting.
I'm honestly much more worried about the fact that China has access to production lines for zillions of the things than what they'd do with existing ones, but I did make the comment so I'll run with it =).
Let's put on our fun James Bond villain hats for a bit.
The US has around 1.75MM drones that people have bothered to register. DJI has around 75% of that, so call it 1.25MM. This registration program is relatively new so let's say 750K of those are still operable.
How many of those are in the air at any given time? Keep in mind many of these bigger registered drones are used by businesses.
Let's say it's 1%, so 7,500 drones suddenly open some backdoor and get commanded to do a nose dive for the nearest power line. Now add in the smaller ones that are less likely to do damage, but there are 10x as many. Now combine it with a simultaneous cyber attack on infrastructure, and some pre-planned terror attacks.
Is it going to end the country? Of course not. Is there potential for that to cause huge chaos? I think so.
Is that more absurd than the Hezbollah pager bombings? I don't think so.
So yeah, I'd pay more for my drones, my cars, my cell phone towers, etc etc to avoid them being controlled by a country that we might end up in a stupid war with. I'm not saying you can make everything locally in the modern world, that's absurd. But there are valid strategic and natsec concerns about the US/China trade relationship in 2025.
> Is that more absurd than the Hezbollah pager bombings? I don't think so.
OMG, get serious. DJI can't blow up the drones. Mine is in my closet, not my pocket.
Again, this is just silly. Even for James Bond! ;p
It's more of a "We have to do something!!" reaction that reminds me of cities in California having moratoriums on building new housing -- who would have thought that people would want to live somewhere with a nice climate? But really it was about a new kind of neighbor...
Ratings are heavily criticized by artists, e.g. as being fueled by conservative moms. For example, in the USA, movies with guns and explosions can be shown to younger audiences than movies with nudity, which seems very illogical.
Also, some anecdotes: lots of my friends were into GTA as kids, i.e. early teens, and turned out fine. Comparing them to the kids who didn't do so well, I consider the most important factors to be family, education, and finances, not violent media.
That being said, I'm sympathetic to limiting internet access because of communication with strangers and extreme content (e.g. violent rhetoric that incites real action, as opposed to fantasy violence).
Okay. Society isn’t asking you to police how parents choose to parent. Not like this. It is reasonable for someone to want to be able to buy something advertised as having a certain feature without it being implemented with malicious deception. Nobody wants to have the “are video games good or bad?” debate again.
So many parental opinions on here. Not every kid is the same. Trying to apply blanket parental strategies speaks of ignorance. I have neurodivergent kids and this could be great for them.
I bought some ebooks from other vendors to avoid lock-in and side-loaded them onto my Kindle. Last year, if Amazon also sold one of those titles, my copy would disappear whenever I turned on wifi. I now have a Kobo.
I've done live demos of AI. Even with the same queries, I got different answers than in my four previous practice attempts. Demos keep me on my toes, and I limit the scope much more now.
(I didn't have control over temperature settings.)
> (I didn't have control over temperature settings.)
That's... interesting. You'd think they'd dial the temperature down to 0 for you before the demo at least. Regardless, if the tech is good, I'd hope all the answers are at least decent and you could roll with it. If not... then maybe it needs to stay in R&D.
Reducing temperature to 0 doesn't make LLMs deterministic. There are still other issues, such as floating-point results depending on the order in which you perform operations that are commutative and associative in exact arithmetic.
It gets more complicated with things like batch processing. Depending on where in the batch your query gets placed, how the underlying hardware works, and how the software stack was implemented, you might get small differences that compound over many generated tokens. (vLLM, a popular inference engine, has this problem as well.)
The associative property of multiplication breaks down with floating-point math because of rounding error. If the engine is multithreaded, it's pretty easy to see how the ordering of operations can change, which can change the output.
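The order-dependence described above is easy to see in plain Python with ordinary IEEE 754 doubles, no LLM needed:

```python
# Floating-point addition is commutative but not associative:
# regrouping the same three numbers changes how rounding accumulates.
a, b, c = 0.1, 0.2, 0.3
left = (a + b) + c    # 0.6000000000000001
right = a + (b + c)   # 0.6
print(left == right)  # False

# The same effect with large magnitude differences: evaluation order
# decides whether the small term survives at all.
print(1e16 + 1.0 - 1e16)  # 0.0 -- the 1.0 is lost to rounding
print(1e16 - 1e16 + 1.0)  # 1.0
```

A multithreaded reduction (e.g. summing dot-product partials across threads) effectively picks one of these orderings nondeterministically, which is how identical prompts can diverge.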
For me it was the lack of confirmation from the backend. Back when it was the next big thing, it sent changes to the backend without waiting for a response. That made the interface crazy fast, but I just couldn't take the risk of the frontend being out of sync with the backend. I hope they grew out of that model, but I never took it seriously for that one reason.
Yeah, I built my first startup on Meteor, and the prototype for my second one, but there were so many weird state bugs after the app got more complicated that we eventually had to switch back to normal patterns to scale it.
Requirements are listed.