Hacker News | _345's comments

It's a seriously degraded experience from a developer's perspective. Okay, you've finally got one local LLM installed after configuring everything perfectly; what happens when you want to run a second instance? Now you've blown past your VRAM and system RAM limits, and you're stuck with just one.

Furthermore, the model they recommend doesn't quite reach ~gpt-5.4-mini level performance. That quality dip means you may as well just pay for something like Kimi K2.6 via OpenRouter if you want something ~>= Sonnet 4.6 in performance as a backup for when you run out of Anthropic/OpenAI usage.


Your point about caliber/quality is fair, but I have been pretty astonished by some of the newer/better models (Gemma 4 variants, GPT-OSS before that).

However, there's not a lot of memory increase from running multiple sessions in parallel with one model. It's an HTTP server, and other than some caching, it's basically stateless.


Doesn't llama.cpp (or similar) have to evict the kv cache for this, so that performance is degraded when running multiple sessions? Or how do you load a model in memory and then use it in multiple sessions? I am still learning this stuff

The model is loaded once and can be used for multiple sessions, and even parallel requests.
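
For what it's worth, here's a minimal sketch of what that looks like from the client side, assuming a single llama-server instance already running on localhost:8080 with its default OpenAI-compatible endpoint. Whether the two requests actually run concurrently or get queued depends on how many parallel slots the server was started with; this is just to show one set of loaded weights serving two logical sessions.

  # Two independent "sessions" hitting the same llama-server instance.
  # One copy of the weights in memory; only the per-request KV cache differs.
  import json
  import urllib.request
  from concurrent.futures import ThreadPoolExecutor

  URL = "http://localhost:8080/v1/chat/completions"

  def ask(session, prompt):
      body = json.dumps({
          "messages": [{"role": "user", "content": prompt}],
          "max_tokens": 128,
      }).encode()
      req = urllib.request.Request(URL, data=body,
                                   headers={"Content-Type": "application/json"})
      with urllib.request.urlopen(req) as resp:
          return session, json.load(resp)["choices"][0]["message"]["content"]

  with ThreadPoolExecutor(max_workers=2) as pool:
      jobs = [pool.submit(ask, "session-a", "Explain a KV cache in one sentence."),
              pool.submit(ask, "session-b", "Explain context checkpoints in one sentence.")]
      for job in jobs:
          name, text = job.result()
          print(name, "->", text)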

llama.cpp uses a unified KV cache that is shared between requests (whether they happen in parallel or not). As new requests come in, it will first evict branches that are no longer referenced, then move on to evicting the least recently used entries, and so on.

If you come back to a session that's been evicted, the prompt will just be processed again. This is only a problem on very long context sessions, but it can still be a problem for you.

So one way to reduce such evictions (and reduce KV cache size significantly as a bonus) is to reduce the number of KV cache checkpoints.

Checkpoints allow you to branch a session at any point and not have to recompute it from the start. If you find that you rarely branch a conversation, or if you rely entirely on a coding harness, then setting ctx-checkpoints to 0 or 1 will save tons of VRAM and allow more different sessions to stay in VRAM. This is especially true for models with very large checkpoints (such as Gemma 4).


There are so many flags to llama.cpp that I won't try to say anything too strong, but I believe the options related to `--kv-offload` mean you can have the KV cache in GPU VRAM, in regular system RAM, paged to disk, etc...

I'm on a Mac with unified memory, so I can't easily benchmark it for you, but I think a PC with 64GB of regular RAM and a 24GB gaming card could swap between multiple sessions without too much pain. The weights could stay resident on the GPU.

On the other hand, I did just dump some Project Gutenberg texts into a prompt, and building that cache in the first place was slower than I thought it would be.


Why are you running 2 instances anyway? If you want that workflow, just rent a few EC2 GPU instances and fire away.

If you're going to rent a few EC2 GPU instances, you might as well funnel things through OpenRouter. Not that many of us have workflows where trusting an LLM provider is a problem but sending the data to EC2 is not.

As for why, why would you not? Sitting around waiting for a single assistant is an inefficient use of time; I tend to have more like 4-10 instances running in parallel.


> Not that many of us have workflows where trusting an LLM provider is a problem but sending the data to EC2 is not.

I'd imagine plenty of people have a problem with trusting fly-by-night inference providers or model owners with opt-out policies [1] [2] about training on your data, but would be more than happy to send data to EC2, or even to run the same models in Amazon Bedrock.

[1]: https://github.blog/news-insights/company-news/updates-to-gi...

[2]: https://help.openai.com/en/articles/5722486-how-your-data-is...


I absolutely see no reason to send company IP, future plans, and current code base to any other company.

I also do not run 10 agents at the same time. There's no way I could keep up with the volume of work from doing that in any meaningful way


Nobody wants or needs your company IP, future plans, and current code base.

You don't run 10 agents to get more volume of work. You run 10 agents to get better quality work


Does your company self-host everything, though? Many are already in the cloud; why single out LLMs as the one thing not to use the cloud for?

I trust most cloud providers more than most LLM providers, but I still don't trust them much. Anything I can keep safeguarded on premises, I do.

I understand that most of the cloud providers run the LLMs on their own infra, like AWS Bedrock: https://aws.amazon.com/bedrock/pricing/

Not sure why you got downvoted. 95% of people should be paying for a subscription. It's far cheaper, far more scalable, and far less hassle.

Local AI only makes sense for a handful of use cases:

  - Privacy
  - Constant churning on tokens
  - Latency
  - Availability
Local AI is "cheaper" when you already have the hardware sitting around, like an old MacBook or gaming GPU, or when the API cost (subscriptions will all run out if you churn 24/7) is too high to bear. I'm surprised companies are still selling their old MacBooks to employees when they could be turning them into Beowulf clusters for cheap AI compute on long-running jobs (the cost is just electricity).

If usage-based pricing is killing your vibe, find a cheaper subscription with higher limits. Here's a list of them compared on price-per-request-limit: https://codeberg.org/mutablecc/calculate-ai-cost/src/branch/...


I think you're right about the cost/benefit trade-off in general, but I do wonder how much of the "compaction" Codex and Claude do is to keep context fresh and how much is to save (them) runtime costs.

If you've got a 1M token context, but they constantly summarize it down to something much smaller, is it really 1M tokens of benefit? With a local model, you can use all 256k tokens on your own terms. However, I don't have any benchmarks to know.


I think you might be a bit confused about compaction? The LLM API endpoint does not do compaction; it's an external agent harness that does it. And the Codex/Claude agents aren't constantly summarizing it down; they generally wait until you get within about 3/4 of the max context size.

Compaction doesn't save them money, it just makes it possible for you to continue a session. If you compact a session too many times, besides the fact that the model basically stops being useful, you eventually just cannot do anything else in the session because all the context is taken up by compaction notes. But if you don't compact it, pretty soon the session is completely unusable because it can't output any more tokens. You can disable compaction in those agents if you want to see the difference.

Also, using a lot of context can make the model perform poorly, so compaction can improve results. If you have a much larger context size, it means you have more headroom before the model starts to perform poorly (as it grows closer to max context size). A larger context also lets you do things like handle larger documents or reason over a larger amount of data without having to break it up into subtasks. Eventually we want models' context to get much bigger so we can do more things in a session. (Some research is being done to see if we can get rid of the limit entirely)
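
To make the mechanics concrete, here's a toy sketch of what a harness-side compaction step looks like conceptually. The token counting and the summarization hook are placeholders of my own, not Codex's or Claude Code's actual implementation; the 3/4 threshold just mirrors the behaviour described above.

  # Illustrative only: compact the conversation once it nears the context limit.
  MAX_CONTEXT = 256_000
  COMPACT_AT = int(MAX_CONTEXT * 0.75)  # roughly the 3/4 point mentioned above

  def count_tokens(messages):
      # crude stand-in: real harnesses use the model's tokenizer
      return sum(len(m["content"]) for m in messages) // 4

  def maybe_compact(messages, summarize):
      """Replace older turns with a summary note once the context gets too full."""
      if count_tokens(messages) < COMPACT_AT:
          return messages
      older, recent = messages[:-4], messages[-4:]   # keep the latest turns verbatim
      note = summarize(older)                        # one extra model call
      return [{"role": "system",
               "content": "Summary of earlier work: " + note}] + recent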


The LLM API endpoint does do compaction. OpenAI definitely supports server-side compaction, both explicit and automatic, and this is different from what could be implemented purely client-side: https://developers.openai.com/api/docs/guides/compaction (and there were rumors a few months ago on HN about how activation-preserving/latent it is, vs just summarization). Anthropic as well, in beta (new to me): https://platform.claude.com/docs/en/build-with-claude/compac...

The names for the pieces are confusing, so it's easy to talk past each other. For instance, you're saying "Codex the agent", which isn't a thing now. It's currently GPT-5.5, and at one point it was GPT-5.3-Codex, so when I say "Codex", I mean the macOS "harness". Similar for Claude Code vs Claude Opus/Sonnet.

Anyways, I don't know the specifics well enough to argue with you on anything, but there is a cost for input tokens, and you see/pay it when you use the API directly or through OpenRouter. Maybe you've looked at the leaked source for Claude Code and can tell me definitively otherwise, but Anthropic's and OpenAI's incentives for when to compact are not always aligned with the user's, depending on the pricing plan.


I recently set up a Gemma 4 heretic fine-tune on my MacBook, more to prove that I could than anything else, and it's probably around 4o levels of performance imo. Not fit for any real work. That said, the fact that 4o was frontier two years ago and today I can equal it on local hardware, uncensored, is pretty impressive.

> 95% of people should be paying for a subscription.

Subscription plans are the "first hit is free" plans. Real pricing once subscriptions are phased out in a year or two is gonna be orders of magnitude more.


Actually subscription plans will be here indefinitely. The cost of inference will only go down over time, and subscriptions are the end-game for all businesses as it's recurring revenue. Most subscribers don't use all the capacity, and there are limits imposed, so the financials work out. Same basic model as residential internet & mobile phones, but cheaper because there's an order of magnitude (or two) less support and maintenance.

There's no reason to buy a subscription instead of just paying per-token.

(With internet and phones there's a cost of leasing the channel you're using, so there are built-in subscription costs in any case.)


you've got a token addiction.

I've been experimenting with Hermes, and I'm convinced Hermes is also just bad. Like, as a harness it has got to be doing something to lobotomize these models: even GPT-5.4 performs badly in Hermes vs just using it in Codex.

If you're okay with Sonnet-level performance, this sounds like a straight upgrade. But I find that Sonnet messes up too much for it to be worth cost-optimizing down to it or another Sonnet-level model. Glad to have this as an option though.

A lot of people are having good experiences doing things like using Opus for design and locally hosted Qwen 3.6 for implementation.

I could see a serious cost-reduction story in using Opus for design and DeepSeek for implementation.

Personally I would avoid Anthropic entirely. But I get why people don't.


Like me: that’s what I do. Either Opus 4.7 or GLM 5.1 for planning, write it out to a markdown file, then farm it out to Qwen 3.6 27B on my DGX Spark-alike using Pi. Works amusingly well all things considered.
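
Not how Pi or my own agent actually does it, but here's roughly the shape of that workflow as a hypothetical script; both endpoints and model names below are made-up placeholders.

  # Sketch of the plan-with-a-big-model / implement-with-a-local-model split.
  # "planner.example" and the model names are invented; substitute your own.
  import json
  import urllib.request
  from pathlib import Path

  def chat(base_url, model, prompt):
      body = json.dumps({"model": model,
                         "messages": [{"role": "user", "content": prompt}]}).encode()
      req = urllib.request.Request(base_url + "/chat/completions", data=body,
                                   headers={"Content-Type": "application/json"})
      with urllib.request.urlopen(req) as resp:
          return json.load(resp)["choices"][0]["message"]["content"]

  # 1. The big hosted model writes the plan, saved as markdown for review.
  plan = chat("https://planner.example/v1", "big-planning-model",
              "Write an implementation plan for feature X as a markdown checklist.")
  Path("PLAN.md").write_text(plan)

  # 2. The local model works through the checklist one step at a time.
  for step in (line for line in plan.splitlines() if line.strip().startswith("- [")):
      print(chat("http://localhost:8080/v1", "local-model",
                 "Implement this step and reply with a unified diff:\n" + step))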

How are you interacting with GLM 5.1? Via the Claude Code harness? I really wish they'd release a fully multimodal model already.

Through Pi, mostly! Also my own for-fun agent I wrote

Yeah so would I, I do miss having vision tools sadly.


How is GLM 5.1? I haven't tried it yet but have been meaning to.

It's surprisingly good. Beats MiniMax 2.7 and Qwen 3.5 Plus in my testing (I haven't tested 3.6 Plus though), quite handily. It's far better than Sonnet, and often equivalent to Opus for the web development and OCaml tasks I'm using it for. It definitely isn't Opus 4.7, but it's good enough to earn its keep and is substantially cheaper.

I agree with this. And also: it uses more thinking time to get there. So while you get a lot of tokens on their plan, the peak 3x token usage multiplier + the extra thinking means you run into the rate limit anyway.

True, though with the $20 equivalent used for planning only, I don't hit those limits often, vs Claude, where Pro can literally hit its limits with a single prompt haha

Did you compare it with Kimi K2.6 and DeepSeek V4 Pro? I feel they're similar but as GLM is more expensive, I am not using it much.

I second this, glm-5.1 is incredible.

What hardware are you using to power this?

> DGX Spark-alike

Probably wasn't clear enough if you don't know what that is already, apologies

It's an Asus Ascent GX10, which is a little mini PC with 128GB of LPDDR5X as shared memory for an Nvidia GB10 "Blackwell" (kind of, it's a long story) GPU and a MediaTek ARM CPU


Ah yeah I saw that, I was just curious which particular mini-PC you were using. I was considering picking up one of the various AI Max 395 boxes before the RAMpocalypse but didn't take the plunge. Thanks for the response!

I heavily considered one of the AMD Strix Halo boxes, but part of the reason I wanted this was to learn CUDA :)

pulls up chair

could you tell me the long story?

edit: or wait, is it quasi-Blackwell the way all DGX Sparks are quasi-Blackwell? like the actual silicon is different but it's sorta Blackwell-shaped?


Yeah exactly. Shader model 121 is different to SM 120 (consumer Blackwell) and is different again to data centre Blackwell SM100.

The promise of this chip was “write your code locally, then deploy to the same architecture in the data centre!”

Which is nonsense, because the GB10 is better described as “Hopper with Blackwell characteristics” IMO.

Still great hardware, especially for the price and learning. But we are only just starting to get the kernels written to take advantage of it, and mma.sync is sad compared to tcgen05


I keep re-learning this lesson: I chug along with a lesser model then throw a problem at it that's too complex. Then I try different models until I give up and bring in Opus 4.6 to clean up.

It's not even that much cheaper: GPT 5.5 is only about 2x more expensive per task than DeepSeek V4 Pro when you adjust for its lower token usage, according to Artificial Analysis. Doesn't seem worth it to me.

Are we talking pay as you go API or vs plans?

Pay as you go API rates.

And I keep using Opus to, like, make git commits. Really just need a smart router that is actually smart, vs having to micromanage model choice

the problem is managing the contexts. your session might fit in Opus, but will it fit in the smaller model you dispatch the git commit to? even so, will it eat too much on prefill? do you keep compactions around for this, or RAG before dispatch or something? how do you button the response back up?

all doable but all vaguely squishy and nuanced problems operationally. kinda like harness design in general.
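
as a toy illustration of just the routing half (the context management is the genuinely hard part), something like the sketch below; the task names and model labels are made up:

  # Toy router: bounded, cheap tasks with small contexts go to a small model;
  # anything open-ended or context-heavy goes to the frontier model.
  CHEAP_TASKS = {"commit_message", "rename_symbol", "format_code"}

  def route(task, context_tokens):
      if task in CHEAP_TASKS and context_tokens < 8_000:
          return "small-local-model"    # cheap prefill, good enough output
      return "frontier-model"           # everything else

  print(route("commit_message", 2_000))    # -> small-local-model
  print(route("refactor_module", 40_000))  # -> frontier-model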


This is the problem: you need the best model, not just a good one, for:

  - Good architecture, which requires reading specs, code, etc. Reads like: lots of tokens in/out
  - Bug fixing: same, plus logs, e.g. Datadog

Once you've found the path, patches are trivial and the savings are tiny unless you're doing refactoring/cleanup.

testing gets more and more complicated. Take a look at opencode go, and you see this:

> Includes GLM-5.1, GLM-5, Kimi K2.5, Kimi K2.6, MiMo-V2-Pro, MiMo-V2-Omni, MiMo-V2.5-Pro, MiMo-V2.5, Qwen3.5 Plus, Qwen3.6 Plus, MiniMax M2.5, MiniMax M2.7, DeepSeek V4 Pro, and DeepSeek V4 Flash

and now you're on your own with the bugs all of these models can produce at scale. Am I missing anything in this picture? What is the real use of cheaper models?


I'd argue that you need the model that's good enough, not the best.

Any missed bug, any wrong architecture decision, is a huge loss. Sure, if you run it as autocomplete on steroids you can use any Chinese model. But if you try to move faster, and that is a conscious choice, any hiccup is a productivity loss and tons of tokens burned.

We're not yet at a point of saturation where all the frontier models are of somewhat comparable "intelligence" and we can decide which to use based on other factors (speed, effective context window, etc.), so I honestly don't see why you (as a company or an employee) would not use the best available model with the highest (or at least second-highest) thinking effort. The fees are not exactly cheap, but not that expensive either.

Agreed that we're not at saturation, but we don't have a canonical "best" either. For example ChatGPT 5.5 + Codex is, in my experience, vastly superior to Opus 4.7 + Claude Code at sufficiently well-specified Haskell, but equally vastly inferior at correctly inferring my intent. Deepseek may well have its own niche, though I haven't used it enough to guess what it might be.

I don't find this with Sonnet at all. As long as I have a solid Claude.md, periodically review the output, and enforce good code practices via basic CI gates, I've rarely found myself having to switch to Opus.

You might be surprised, then, at how well cheaper models solve your problems.

This has been my experience working on tsz.dev. Only Opus 4.7 and GPT 5.5 can really be productive for the remaining test cases.

> Also why light text on black background?

I'm really curious how you think this is worse than the other way around.


I don’t like “dark mode”. It’s blurry, makes me have to squint and hurts my eyes.

I think I like this article (I haven't finished it yet), but I don't think the bottleneck has shifted to the non-human with the advent of agentic AI. It's still the human (deciding what product direction to take, reviewing code, etc.).

Yes, this matches my observations. Task chunking, sprints, MVPs: a lot of this exists because it makes the human process simpler to go through, but that's not the limitation anymore. Deciding what to do, what will drive value, was always the most important thing, and it's now more critical than ever.

Correct. The constraint is the human’s ability to internalize and make tradeoffs. This will continue to shift and there will be fewer decisions that rely on humans relative to the work being completed, but the humans will still remain the constraint in many types of complex work for some time to come.

And also fuck scrum and purist agile too.


This is eye opening

What does it say for you?

this would work better if gemma 4 actually could tell what it was looking at

The page is hard to read on landscape browsers like Chrome on Win11.


We need more voices like this to cut through the bullshit. It's fine that people want to tinker with local models, but there has been this narrative for too long that you can just buy more RAM, run some small-to-medium-sized model, and be productive that way. You just can't: a 35B will never perform at the level of a same-generation 500B+ model. It just won't, and you're basically working with GPT-4 (the very first one to launch) tier performance while everyone else is on GPT-5.4. If that's fine for you because you get to stay local, cool, but that's the part that no one ever wants to say out loud, and it made me think I was just "doing it wrong" for so long on LM Studio and Ollama.


> We need more voices like this to cut through the bullshit.

Just because you can't figure out how to use the open models effectively doesn't mean they're bullshit. It just takes more skill and experience to use them :)


> We need more voices like this to cut through the bullshit.

Open models are not bullshit; they work fine for many cases, and newer techniques like SSD offload make even 500B+ models accessible for simple uses (NOT real-time agentic coding!) on very limited hardware. Of course, if you want the full-featured experience, it's going to cost a lot.


I fell for this stuff, went into the open+local model rabbit hole, and am finally out of it. What a waste of time and money!

People that love open models dramatically overstate how good the benchmaxxed open models are. They are nowhere near Opus.


There is absolutely a use case for open models... but anyone expecting to get anywhere near the GPT 5.x or Claude 4.x experience for more demanding tasks (read: anything beyond moderate-difficulty coding) will be sorely disappointed.

I love my little hobby aquarium though... It's pretty impressive what Qwen Coder Next and Qwen 3.5 122B can accomplish (in terms of general agentic use and basic coding tasks), considering that the models are freely available. (Also heard good things about Qwen 3.5 27B, but haven't used it much... yes, I am a Qwen fanboi.)


I actually asked chatgpt to recommend me a great starter tmux conf, and it gave me 80% of this blog post. Not an insult btw.

