"I am not sure how many people will run AI models locally. It still seems like a niche application to me. However, it will make decent machines to play video games."
I don't know who will be the winner but with some of the recent releases from gemma it seems more probable that you may run some models locally if only from a cost perspective, not even considering business security. Not sure how this type of architecture would make for good gaming though, puts into question the whole statement.
"Ranked in the top 2% of scientists globally (Stanford/Elsevier 2025) and among GitHub's top 1000 developers" - side note but this guy puts this everywhere, gives me probably the inverse of what he is marketing for.
"I am not sure how many people will run AI models locally. It still seems like a niche application to me. However, it will make decent machines to play video games..."
This is the 2026 edition of Ken Olsen:
"There is no reason anyone would want a computer in their home"
> This is the 2026 edition of Ken Olsen: "There is no reason anyone would want a computer in their home"
Digging into this:
> In conclusion, there is evidence that Ken Olsen did doubt the need for computers in the home, but the evidence is based primarily on the testimony of David Ahl who was perturbed when the personal computer project he championed at DEC was not supported by Olsen in 1974.
> Olsen’s resistance may have been similar to that expressed by another DEC executive, Gordon Bell. In 1980 Bell thought home terminals would act as gateways to remote computers which would provide appropriate services.
It was supposedly said in 1977: most computers at that time were not small, and so it would not be surprising that people would not expect the general public to desire a large, power-hungry, noisy apparatus in their house.
This is why I'm bearish on Anthropic, OpenAI, and friends. I am not confident that we will continue to see the same pace of improvement in frontier model capabilities as we have seen over the past year or two - not using similar mathematics at least. But I think that getting results that are close enough to the same standard to be a realistic substitute but in a model small enough to run locally may well happen quite quickly. And if it does - where is the moat to defend these AI organisations with their astronomical budgets when they're already starting to price more realistically and that's already killing a lot of the hype they've enjoyed until very recently? They have an accidental moat because they bought up the global supply chain for storage but that surely isn't going to last once the data centres to hold that storage are becoming liabilities.
If model performance asymptotes and CPU/GPU and RAM keep growing, even slowly, then eventually we will have frontier models on desktop that are totally competitive with hosted. It’s only a matter of time.
You already can if you’re willing to spend many thousands of dollars on a beast of a machine. I’m talking about middle tier desktops and laptops here. Maybe eventually even phones.
The only way hosted stays strongly competitive in that world is if they can keep pushing the frontier or by playing the classic social media and SaaS games of network effect building and integrations.
Many people might still use hosted, of course, but what I really mean is that their multiples won’t be justified and they will have little to no moat. AI will become commoditized, like a sophisticated next generation form of an encyclopedia with search.
> This is why I'm bearish on Anthropic, OpenAI, and friends.
Just because you can do more and more things at home (thanks Moore and Dennard), doesn't preclude needing things also done remotely. The number of at-home systems seems to have fed a growing number of remote systems (especially once always-on connectivity became ubiquitous).
It's basically the angle Apple is going for: do as much locally (for the sake of privacy), and then offload when it becomes "too much".
I agree that one doesn't preclude the other. But the sky high valuations we've been seeing for the AI industry recently can only be justified if they bring about a fundamental change in our society and those companies continue to bring in the lion's share of the resulting profits. I don't see why everyone else in our society - particularly other large businesses with lots of money to invest - is going to play a game by the AI companies' rules once they can take their ball and go home and still have most of the fun without paying much for it by comparison.
We kinda ended up with terminals connected to mainframes anyway. The terminal being the web browser, and the mainframe being SaaS. So it wasn't that far off.
People take these quotes out of context all the time. Said in a business context, there was no need, at that time, for someone to have a personal computer.
There's no business justification in 1977 for a personal computer department at a business. It's similar to the Gates quote about RAM (I think it was 64KB?).
These statements aren't meant to be forever quotes. They're business plan quotes.
That exact quote? No, never.
He said something like: current computers at the time had 64kb of RAM, so the OS was designed with a limit of 640kb, and he believed this would give them 10 years of future proofing. As it happened, that limit was reached much faster, in about 6 years.
He had a long career and presumably many successes, and is fallible like the rest of us. But a half-remembered zinger with no context makes for zippier posts I guess.
The early popularity of Minitel, the continued popularity of ssh/tmux, and the web browser itself indicate that bespoke client applications are not the only way. He wasn’t directionally wrong.
I will not be spending thousands on hardware to run the world's most mediocre LLMs at meh speeds. Sorry. I know LLM bros think every output made by an LLM is magic, like every NFT guy thought every NFT collection was game changing, but there's nothing useful you can do with LLMs and 128GB of RAM (and there never will be) unless you have LLM psychosis. Who cares.
Nothing isn't quite right but you wouldn't be using it like the hosted ones. 128gb is more than enough to run models to index my files and photos, denoise photos / AI photo masking, magic eraser type tasks for images, frame generation for gaming, etc.
Even for a lot of LLM type tasks, 128gb is likely more than enough to control a lot of PC configuration and automation with natural language.
Nobody ever said that, at least not as an assertion or prediction. The actual instances of similar language are from multiple people describing their earlier thoughts before they learned it wasn’t true.
It’s better, it’s useful even for those who don’t have a deep knowledge of computers. I’d expect more AI users than programmers, than ms-word users, than excel users.
Local models aren’t deterministically equivalent in capabilities to foundation models. Home computers are turing complete; just like a mainframe. They are just slower. Often not slower enough to matter.
Most people are ok with slower. An AI that lets you edit a family picture, in say 30 seconds, locally is preferable to one that is instantaneous but requires you to submit that picture to examination/storage/training/sale in someone else's AI ecosystem. If I want to crop my ex out of family photos, I should not have to first give that photo to Microsoft. If I want an LLM to write a book report for me, I don't want it also alerting my school. And if I write a memo for a client, and I want an LLM to check the spelling, I don't want that memo leaked either.
I'd like to think so but the existence of Google and Apple and Microsoft's cloud based photo tools with phone integration suggests that's false.
You could run a pretty good home server on $50 of gear and yet we never saw any real adoption of OwnCloud/NextCloud style products as an alternative to Google Drive/Photos or Apple Cloud.
Why should LLM/Transformers be any different? Especially when you need a proper expensive GPU to run them instead of a Raspberry Pi?
> Most people are ok with slower. An AI that lets you edit a family picture, in say 30 seconds, locally is preferable to one that is instantaneous but requires you to submit that picture to examination/storage/training/sale in someone else's AI ecosystem.
Maybe if you ask them that question, but if you show them two products, they'll definitely prefer the faster one. 30 seconds is a long time to watch a progress bar.
Plus there's the other question. If this thing is slower ... what's the price? The desktop/mini-pc version of this is $3000, after all. At this performance level what is an acceptable price for the laptops?
People definitely aren't going to accept more expensive + slower ...
Fast and public, or slow and private. Not everyone wants, or is allowed to, share their data with the AI world. And do not doubt that every bit shared with an AI service will be used for training.
The question here is about markets though. Not everyone wants x but if the vast majority of people want y, x is going to be niche and expensive.
You don't think the commercials for Google's AI photo features are going to have an impact on Apple users if their phones can only do a worse version of that feature and it takes longer?
It’s completely technically possible to have cloud services where customer data is opaque to the provider. Some of Apple’s services are like this already, for example.
I think there’s a sweet spot currently with munging your data blindly on the server so that your client device battery still lasts all day.
Meanwhile Apple and others push on with making client side models more efficient so that eventually the server costs and complexities go away.
If asked to choose between photo editing done within 3s using cloud provider vs an average of 30s using local compute, most consumers will choose the former without hesitation.
Most users' usage is also going to fall nicely in the free tier of a typical freemium pricing model, like ChatGPT today.
People who talk endlessly about local inference have no idea about user workflows and usability.
You may not, but experience shows that most people are just fine sharing the most personal stuff not only with cloud services, but with the whole world through anti-social media.
Qwen 3.6 is far ahead of Gemma for most (but not all) things. I've deployed it out across a number of M5 MacBooks and it's genuinely useful for many tasks. It won't replace an Opus or current gen Sonnet sized model but it's still amazingly good for its size and probably as good as or just a bit before Sonnet 4 era. Far more reliable for tool calling, coding, agentic tasks and faster than the Gemma models especially with MTP.
Qwen 3.6 is a toy compared to DeepSeek V4 Flash or Pro. These models can now run on Apple Silicon hardware with as little as 32GB RAM for the Flash (with 2-bit quant, which is still quite capable) using SSD offloading, with just-about-reasonable performance for interactive use, and far better performance on longer contexts than Qwen (due to the more efficient KV cache/attention mechanisms in DeepSeek).
Very significant improvements may be viable for unattended inference via large-scale batches, which can reuse sparse experts and thereby mask some of the latency involved - this is quite unique to DeepSeek, again due to its efficient KV cache.
1. Deepseek V4 is still in preview (training is not finished)
2. Qwen is much more demanding and borderline unusable on consumer hardware because it's a dense model. All 27B parameters are active all the time, for every token. It's not a MoE architecture where a router activates only some of them.
I have to disagree with most claims. I run Qwen3.6-27b at 260k context and 40-60 tok/sec. It handles most coding problems as well as Sonnet 4.6 under OpenCode on our production tasks. (As an experiment, I run the same prompts for the same issues in parallel for Qwen 3.6 and Sonnet 4.6 and usually see little difference in performance). I see zero degradation from quantization in practice.
Last time I tried running large MoEs on this PC, they had inferior quality at 2-3 bits compared to much smaller dense models at 5-6 bits, and were slower anyway.
A 260k context (close to the stock maximum for Qwen, though it's possible to extend it) will take ~16GB RAM for storing the KV cache, barring quantization tricks which severely degrade quality. That's a whole lot more than what DeepSeek requires for a similar context length, and makes it infeasible to batch multiple inferences together. This used to be the status quo for consumer inference, in fact it still is for models like Kimi and GLM (which can sometimes be smarter than even DeepSeek V4 Pro!) but we can also do better nowadays.
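For anyone who wants to sanity-check a figure like this, here is a rough sketch of the usual KV cache sizing arithmetic. The layer count, KV head count and head dimension below are illustrative placeholders, not values from any Qwen or DeepSeek model card, and the function name is made up for this example. Swap in the numbers from the config of the model you actually run.

```python
def kv_cache_bytes(context_tokens, attn_layers, kv_heads, head_dim, bytes_per_value=2):
    """Approximate bytes needed to store keys and values for one sequence.

    Counts only layers that keep a full per-token KV cache and assumes
    an unquantized 16-bit cache by default (bytes_per_value=2).
    """
    # Factor of 2: one key vector and one value vector per token, head and layer.
    per_token = 2 * attn_layers * kv_heads * head_dim * bytes_per_value
    return context_tokens * per_token


# Hypothetical configuration, chosen only to illustrate the formula.
size = kv_cache_bytes(context_tokens=260_000, attn_layers=16, kv_heads=4, head_dim=256)
print(f"{size / 2**30:.1f} GiB for one 260k-token sequence")
```

The cache grows linearly with context length and with the number of sequences held at once, which is why a large per-token footprint makes batching several inferences together expensive.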
Deepseek V4 Flash still has 13B active params though? That is about half as many as Qwen3.6-27B (and much more than Qwen3.6-35B-A3B). Given that RAM (even on a base M4 or 'regular' Intel/AMD system) is like an order of magnitude faster than an SSD, even Qwen 27B running from RAM will be much faster than any Deepseek V4 model with SSD offloading. And the MoE will be much faster still.
Qwen 27B is also small enough to completely fit in a high-end consumer or mid-end pro GPU, like an RTX 5090 or Radeon PRO R9700. I found results claiming 30 tokens per second generation for 27B(-Q4_K_XL) on an R9700. I doubt you get more than 5 tokens per second doing SSD MoE streaming.
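A back-of-the-envelope way to see where numbers like these come from: token generation is often limited by how fast the active weights can be read, so bandwidth divided by bytes read per token gives an optimistic ceiling. The bandwidth figures and bit widths below are assumptions picked for illustration (not measured specs of the R9700 or of any SSD), and the function name is hypothetical.

```python
def decode_tokens_per_sec_upper_bound(active_params, bits_per_weight, bandwidth_gb_s):
    """Optimistic ceiling on generation speed when weight reads are the bottleneck.

    Ignores compute, KV cache reads and any caching of weights in a faster tier.
    """
    bytes_per_token = active_params * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token


# Assumed: dense 27B at ~4.5 bits per weight, read from ~600 GB/s GPU memory.
dense_gpu = decode_tokens_per_sec_upper_bound(27e9, 4.5, 600)

# Assumed: 13B active params at 2 bits, every weight streamed from a ~6 GB/s SSD.
moe_ssd = decode_tokens_per_sec_upper_bound(13e9, 2, 6)

print(f"dense on GPU: ~{dense_gpu:.0f} tok/s, MoE streamed from SSD: ~{moe_ssd:.1f} tok/s")
```

This is the worst case for the MoE, since it assumes no experts are cached in RAM. A high cache hit rate changes the picture considerably.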
Even for relatively short contexts, I honestly already find the ~30B class MoE models to be only borderline acceptable in terms of speed on my laptop (Ryzen 7 7840U, 64 GB LPDDR5-6400), though I use Gemma 4 26B-A4B more than Qwen3.6 35B-A3B.
> even Qwen 27B running from RAM will be much faster than any Deepseek V4 model with SSD offloading.
If you have reasonable amounts of RAM to cache the most likely experts, that's not true at all. Qwen 27B is marginally faster on a nearly empty context, then falls behind as context length increases due to the different attention mechanisms. Prefill for Qwen is much faster, but you're still comparing vastly different model sizes and capabilities. DeepSeek Flash is the best deal overall.
> completely fit in a high-end consumer or mid-end pro GPU
Or you could fit the dense portion of a much more capable model and still take advantage of that hardware.
Is that how MoEs work? I thought that an important constraint for MoEs is that experts need to be uniformly used to make sure they can be used effectively. If there is a 'common subset' that, if anything, sounds like a symptom of undertraining (i.e. the same trick will not work as well for Deepseek V4.1).
Also, even if your MoE hitrate is 90%, you still spend half your time waiting for the SSD, giving similar total speed to a 27B model!
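The arithmetic behind that claim, as a small sketch. It assumes an expert read that misses the RAM cache costs about 10x a hit (the "order of magnitude" RAM vs SSD gap mentioned earlier in the thread) and that nothing overlaps the SSD wait. Both are simplifying assumptions, and the function name is made up for this example.

```python
def slow_tier_time_share(hit_rate, slowdown):
    """Fraction of total read time spent on cache misses.

    hit_rate: fraction of expert reads served from the fast tier (RAM).
    slowdown: how many times slower a miss (SSD) is than a hit.
    """
    fast_time = hit_rate * 1.0
    slow_time = (1.0 - hit_rate) * slowdown
    return slow_time / (fast_time + slow_time)


share = slow_tier_time_share(hit_rate=0.90, slowdown=10.0)
print(f"{share:.0%} of read time is spent waiting on the SSD")
```

Under those assumptions roughly half the time goes to the 10% of reads that miss, so throughput is about halved compared with everything sitting in RAM. With about half as many active parameters as a dense 27B, that lands in the same ballpark as the dense model.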
Finally, it looks like Deepseek V4 is pretty much only runnable with antirez's ds4, and SSD streaming only works with Metal; but I would like to try what you say with llama.cpp which uses mmap to also potentially do SSD streaming. (I can maybe try the large Qwen3.5 MoEs?)
> as context length increases
What kind of context length do you consider reasonable, though? From what I know, all models (even frontier ones) start degrading once you pass a few hundred thousand tokens. So realistically, limiting context size might even improve quality, especially if you use token-efficient harnesses.
> Or you could fit the dense portion of a much more capable model and still take advantage of that hardware.
Your point about consumer hardware was that it would be "borderline unusable" when running Qwen 3.6 27B. However, you need much less hardware to run a 27B than DSv4 Flash. In addition, you can do the same 'trick' with low-end GPUs and small MoEs: my desktop with 32 GB DDR4-3200 and an RTX 2070 8GB can run the ~30B class MoEs at 20-30 tokens per second and similar speeds to my laptop.
For any given workload/session? Empirically, yes, that's what has been found across different models. There's quite a bit of predictability that makes caching helpful.
> Also, even if your MoE hitrate is 90%, you still spend half your time waiting for the SSD, giving similar total speed to a 27B model!
There are ways of masking some of that latency, though it requires some architecture-specific cleverness which is less directly applicable to a generic engine like llama.cpp.
> Finally, it looks like Deepseek V4 is pretty much only runnable with antirez's ds4, and SSD streaming only works with Metal
The llama.cpp folks are working on adding support, and the ds4 project is working on CUDA support for streaming inference, targeting the DGX Spark.
> From what I know, all models (even frontier ones) start degrading once you pass a few hundred thousand tokens.
DeepSeek V4 seems to do quite well on recall tasks even with large context. That's one plausible benefit of its compressed attention mechanism, compared to earlier models. Some degradation will likely still be there, but it's not necessarily obvious.
As for why people are calling Qwen 27B "borderline unusable" that may have to do with it being a dense model which makes for an increased compute intensity and pushes users towards discrete GPU platforms, since those tend to have the most compute overall as far as consumer hardware is concerned. I might agree that Qwen 27B is quite ideally tailored towards these platforms, but that does come with some limitations.
I've got a Qwen 3.5 running on a 12GB 3060 and it's dumb as a stump but still smart enough to get some useful work done. Since it's my daily driver desktop I haven't jumped to 3.6, because the last time I did I quickly ran out of VRAM and locked up the desktop environment.
But yeah, the Qwen line is pretty impressive on commodity hardware.
I must be using LLMs very differently than y'all, because I can't think of a single thing I would rely on an LLM that's "dumb as a stump" to do for me.
To me, LLMs are for asking research questions + exploring design spaces + pointing at codebases to investigate bugs. And those all benefit from the model being as "smart" (in terms of both fluid intelligence and burned-in knowledge) as possible.
I'm guessing there exist problems where "intelligence past a certain point" doesn't matter, so these medium-sized models can match the performance of the bigger models. But what problems might those be?
Things that are tedious but simple but I'm unfamiliar with.
"Go add a gh action to compile and deploy this thing and run its tests" is one I've found it's good at. Yes I know how to make a gh pipeline but it's always a hassle to remember what goes where.
Cranking out unit tests is okay. It's good at summarizing things so it's not half bad at writing jsdoc/xmldoc comments.
> you may run some models locally if only from a cost perspective
I have a hard time believing running a model on a laptop will be cheaper than running it in a datacenter. Why wouldn't economies of scale apply here as with every other computation?
AI models will pretty undeniably affect your electricity bill; yes you already own the computer, but it will cost more to run it if it's doing inference!
To a point, but we're talking a laptop, not a server farm. Even if you're going fullbore wide open 24/7 that's about $150/yr in electricity bills at average rates. Not quite nothing but in terms of AI costs that's pretty close to rounding to zero.
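A quick sketch of how an estimate like that falls out. The 100 W sustained draw and the $0.17/kWh rate are assumptions chosen for illustration, not figures from this thread, and the function name is hypothetical. Plug in your own machine's draw and your local rate.

```python
HOURS_PER_YEAR = 24 * 365


def annual_electricity_cost(avg_watts, usd_per_kwh):
    """Cost in USD of running a device at a constant average draw for a year."""
    kwh_per_year = avg_watts / 1000 * HOURS_PER_YEAR
    return kwh_per_year * usd_per_kwh


# Assumed: laptop pinned at 100 W around the clock, $0.17 per kWh.
cost = annual_electricity_cost(avg_watts=100, usd_per_kwh=0.17)
print(f"${cost:.0f} per year")
```

A desktop with a discrete GPU under sustained load would draw several times more, so the same formula gives a proportionally bigger bill there.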
This is assuming that you'll be priced the fraction of computing that you consumed. But you are actually paying for their infrastructure, for the R&D (and also the computation that went into training the model) etc.
It is not clear that, for your own small computations, this kind of costs are needed, but you will still pay your share in the investment the provider made so that they could serve everyone's computation needs.
But, currently ... you're not. AI companies are operating at a loss, and are being subsidized by their investors.
Local may or may not be cheaper than remote now, depending on the details, but the factors you describe won't affect the math nearly as much as they will once that subsidization ends.
In that analogy bigtech AI is currently investing in cleaner air for all of us? We _could_ breathe it through their hose, but might as well breathe it outside.
The datacenter setting has huge economies of scale for low-latency, just-in-time inference using extremely large models, but that's not the only viable use of AI. Batched, unattended inference of possibly smaller and weaker models, while theoretically viable in a datacenter setting, is far from the best use of that hardware. This is where local AI is at its best.
A laptop is really a pretty bad form factor to run LLMs. Worst cooling, more expensive memory that you cannot replace, resale value depreciating fast. It’s fine for tinkering, small scale research, and demos but it’s definitely niche.
The vision NVIDIA is selling is pure marketing IMHO
Does it apply for every other computation? Purely for the computation part?
You can host all kinds of things locally cheaper right now than in the cloud, no? (At least pre memory price hikes.)
It does, of course, come with its downsides like availability/reliability, less convenience, scaling options, and so on, but purely the computing price - I don't see why it wouldn't be cheaper in the future - at least for some use cases.
What "every other computation"? I seem to have a lot of processing power at my disposal here, between my cell phones, laptops, gaming PCs, various other hardware devices.
You're going to need to analyze the problem much more deeply because it sounds like the standards you are implicitly applying would result in "economically, everything should be centrally hosted" but that is clearly not the result that obtains. Even a modern mid-grade cell phone is no slouch; you may not be running a current-gen frontier AI on it but you certainly can do a lot of other rather intense things locally that would have been laughable 10 years ago, like surprisingly high powered games.
I also don't get why this twitter user is linked here, versus all the news articles about this new hardware that have been everywhere over the past number of days.
I also dislike his self-promotion, but his work _is_ well known and, as far as I know, well regarded. I think he has more expertise and knowledge in this area than most (including what you'd find in the news).
Ahh, thanks, that explains things a little more. I wasn't familiar with the author, and his tweet just read like one of those people on Linkedin who regurgitates knowledge and passes it off as their own insight.
The security aspect is the main driver why I’m seeing so many businesses investing in local hardware. They know the models aren’t as good (caveat that they also can’t run Chinese models) and that’s ok. Places that really care about security and data governance already aren’t on the bleeding edge. They wait for the nice stable lts version, they lock down dev machines in frustrating ways and have lots of IT admin layers.
But they also want to taste the sweet fruit of AI so the only way to do this that a CISO will approve is on local air gapped hardware. It’s a niche but still a billion dollar niche.
I hope a family-level AI appliance is a thing later. Local non-cloud assistant that lives in the house, families interact via voice or phones or whatever. Knows the contextual family stuff you need, etc.
We didn't get people buying family-level file servers for the family photo gallery and documents at any real scale, so i doubt we'll see similar for AI especially when the cost is that much higher for GPUs vs an SBC machine.
because nas hardware and software suck and everything else was a poorly executed subscription product...i think one was called helm, another was by early twitter alumni. imagine a home device that manages and maintains itself and is a joy to interact with.
not automatically, but a meaningful step up in ease of use (managing photo/video backup from all family devices) without a subscription would be a solid foundation
Lots of people are already running AI locally. They are the people buying up all the consumer-grade Nvidia GPUs. What are they doing with them? Well, the same things people with home media or email servers are doing: stuff they don't want to share with the general public.
I want to reduce my dependency on companies like Google, OpenAI, and Anthropic. Aside from the concerns of data sharing I'm also not a fan of how they run their operations, for example Anthropic now using xAI's Colossus data center which is poisoning a marginalized community, or OpenAI getting in bed with the military.
Not everything I want to use an LLM for requires "PhD level intelligence", and increasingly I'm finding more uses that involve sharing my personal data.
Yesterday my local model helped me when looking for a doctor who is in-network for my insurance. I threw it a screenshot from the providers search results and it looked up reviews for all of them.
My local AI is currently upscaling an old British comedy from sub-DVD quality to 1k. (It is not available other than on DVD.) It looks like it will take about a week for my pair of 5060s to chew through the task.
I own the DVDs so I'm OK upscaling/editing my own copies for my own use. But if I ran the task on an ai service I would no doubt trigger copyright issues.
I suspect personal privacy and need to run AI workflows to handle the litany of administration tasks of a household will be what result in regular need for local AI.
Apple is already out front with this on a personal, individual level, but they are not obviously headed toward multiuser/family-level ~biz admin with a persistent server running local LLM.
You must be unaware that System76 was already selling 192GB machines, mac studios used to be 512GB max. The only reason why we don’t have them anymore is that we are in RAM shortage.
You assume I use a subscription. There are other options but they require more than 128GB unified RAM. You also assume a lot about how I work. And those final assumptions about what and how I think of others speak more about your anxieties rather than what I think.
You assume a lot. Sometimes it’s good to simply ask a question.
Those 192GB aren't unified memory though. 128GB on Mac or 395 can be used by both CPU and GPU. It's the GPU + large memory that opens up fast local LLM inference.
Yes, true. But if we had the ability to buy that much RAM in the laptop, everyone would be looking in that direction. Until this thing discussed here comes to the market, “we didn’t have computers with unified 128GB RAM either” (except for Macs).
He’s just a braggart. When you see something like this in somebody’s personal bio on social media, it’s basically a banner that means “take everything I say in the context of me promoting myself.”
> "Ranked in the top 2% of scientists globally (Stanford/Elsevier 2025) and among GitHub's top 1000 developers" - side note but this guy puts this everywhere, gives me probably the inverse of what he is marketing for.
Lol yeah seriously, that stinks of "I ask AI to generate a huge amount of bullshit and upload it to pad irrelevant stats".
I agree that it sends the wrong signal, but actually Daniel is great. He cares tremendously about doing work that is actually real-world useful. I've co-written a few papers with him, and he's really hard working and open to outside suggestions. The danger is that if you send him comments, he'll eventually manage to rope you into writing a new and improved version. Seriously, if you are a non-academic computer scientist with a good idea that you want to publish, he'd be incredibly open to working with you.
As to why he now has this on his blog? I also cringe when I read it. I presume someone told him he should self-promote more, and this is his lame attempt to do so. He's almost certainly the most cited person in his department, but it's entirely possible that none of his colleagues actually know this. Cut him some slack. Self-promotion is not his strength. He's a nerd's nerd, and not a marketer. I'll mention to him that his attempt here might be backfiring when I'm next in contact with him.
I kind of get it in the sense that every academic has to make themselves somewhat comfortable with self-promotion even if they don't like it. It's an important part of getting funding, but putting a blurb like that everywhere just hurts his credibility I think.
He's not a loser; he's done some really fun work that many people use daily. I've used his range mapping trick in multiple projects/papers. It's elegant.
It sounds like he's gotten bad advice about how to market himself /or/ this is being marketed to people who have bigger checks to write and whom he believes will be responsive to this kind of marketing. As an academic, it rubs me very wrong - I think it's detrimental to the field when we get into h-index stacking contests or citation count comparisons. But I don't know what incentives he's responding to, which seems important for putting this stuff in context.
(as an aside, it turns out that polars + fastexcel is about 10x faster than pandas + openpyxl for searching that dataset, if anyone else is curious what he was actually talking about. :)
I think the local-model use case is going to become less niche pretty quickly if the models keep getting smaller and more capable. Even if most people do not care about privacy or offline use, the cost argument is pretty strong