I’m proud to say that I’m still using it, connected through Spotify! 23 years and counting, and it has captured pretty much 100% of my listening activity. It’s a gimmick, but a nice gimmick to have, and it lets me look up how my music taste has changed over time.
I think last.fm’s radio is actually better than Spotify’s; I still use it to discover new music.
It’s sad that the acquisition happened back then. They had huge momentum, and all of it was lost instantly after the acquisition. Pretty much a textbook case of how not to do it.
That’s really about creating an MVP for a startup, because too many founders stay in a cave trying to make it “perfect” before collecting valuable user feedback.
This does not apply to Cloudflare, and especially not to an auth token that has to be published on your website and cannot be restricted.
Also, your local hardware is in no way capable of running the kinds of models that cloud providers run; it’s just not economically feasible, and it never will be.
SanDisk has designed a flash-based equivalent of HBM with 1.6 TB/s of bandwidth. I expect it will initially be available only to server manufacturers, but once supply ramps up it will be built into individual machines. At that point it will be practical to run local inference on much larger models. Of course, the SOTA providers may find some way to use even larger ones, but the returns to scale seem smaller than they used to be.
Very much dependent on the situation. For many business tasks, local hardware is good enough. But what a lot of folks overlook when saying these things is that (a) workers do more than run AI models on a piece of hardware, (b) a lot of computer hardware already sits idle outside normal work hours, when it could be running batch jobs, and (c) employees can share local hardware.
Depends on what you mean by "economically feasible".
Even very cheap mini-PCs and laptops can run any of the models run by cloud providers, albeit at a much lower speed (i.e. with the weights stored on SSDs).
Whether such a low speed is useful depends on the application. For something like a coding assistant or bug scanning, an instant response is desirable but certainly not necessary.
The SSD would wear out in days while the laptop generates two responses a day. This is like saying you could power your home with AA batteries: yes, technically you could, but in practice it’s entirely infeasible.
There is no wear on the SSDs, because the weights are just read, they are not written during inference.
For model training, the requirements are very different, and training a big LLM cannot be done with home equipment. Inference, on the other hand, can be done on almost any PC, even for LLMs with trillions of parameters, just very slowly.
The only problem is that inference becomes limited by SSD read throughput. Most cheap new personal computers available today can read from only 2 SSDs simultaneously (any additional SSDs share a data path with them), typically 1 PCIe 5.0 SSD and 1 PCIe 4.0 SSD. This combination has an upper throughput limit of about 24 GB/s, with 15 to 20 GB/s achievable in practice.
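The ~24 GB/s ceiling follows from the PCIe link rates. Here is a quick sketch of that arithmetic, assuming x4 links and 128b/130b encoding and ignoring protocol overhead (the helper name is made up for illustration). Real drives land below these link ceilings.

```python
# Sketch: theoretical per-direction payload bandwidth of an NVMe link,
# assuming 128b/130b line encoding (PCIe 3.0 and later) and ignoring
# protocol overhead such as packet headers and flow control.

TRANSFER_RATE_GT_S = {"3.0": 8.0, "4.0": 16.0, "5.0": 32.0}  # per lane


def link_gb_per_s(gen: str, lanes: int = 4) -> float:
    """Raw payload bandwidth in GB/s after 128b/130b encoding."""
    gt_s = TRANSFER_RATE_GT_S[gen]
    # One bit per lane per transfer; 128 of every 130 bits carry data.
    bits_per_s = gt_s * 1e9 * lanes * 128 / 130
    return bits_per_s / 8 / 1e9


combined = link_gb_per_s("5.0") + link_gb_per_s("4.0")
print(f"PCIe 5.0 x4: {link_gb_per_s('5.0'):.2f} GB/s")
print(f"PCIe 4.0 x4: {link_gb_per_s('4.0'):.2f} GB/s")
print(f"Combined ceiling: {combined:.2f} GB/s")
```

With these assumptions the two links add up to roughly 23.6 GB/s, which matches the ~24 GB/s figure.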
The speed in tokens/s is then limited by the amount of weight data that must be read per inference cycle. The ratio of output tokens to weight data read can be improved by various methods, such as batching multiple tasks or using speculative decoding.
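A rough upper bound on decode speed can be sketched by dividing read throughput by the bytes of weights read per decoding pass, then multiplying by how many tokens each pass yields. The function name, the 4-bit (0.5 bytes/param) quantization and the 40B-active-parameter model are all hypothetical, and compute and KV-cache traffic are ignored.

```python
def decode_tokens_per_s(
    bytes_per_param: float,
    active_params: float,
    read_gb_per_s: float,
    batch_size: int = 1,
    tokens_per_pass: float = 1.0,
) -> float:
    """Upper bound on generated tokens/s when decoding is bound by weight reads.

    Assumes each decoding pass streams all active weights once, and that the
    whole batch shares that single read. This holds for a dense model; with
    mixture-of-experts, different sequences may hit different experts, so the
    batching gain is smaller. tokens_per_pass > 1 models speculative decoding,
    where one verification pass accepts several draft tokens on average.
    """
    bytes_per_pass = active_params * bytes_per_param
    passes_per_s = read_gb_per_s * 1e9 / bytes_per_pass
    return passes_per_s * batch_size * tokens_per_pass


# Hypothetical model: 40B active params at 4-bit, read at 18 GB/s.
single = decode_tokens_per_s(0.5, 40e9, 18.0)
batched = decode_tokens_per_s(0.5, 40e9, 18.0, batch_size=4)
speculative = decode_tokens_per_s(0.5, 40e9, 18.0, tokens_per_pass=2.5)
print(f"single {single:.2f}, batched {batched:.2f}, speculative {speculative:.2f} tok/s")
```

Batching raises aggregate throughput across tasks, while speculative decoding raises the speed of a single task; both work by getting more tokens out of the same weight read.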
Faster SSD access improves performance more than extra RAM does, at least until the whole model is cached in RAM. So older, cheaper HEDT (high-end desktop) platforms with lots of PCIe lanes for attaching storage are the best fit for this approach.
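To see why faster storage helps more than extra RAM until the whole model fits, here is a sketch of the time needed to stream all weights once when part of the model is cached in RAM. All sizes and bandwidths are hypothetical round numbers, and the sketch assumes RAM and SSD reads do not overlap.

```python
def seconds_per_pass(model_gb: float, ram_cache_gb: float,
                     ram_gb_per_s: float, ssd_gb_per_s: float) -> float:
    """Time to read all weights once with part of them cached in RAM.

    Assumes the cached part always holds the same bytes and is read from
    RAM, while the rest is read from SSD, one after the other.
    """
    cached = min(model_gb, ram_cache_gb)
    uncached = model_gb - cached
    return cached / ram_gb_per_s + uncached / ssd_gb_per_s


# Hypothetical 400 GB model, 64 GB RAM cache, 60 GB/s RAM, 15 GB/s SSD.
base = seconds_per_pass(400, 64, 60, 15)
more_ram = seconds_per_pass(400, 128, 60, 15)   # double the RAM cache
faster_ssd = seconds_per_pass(400, 64, 60, 30)  # double SSD throughput
print(f"base {base:.1f} s, 2x RAM {more_ram:.1f} s, 2x SSD {faster_ssd:.1f} s")
```

With these numbers, doubling SSD throughput roughly halves the time per pass, while doubling the RAM cache saves only about 14%. Once the cache holds the entire model, SSD speed stops mattering.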
The difference between datacenter hardware and cheap personal hardware is not in what can and cannot be run.
Anything can also be run on a cheap computer.
The difference is in speed. A cheap computer may run a big model up to a few orders of magnitude slower than datacenter hardware, depending on whether the LLM fits in GPU memory, fits only in CPU memory, or is so big that it must spill onto SSDs.
Depending on the application, the tradeoff between run time and run cost may favor local hardware, despite the much slower speed.
There are plenty of applications where running them overnight at negligible cost can be preferable to getting faster results at a very high price. One example is scanning a mature code base for bugs with a large number of different open-weights LLMs, which can achieve bug coverage similar to that of a single overpriced and unavailable SOTA LLM, e.g. Mythos.
This kind of thing can certainly be run locally, even on a small mini-PC like a NUC, or even on a laptop, with the weights stored on SSDs.
As I have said, the problem is not that they cannot be run, but that they may run more slowly than is acceptable for a given application. Depending on the model, reported inference speeds with weights stored on SSDs range from one token every few seconds to, at most, a few tokens per second.
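For a sense of scale, here is a sketch of the overnight token budget at speeds in the range quoted above, assuming a hypothetical 10-hour unattended window:

```python
def overnight_tokens(tokens_per_s: float, hours: float = 10.0) -> int:
    """Tokens generated in an unattended window at a constant decode speed."""
    return round(tokens_per_s * hours * 3600)


# From one token every 4 seconds up to 3 tokens per second.
for rate in (0.25, 1.0, 3.0):
    print(f"{rate:>5} tok/s -> {overnight_tokens(rate):,} tokens")
```

If the window is split across several models, as in the bug-scanning example, each model gets a proportional share of that budget.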
Computers could solve relatively huge problems even in the early days of vacuum-tube computers, when main memories were measured in kilobytes. At that time nobody expected the data needed to solve a problem to fit in main memory, or even in the next tier of memory (magnetic drums or magnetic disks); the really big problems were solved by making many passes over data stored on magnetic tape.
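The same multi-pass idea applies to weights stored on SSD: working memory only needs to hold one layer at a time. Below is a toy, stdlib-only sketch of that streaming pattern. Each "layer" is a small dense matrix followed by ReLU; this is not an LLM, and the file layout and function names are made up for illustration.

```python
import array
import os
import tempfile


def write_layers(path, layers):
    """Store each n x n weight matrix (row-major float64) back to back."""
    with open(path, "wb") as f:
        for matrix in layers:
            array.array("d", [w for row in matrix for w in row]).tofile(f)


def streamed_forward(path, x, num_layers):
    """Apply num_layers linear+ReLU layers, holding only one layer in memory."""
    n = len(x)
    with open(path, "rb") as f:
        for _ in range(num_layers):
            weights = array.array("d")
            weights.fromfile(f, n * n)  # sequential read of exactly one layer
            x = [sum(weights[i * n + j] * x[j] for j in range(n)) for i in range(n)]
            x = [max(0.0, v) for v in x]  # ReLU between layers
            # `weights` is replaced on the next iteration, so memory stays bounded.
    return x


layers = [[[1.0, 0.5], [0.0, 2.0]], [[0.5, 0.0], [1.0, -1.0]]]
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "weights.bin")
    write_layers(path, layers)
    result = streamed_forward(path, [1.0, 2.0], len(layers))
print(result)
```

Peak memory is one layer plus activations, no matter how many layers the file holds; the cost is that every forward pass rereads the whole file, which is exactly the throughput limit discussed above.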
An LLM whose inference could not be run on a small mini-PC would have to be one hundred times bigger than the biggest existing SOTA LLMs.
Any LLM that exists today can be run on almost any PC, just extremely slowly in comparison with datacenter hardware.
Whether something is "impractical" depends on your expectations. High-latency unattended inference is definitely viable, even though it doesn't align much with what's being run in hyperscale datacenters.
I think the comments in this thread are a bit negative. However, the Newton has nothing to do with Apple now. Or the last decade. Or the last 20 years. It's touching on 30+ years post-launch now. Pointing at an "early idea" from 1993 picks out the exception rather than the rule.
Products such as the iPod and then the iPhone were as the parent poster describes. Both the iPod and the iPhone were successors to similar devices already on the market. It was how they were presented, packaged, and tailored that made them special and unique. Yet these devices also launched roughly two decades ago.
In the tech world, a few years is a long time, let alone 20 or 30.
I'd say Apple is barely innovative now, and further, their 'early ideas' are long, long, long gone.
This is why it's such a shame that their products aren't as polished as they used to be. They still have a very strong capacity to do this, and I wish they would. It's a great market, and it's what a lot of people want. Take what's already on the market, as Jobs did with the iPhone or the iPod, and make it ... well, very nice to use.
Yet they seem to be stumbling here a bit, which is a shame.
“IMO the real vulnerability is located at the "Act" part of "ReAct" (reasoning and action) agent framework.”
This is a fancy way of saying that “the problem is tool calling”, which is obviously true. The catch is that when it works correctly (99.99% of the time), it adds a great deal of value to LLMs.
Sandboxing is a step in the right direction, but can also add friction.
Using guardrails is also good, but it adds latency and cost, and it still doesn’t solve 100% of the issues.
IMHO a proper solution to this problem has yet to be discovered. That solution, however, should NOT be based on LLMs, so guardrails are the wrong direction (albeit effective and easier to implement).
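One non-LLM direction, as a sketch: a deterministic allowlist checked in ordinary code before any model-proposed tool call runs. The tool names, the workspace path and the policy shape are hypothetical. A real gate would also need to resolve symlinks against the actual filesystem, which this pure-path check does not do.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class ToolCall:
    tool: str
    path: str


# Hypothetical policy: which tools the agent may call, and under which directory.
ALLOWED_TOOLS = {"read_file", "list_dir"}
WORKSPACE = PurePosixPath("/home/user/project")


def is_allowed(call: ToolCall) -> bool:
    """Deterministic check, evaluated outside the model, before execution."""
    if call.tool not in ALLOWED_TOOLS:
        return False
    target = PurePosixPath(call.path)
    # Pure paths do not collapse "..", so reject it explicitly.
    if not target.is_absolute() or ".." in target.parts:
        return False
    return target == WORKSPACE or WORKSPACE in target.parents


print(is_allowed(ToolCall("read_file", "/home/user/project/src/main.py")))
print(is_allowed(ToolCall("read_file", "/home/user/project/../.ssh/id_rsa")))
```

Because the decision never passes through an LLM, a prompt injection cannot talk its way past it; the tradeoff is the friction of maintaining an explicit policy.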
By using "ReAct", I just wanted to emphasize the "agentic" perspective of tool calling, which brings tool calls into contact with the real world and sometimes puts it at risk. So I'm not downplaying the significance of tool calling.
Yes, I build agent infrastructure for PCs, so I'm fully aware that the protective measures are weak and inadequate; sometimes it seems like an unsolvable problem. But going by the article, it's hard to describe what Microsoft did politely. If they had shown even a little security awareness, I could completely understand, but it's as if they vibe-coded the entire permissions system of Cowork.
Ultimately it all sounds like variations of “don’t blame the tool for situations the tool enables,” which has never been particularly convincing as an argument if you ask me.
And it consistently quotes information from them, yes. They quite like showing Reddit and news icons while searching, but expand the references and they paint a rather different picture, especially for common searches, which are flooded with junk. Niche stuff seems more likely to reference decent sites, but it has massively worse hallucinations.