
stingraycharles · 2026-05-27T22:23:11 1779920591

I’m proud to say that I’m still using it, connected through Spotify! 23 years and counting, and it pretty much captured 100% of my listening activity. It’s a gimmick, but a nice gimmick to have and to be able to look up my music taste over time.

I think last.fm’s radio is actually better than Spotify’s, I still use it to discover new music.

It’s sad that the acquisition happened back then. They had huge momentum, and all of it was instantly lost after the acquisition. Pretty much a textbook case of how not to do it.

stingraycharles · 2026-05-27T04:04:15 1779854655

Don’t forget they now also have an OpenRouter alternative.

stingraycharles · 2026-05-27T04:03:13 1779854593

That sounds like the product is not finished and should not be released?

nine_k · 2026-05-27T06:36:32 1779863792

"If you are not ashamed by what you are shipping, you are not shipping early enough" (Quoting from memory)

stingraycharles · 2026-05-27T10:25:48 1779877548

That’s really about creating an MVP for a startup, because too many founders stay in a cave trying to make it “perfect” before collecting valuable user feedback.

This does not apply to Cloudflare, especially not for an auth token that needs to be published on your website that cannot be restricted.

mceachen · 2026-05-27T14:24:55 1779891895

If your engineer tells you that, you're going to have a bad time.

I think you're thinking of this:

> If you are not embarrassed by the first version of your product, you've launched too late. -Reid Hoffman

crabmusket · 2026-05-27T07:13:25 1779866005

That's a terrible attitude for an infrastructure company. This is what private betas / close iteration with customers is for.

ai_fry_ur_brain · 2026-05-27T04:36:32 1779856592

This has been the Cloudflare standard operating procedure for the last year or so. Nonstop shipping of alpha/beta products.

rustystump · 2026-05-27T05:58:42 1779861522

Otherwise known as vibe code snacking. Vibe out the easy 80% and say the hard 20% is “coming soon tm”

stingraycharles · 2026-05-27T02:57:44 1779850664

“and hence you see the rise of Chinese models to establish contracts globally”

How will that help them work around the distillation issue?

gessha · 2026-05-27T03:09:50 1779851390

Collecting user data directly by competing on price. The next step would be figuring out how that data can bring them closer to SOTA.

stingraycharles · 2026-05-27T03:38:29 1779853109

Yes ok but that doesn’t give them the thinking tokens, how to reason about the prompt, which is precisely what’s most important.

stingraycharles · 2026-05-27T02:56:23 1779850583

So you think they’re running the same types of state of the art Nvidia deployments?

onlyrealcuzzo · 2026-05-27T03:12:14 1779851534

It's supposed to be even MORE expensive:

Nvidia H100: Typically priced around $25,000–$30,000 (global MSRP).

Huawei Ascend 910C: Reported to cost roughly $28,000, yet it delivers only 60% of the inference performance of the Nvidia H100.

Google's TPUs are significantly cheaper for Google for inference. That's pretty much it.

There's a reason Nvidia has an 80% margin right now.
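The price comparison above can be worked through with a quick calculation. This is a rough sketch that uses only the figures quoted in the comment (none independently verified) and assumes inference performance scales linearly with the number of chips, which real deployments may not match. The function name `price_per_h100_equivalent` is made up for illustration.

```python
# Figures quoted in the comment above; treat them as reported, not verified.
H100_PRICE_RANGE = (25_000, 30_000)  # USD, quoted price range
ASCEND_910C_PRICE = 28_000           # USD, reported price
ASCEND_RELATIVE_PERF = 0.60          # fraction of H100 inference performance


def price_per_h100_equivalent(price_usd: float, relative_perf: float) -> float:
    """Price of enough hardware to match one H100, assuming linear scaling."""
    if relative_perf <= 0:
        raise ValueError("relative_perf must be positive")
    return price_usd / relative_perf


ascend_effective = price_per_h100_equivalent(ASCEND_910C_PRICE, ASCEND_RELATIVE_PERF)
print(f"Ascend 910C per H100-equivalent: ${ascend_effective:,.0f}")

for h100_price in H100_PRICE_RANGE:
    ratio = ascend_effective / h100_price
    print(f"vs H100 at ${h100_price:,}: {ratio:.2f}x the price for equal performance")
```

Under these assumptions the Ascend 910C comes out at roughly 1.6x to 1.9x the price of an H100 per unit of inference performance, which is the sense in which it is "even MORE expensive".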

thrownthatway · 2026-05-27T03:22:57 1779852177

MSRP is irrelevant in this context.

stingraycharles · 2026-05-26T15:09:35 1779808175

Also, your local hardware is in no way capable of running the types of models that the cloud providers do, it’s just not economically feasible, and it never will be.

ajb · 2026-05-27T08:39:11 1779871151

SanDisk has designed a flash equivalent to HBM, which has 1.6TB/s of bandwidth. I expect that it will be available initially to server manufacturers only, but once supply ramps up will be built into individual machines. At that point it will be practical to run local inference on much larger models. Of course, maybe the SOTA providers will find some way to use even larger ones, but it seems like the returns to scale aren't as much as they were.

bachmeier · 2026-05-26T16:04:26 1779811466

Very much dependent on the situation. For many business tasks, local hardware is good enough. But what a lot of folks overlook when saying these things is that (a) workers do more than run AI models on a piece of hardware, (b) significant computer hardware is already sitting idle outside normal work hours, when it can be running batch jobs, and (c) employees can share local hardware.

adrian_b · 2026-05-26T18:16:23 1779819383

Depends on what you mean by "economically feasible".

Even very cheap mini-PCs and laptops can run any of the models run by cloud providers, albeit at a much lower speed (i.e. with the weights stored on SSDs).

Whether such a low speed is useful, depends on the application. For something like a coding assistant or bug scanning, an instant response is desirable, but certainly not necessary.

christina97 · 2026-05-26T18:56:28 1779821788

The SSD would wear out in days while the laptop generates two responses a day. This is like saying you could power your home with AA batteries, yes technically you could but in practice entirely infeasible.

adrian_b · 2026-05-26T22:27:45 1779834465

There is no wear on the SSDs, because the weights are just read, they are not written during inference.

For model training, the requirements are very different, and the training of a big LLM cannot be done with home equipment. On the other hand, inference can be done on almost any PC, even for LLMs with thousands of billions of parameters, just very slowly.

The only problem is that inference becomes limited by SSD read throughput. Most of the cheap new personal computers available today can read from only 2 SSDs simultaneously (if there are more, they share a read path), typically 1 PCIe 5.0 SSD and 1 PCIe 4.0 SSD. This has an upper throughput limit of 24 GB/s, with 15 to 20 GB/s achievable in practice.

Then the speed in token/s is limited by the amount of weights that must be read per inference cycle. The ratio between output tokens and the amount of weights that must be read can be improved by various methods, like batching multiple tasks or using speculative decoding.
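The bandwidth argument above can be turned into a back-of-the-envelope bound. The sketch below is only a simplified model: it assumes every inference pass streams its weights from SSD, ignores RAM caching and compute time, and treats batching or speculative decoding as simply yielding more tokens per pass. The 40 GB figure for weights read per pass is a hypothetical value chosen for illustration, not a measurement of any particular model.

```python
def max_tokens_per_second(
    read_gb_per_s: float,
    weights_read_gb: float,
    tokens_per_pass: float = 1.0,
) -> float:
    """Upper bound on decode speed when each pass streams weights from SSD.

    read_gb_per_s:   sustained SSD read throughput in GB/s
    weights_read_gb: amount of weights read per inference pass, in GB
    tokens_per_pass: output tokens produced per pass (batching or
                     speculative decoding can raise this above 1)
    """
    if read_gb_per_s <= 0 or weights_read_gb <= 0:
        raise ValueError("throughput and weight size must be positive")
    return tokens_per_pass * read_gb_per_s / weights_read_gb


# Hypothetical model that reads 40 GB of weights per pass.
WEIGHTS_READ_GB = 40.0

# Practical range and theoretical limit quoted in the comment above.
for throughput in (15.0, 20.0, 24.0):
    rate = max_tokens_per_second(throughput, WEIGHTS_READ_GB)
    print(f"{throughput:>4.0f} GB/s -> at most {rate:.3f} tokens/s")

# Four tasks batched together share one pass over the weights.
batched = max_tokens_per_second(20.0, WEIGHTS_READ_GB, tokens_per_pass=4)
print(f"batch of 4 at 20 GB/s -> at most {batched:.1f} tokens/s in total")
```

The bound scales linearly in both directions: halving the weights read per pass or doubling the tokens produced per pass each doubles the achievable token rate.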

jurgenburgen · 2026-05-27T06:30:09 1779863409

Does more RAM increase performance? This approach sounds like it could eventually be fast enough for local use as hardware and models improve.

zozbot234 · 2026-05-27T07:46:34 1779867994

Faster SSD access improves performance more than RAM does, at least until all of the model is being cached in RAM. So older and cheaper HEDT platforms with lots of PCIe lanes to attach storage to are best for this approach.

jyounker · 2026-05-26T19:14:58 1779822898

Weights are write-once data.

zozbot234 · 2026-05-26T15:37:18 1779809838

It can run open-weight models that are roughly as capable. It's going to be slow unless you're using actual datacenter hardware, but they'll run.

colonCapitalDee · 2026-05-26T15:40:52 1779810052

"roughly" is doing a lot of heavy lifting there

adrian_b · 2026-05-26T18:24:18 1779819858

The difference between datacenter hardware and cheap personal hardware is not in what can be run and what cannot be run.

Anything can also be run on a cheap computer.

The difference is in speed. A cheap computer may run a big model up to a few orders of magnitude slower than datacenter hardware, depending on whether the LLM is small enough to fit in GPU memory, small enough to fit in CPU memory, or so big that it must spill onto SSDs.

Depending on the application, the tradeoff between run time and run cost may happen to favor using local hardware, despite a much slower speed.

There are plenty of applications where running them at negligible cost as an overnight job can be preferable to obtaining faster results at a very high price. One example is scanning for bugs in a mature code base with a large number of different open-weights LLMs, which can achieve bug coverage similar to that of a single, overpriced and unavailable SOTA LLM, e.g. Mythos.

stingraycharles · 2026-05-26T22:29:41 1779834581

> The difference between datacenter hardware and cheap personal hardware is not in what can be run and what cannot be run.

You do realize that a model like Opus is (estimated to be) around 5T parameters, and uses around 5TB of GPU memory?

These kinds of things are just impossible to run locally.

adrian_b · 2026-05-26T22:44:34 1779835474

This kind of thing can certainly be run locally, even on a small mini-PC, like a NUC, or even on a laptop, with the weights stored on SSDs.

As I have said, the problem is not that they cannot be run, but that they may run more slowly than is acceptable for a given application. Depending on the model, the speeds reported for inference with weights stored on SSDs vary from one token every few seconds to at most a few tokens per second.

Computers could solve relatively huge problems even in the early days of vacuum tube computers, when the main memories were measured in kilobytes, because at that time it was not expected that the data needed for problem solving must fit inside the main memory or even in the next tier of memory, with magnetic drums or magnetic disks, but the really big problems were solved by a great number of passes over data stored on magnetic tapes.

An LLM whose inference could not be run on a small mini-PC would have to be one hundred times bigger than the biggest existing SOTA LLMs.

Any LLM that exists today can be run on almost any PC, just extremely slowly in comparison with datacenter hardware.

dns_snek · 2026-05-27T07:59:43 1779868783

When people say that you "can't do" something what they actually mean is that it's completely impractical (if not impossible).

zozbot234 · 2026-05-27T11:02:57 1779879777

Whether something is "impractical" depends on your expectations. High-latency unattended inference is definitely viable, even though it doesn't align much with what's being run in hyperscale datacenters.

dns_snek · 2026-05-27T12:23:30 1779884610

I'd like to meet the person who's been using a 1 token/second system as their primary LLM for at least a few weeks. Anyone?

I think 1 token/second is optimistic here - and even then it's over 11 days per million tokens.
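The "over 11 days per million tokens" figure follows directly from the token rate. A minimal check, where the helper name `days_for_tokens` is made up for illustration:

```python
SECONDS_PER_DAY = 86_400


def days_for_tokens(tokens: int, tokens_per_second: float) -> float:
    """Wall-clock days needed to generate `tokens` at a constant rate."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return tokens / tokens_per_second / SECONDS_PER_DAY


print(f"{days_for_tokens(1_000_000, 1.0):.2f} days")  # prints "11.57 days"
```

At one token every three seconds, the same million tokens would take about 35 days.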

devmor · 2026-05-26T16:24:42 1779812682

> it never will be.

Giving strong “640k is enough for anyone” vibes here.

3form · 2026-05-26T20:56:25 1779828985

The 640k statement was absolute; this one is comparative.

Cloud should have more compute and efficiency than local. I wouldn't be 100% sure, as I don't know what I might not be seeing, but still.

Whether that comparative advantage will matter, though, is a completely different question.

devmor · 2026-05-27T00:50:48 1779843048

Gotcha, I think I misunderstood the statement as saying today’s cloud-required will never be local-capable.

cortesoft · 2026-05-26T17:12:08 1779815528

NEVER will be is a pretty big leap. Never is a long time.

stingraycharles · 2026-05-26T06:55:28 1779778528

It’s unlike Apple to be too early with something. Usually it’s the competition and they show how it should really be done.

I guess the main problem here is the price point, which will improve over time and with scale.

0bytes · 2026-05-26T08:12:09 1779783129

The Newton team might disagree.

b112 · 2026-05-26T08:54:04 1779785644

I think the comments are a bit negative in this thread; however, Newton has nothing to do with Apple now. Or the last decade. Or the last 20 years. It's touching on 30+ years post launch now. Pointing at an "early idea" from 1993 is more the exception to the rule.

Products such as the iPod and then the iPhone were as the parent poster describes. Both iPod-like devices and the iPhone were successors to other devices already on the market. It was how they were presented, packaged, and tailored that made them special and unique. Yet the launches of these devices are also in the range of two decades ago.

In the tech world, a few years is a long time let alone 20 or 30 years.

I'd say Apple is barely innovative now, and further, their 'early ideas' are long, long, long gone.

This is why it's such a shame that their products aren't as polished as they used to be. They still have a very strong capacity to do this, and I wish they would. It's a great market, and it's what a lot of people want. Take what's already on the market, as Jobs did with the iPhone, or the iPod, and make it ... well, very nice to use.

Yet they seem to be stumbling here a bit, which is a shame.

stingraycharles · 2026-05-26T04:42:11 1779770531

This is just how things work when there’s much less overhead. Which is typically the case for smaller companies.

stingraycharles · 2026-05-26T00:15:31 1779754531

“IMO the real vulnerability is located at the "Act" part of "ReAct" (reasoning and action) agent framework.”

This is a fancy way of saying that “the problem is tool calling”, which is obviously true. The problem is that, when it works correctly (99.99% of the time), it adds so much more value to LLMs.

Sandboxing is a step in the right direction, but can also add friction.

Using guardrails is also good, but adds latency, expenses, and also doesn’t solve 100% of the issues.

IMHO there currently does not exist a proper solution to this problem, and it has yet to be discovered. The proper solution, however, should NOT be based on LLMs, so guardrails are the incorrect direction (albeit effective and easier to implement).

EFLKumo · 2026-05-26T00:35:58 1779755758

By using "ReAct", I just wanted to emphasize the "agentic" perspective of tool calling, which exposes tool calls to the real world and sometimes puts them at risk. So I'm not downplaying the significance of tool calling.

Yes, I'm a builder of agent infra on PCs, so I can completely sense that the protective measures are weak and inadequate, sometimes seeming like an unsolvable problem. But according to the article, what Microsoft did is hard to describe politely. If they had even a little security awareness, I could completely understand, but it's like they've vibe coded the entire permissions system of Cowork.

Forgeties79 · 2026-05-26T00:18:46 1779754726

Ultimately it all sounds like variations of “don’t blame the tool for situations the tool enables,” which has never been particularly convincing as an argument if you ask me.

stingraycharles · 2026-05-25T14:43:13 1779720193

For simple queries it’s fine. The main value it adds is that it’s above the ad spam.

jryan49 · 2026-05-25T15:03:22 1779721402

But it was trained on ad spam wasn't it?

Groxx · 2026-05-25T16:42:19 1779727339

And consistently quotes information from it, yes. They quite like showing reddit and news icons while searching, but expand the references and it paints a rather different picture, especially for common searches, which are flooded with junk. Niche stuff seems more likely to reference decent sites, but has massively worse hallucinations.