ohadron · 2026-05-20T18:21:19 1779301279

This is great. Agentic coding at 600+ tokens/sec is going to be a radically different beast. Coming soon-ish?

dkersten · 2026-05-20T18:31:09 1779301869

For small enough tasks with tight enough workflows, you can have it right now. I.e., if you can constrain the task to work well with GPT OSS 120B/Llama 3.3/Qwen 3, then you can get upwards of 600 TPS on Groq and up to 3k TPS on Cerebras.

Those models aren’t comparable to Opus, or even weaker models like MiniMax, but for certain tasks (focused context and prompts, strict workflows, single-purpose requests) you absolutely can use these models and get insane speeds.

black_knight · 2026-05-20T18:59:47 1779303587

People seem to use these tools very differently from each other. I value intelligence over speed any day. My programs are written in Haskell, so there are rarely any tasks which require thousands and thousands of lines to solve. Just intelligence. If there are rote tasks, I want the LLM to help me find intelligent ways of automating it: the right abstraction, the right meta-programming technique.

I constantly push Opus and GPT, and they are getting better. But I still have to do the hardest parts myself. I would not mind waiting 10-15 minutes for the right 20 lines of code!

sshine · 2026-05-20T19:24:34 1779305074

Why do you use Haskell? Why not something that produces a more predictable memory use at runtime? (I’m asking earnestly as a former Haskeller turned Rustacean who sees the value in “Boring Haskell”, but favours strictness for anything internet-facing and many things that aren’t compilers.)

black_knight · 2026-05-20T20:13:47 1779308027

I use Haskell because purity and strong typing give so much control over what each part of the program does. This has huge benefits when it comes to security, and just a general lack of bugs. Also, it makes the code easy to write, once the types are in place.

I use Haskell because I find laziness to be a super power. I can solve so many problems in the most straightforward way, and then laziness saves my butt w.r.t. performance.

I use Haskell because it is a better C than C is. The foreign function interface is brilliant, and I can take C primitives and apply all the abstraction mechanisms from Haskell to them. My latest project has been OpenGL based, so lots of caring about byte alignments and shovelling data to the GPU. But all this can be automated with clever use of type classes and Generics (Haskell's super cool meta system of data types).

I use Haskell because I love applying abstractions to make code which describes the problem, and then the compiler finds the solution.

I don’t do programming for embedded, so I am rarely memory constrained. I also understand Haskell memory usage quite well, and can get myself out of trouble.

tekacs · 2026-05-20T19:04:34 1779303874

Google's 3.5 Flash – which came out yesterday – is 200-300 tokens/second (albeit purportedly inefficient in its use of reasoning tokens) and according to Google, 800-1500+ tokens/second on their 8i TPUs when they're out!

It's... suboptimal, but hopefully that's a reason to hope... if Google get themselves together for 3.5 Pro / the next Flash.

c7b · 2026-05-20T18:48:33 1779302913

Do you have ideas/suggestions for agentic workflows that only start making sense at such speeds?

colechristensen · 2026-05-20T19:09:51 1779304191

Branching strategies: do 10 things in parallel and evaluate for the best at the end, or something along the lines of an evolutionary algorithm. Turn up the temperature on an LLM, add a survival mechanism, and generate solutions to the same problem over and over.
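
A rough sketch of the evolutionary idea, in Python. Both `generate` and `score` are hypothetical stand-ins invented for illustration: a real setup would call an LLM at a higher temperature and rank candidates with tests or a judge. Here they are toy string functions so the loop runs on its own.

```python
import random

def generate(parent: str, rng: random.Random) -> str:
    """Hypothetical stand-in for a high-temperature LLM call: mutate one character."""
    alphabet = "abcdefghijklmnopqrstuvwxyz "
    i = rng.randrange(len(parent))
    return parent[:i] + rng.choice(alphabet) + parent[i + 1:]

def score(candidate: str, target: str) -> int:
    """Hypothetical fitness function: number of positions that match the target."""
    return sum(a == b for a, b in zip(candidate, target))

def evolve(target: str, population: int = 10, survivors: int = 3,
           generations: int = 500, seed: int = 0) -> str:
    rng = random.Random(seed)
    pool = ["a" * len(target)] * population
    for _ in range(generations):
        # Survival step: keep only the best few candidates.
        pool.sort(key=lambda c: score(c, target), reverse=True)
        best = pool[:survivors]
        if best[0] == target:
            break
        # Branching step: every new candidate is a variation of a survivor.
        children = [generate(rng.choice(best), rng)
                    for _ in range(population - survivors)]
        pool = best + children
    return max(pool, key=lambda c: score(c, target))
```

Because the survivors are carried over unchanged, the best score never drops from one generation to the next. Whether that holds up with a noisy LLM-based scorer is a separate question.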

c7b · 2026-05-20T22:04:03 1779314643

Regarding the first: parallel requests to the same loaded model seem to work pretty well. I'm trying to find time to look into it more myself, but this might already be within reach for local models.

colechristensen · 2026-05-21T03:25:23 1779333923

Sure, it's possible, but you'd start to use it much more and in more advanced ways. Like "thinking hard" would consist of spawning a dozen different inferences from the same cached point and then picking the best one.
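
A minimal best-of-N sketch of that "thinking hard" idea, in Python. `sample_completion` and `judge` are hypothetical placeholders, not any provider's API: the first stands in for one inference request that reuses a cached prompt prefix, the second for whatever picks the winner (tests, a judge model, a heuristic).

```python
from concurrent.futures import ThreadPoolExecutor
import random

def sample_completion(prompt: str, seed: int) -> str:
    """Hypothetical stand-in for one inference request from a shared cached prefix."""
    rng = random.Random(seed)
    return f"{prompt} -> answer {rng.randint(0, 100)}"

def judge(completion: str) -> int:
    """Hypothetical scorer: here, just the number at the end of the completion."""
    return int(completion.rsplit(" ", 1)[1])

def best_of_n(prompt: str, n: int = 12) -> str:
    # Fire n requests concurrently, then keep the highest-scoring completion.
    with ThreadPoolExecutor(max_workers=n) as pool:
        candidates = list(pool.map(lambda s: sample_completion(prompt, s), range(n)))
    return max(candidates, key=judge)
```

Threads are enough here because a real request would be I/O-bound. The cost is n times the output tokens per "thought", which is why this only starts to feel reasonable at very high tokens/sec.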

ohadron · 2026-05-20T20:08:18 1779307698

Obviously things will get expensive quickly, but the main thing for me would be not dealing with the context switch every time I leave the agent to do stuff on its own.

Feedback loops for prototyping could become even quicker.

dandaka · 2026-05-21T14:02:23 1779372143

In my experience, current agentic workflows are so slow that in many cases it only makes sense to run them in parallel. So a lot of context switching. If we could have 10-100× faster token generation, we could have task delivery at the speed of human review.
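
A back-of-the-envelope version of that claim, in Python. The 20,000-token task size and the 60 TPS baseline are assumptions picked for illustration; the 600, 3k and 15k figures are the rates mentioned elsewhere in this thread. It only counts output generation and ignores prompt processing, tool calls and network latency.

```python
def generation_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Wall-clock time spent purely on generating output tokens."""
    return output_tokens / tokens_per_second

task_tokens = 20_000  # assumed size of one agentic task's output
rates = {
    "assumed baseline": 60,
    "600 TPS": 600,
    "3k TPS": 3_000,
    "15k TPS": 15_000,
}

for label, tps in rates.items():
    print(f"{label:>16}: {generation_seconds(task_tokens, tps):7.1f} s")
```

With these assumed numbers the same task drops from roughly five and a half minutes at 60 TPS to about half a minute at 600 TPS, which is the point where waiting starts to beat context switching.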

8note · 2026-05-20T18:36:16 1779302176

i really want a qwen on one of these chips: https://chatjimmy.ai

15k tokens/s would get me feeling like it's actually worth splitting out worktrees to try several approaches to a problem

Cerium · 2026-05-20T18:43:16 1779302596

Why is that? It seems like the opposite to me. I want to be sure I can complete a task in a certain amount of wall clock time. If the tokens per second are slow, then I am risking more by running a single approach at a time, and I have an incentive to try to multiplex my attention between separate work-streams. If the generation is fast enough to occupy my attention, then there is nothing more to gain from having parallel threads.

philipp-gayret · 2026-05-20T18:29:02 1779301742

If you have a Cerebras Code subscription you can experience it right now. Indeed, a very different experience.

KronisLV · 2026-05-20T18:36:32 1779302192

Used them for a while! They didn't seem to have prompt caching, so I burnt through the daily 24M token limit really quickly when doing large-scale changes on a codebase (essentially a team's worth of menial migration/refactoring work). A lot of it was okay, but plenty had to be redone and I still spotted some issues months down the line. In part I blame their model catalogue, which did get an update to GLM 4.7 sometime way back but is definitely showing its age: https://inference-docs.cerebras.ai/models/overview

Quality-wise, Anthropic gives me the best results (Opus for almost everything; I have sub-agents with fresh context review its work, and after 2-10 loops that usually finds most issues). In terms of token amounts for agentic work, DeepSeek V4 is up there. What Cerebras is doing is pretty cool though, apparently they even have prompt caching now like the other big providers: https://inference-docs.cerebras.ai/capabilities/prompt-cachi... At the same time, producing bad code faster was annoying in a uniquely new way.

Wish they'd update the models with their subscription, it could genuinely be great with the proper harness. Like if they can run GLM 4.7, surely they could at least get DeepSeek V4 Flash with a big context window going as a starting point. How can you have so much money to make your own chips, but can't run modern models that you can get for free? It's like they don't want people to use their subscription.

cactusplant7374 · 2026-05-20T20:08:33 1779307713

Have you tried Codex? If you have, how does it compare to Opus?

KronisLV · 2026-05-20T21:13:00 1779311580

Codex is pretty good, OpenAI models are up there with Anthropic's, though I still prefer the latter for most development tasks (in part UI/UX, in part personal preference for how the model performs and interacts with me and the codebases). That said, if you do get a subscription from OpenAI, they actually have more generous usage limits than Anthropic - Anthropic's Pro tier is borderline useless for agentic development and I just went with their 100 USD Max tier instead. OpenAI might be more cost effective, though GPT-5.5 is more expensive than GPT-5.4, for example.

Recently I've also been considering downgrading to Pro and using DeepSeek V4 Pro for anything but the more complex tasks, and I basically wrote a little utility to hook Claude Code up with 3rd party providers better: https://ccode.kronis.dev/ Or tbh I could also just use OpenCode on the CLI, or maybe something like KiloCode in Visual Studio Code (sadly RooCode got retired, I liked their UI/UX a lot too).

I guess where I'm going with all this is that most of the SOTA or near-SOTA models are pretty okay and if you want, you should either get their more affordable plans for a month and experiment, or maybe hook up whatever tools you have with something like OpenRouter and try out a bunch of them: https://openrouter.ai/ (though some of their providers quantize the models a lot, look out for that) Personally I'd also add the new Kimi and GLM models to the list of the ones to try out.

Paying for API tokens isn't really financially good long term for anyone but companies and eventually most folks just settle on a subscription of some sort, since those are heavily subsidized and more cost effective.

dkersten · 2026-05-20T18:35:20 1779302120

It’s GLM 4.7, GPT OSS 120B, or Llama 3.1 8B, so not exactly the latest or best models.

But GLM is good enough for many small tasks, certainly enough to get a taste for Cerebras’ high speeds!

[edit: actually that’s just their general models, I can’t see what Cerebras Code offers. It was Qwen-coder when it launched but I don’t know what it is now. I think GLM 4.7 but I’m not completely sure]

philipp-gayret · 2026-05-20T20:27:26 1779308846

> It was Qwen-coder when it launched but I don’t know what it is now.

This was also what I used at the time, the Qwen 3 Coder 480b on Cerebras. Worked great and was so stupidly fast it made me realize that if the hardware can be at that level and commercially available (say in 5-10 years), for that price, then we will have entirely new bottlenecks. Human review at the pace it was going is completely impossible.

ohadron · 2026-02-04T06:10:41 1770185441

The maximum theoretical size for a zip archive is 16 exbibytes (2^64 bytes, about 18.4 exabytes). It's free if you have somewhere to store it.

Someone · 2026-02-04T13:12:39 1770210759

Should be doable on consumer hardware nowadays, if you cheat by using a file system that either supports sparse files (https://en.wikipedia.org/wiki/Sparse_file) or block-level deduplication (https://en.wikipedia.org/wiki/Data_deduplication). You may need to use raw block I/O to create such a file, and there will be lots of duplicated content in the archive.
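
For the sparse-file route, a small Python sketch of the mechanism, assuming a POSIX system and a file system that supports sparse files. The 64 MiB size and the file name are arbitrary choices for the demo, nowhere near the ZIP limit. It only shows that a file's logical size can be set without writing its contents.

```python
import os
import tempfile

def make_sparse_file(path: str, logical_size: int) -> None:
    """Create a file with the given logical size without writing its contents."""
    with open(path, "wb") as f:
        f.truncate(logical_size)  # extends the file; unwritten ranges read back as zeros

def sizes(path: str) -> tuple[int, int]:
    """Return (logical size in bytes, bytes actually allocated on disk)."""
    st = os.stat(path)
    return st.st_size, st.st_blocks * 512  # st_blocks counts 512-byte units on POSIX

with tempfile.TemporaryDirectory() as d:
    demo_path = os.path.join(d, "sparse.bin")
    make_sparse_file(demo_path, 64 * 1024 * 1024)
    demo_logical, demo_allocated = sizes(demo_path)
    print(demo_logical, demo_allocated)
```

On a file system with sparse-file support the allocated figure should come out far smaller than the logical one. On one without it, the two will be about the same and the trick does not apply.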

Also: how hard is that limit? ZIP archives have their TOC at the end of the file and allow for inserting ‘junk’ that is never referenced in the ZIP’s table of contents. Isn’t it possible to add such junk to make an archive go over that limit (assuming that your file system allows files larger than 2⁶⁴ bytes)?
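
The premise (readers find entries through the table of contents at the end, so unreferenced bytes are ignored) can be checked with a short Python sketch using the standard `zipfile` module. The entry name and the junk bytes are made up for the demo. This only shows that extra data is tolerated by this one reader; it says nothing about whether the 2⁶⁴ limit itself can be exceeded.

```python
import io
import zipfile

def zip_with_leading_junk(junk: bytes) -> bytes:
    """Build a one-entry ZIP in memory and put unreferenced bytes in front of it."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("hello.txt", "hi")
    return junk + buf.getvalue()

data = zip_with_leading_junk(b"JUNK" * 256)

# The reader starts from the end-of-central-directory record at the end of the data.
with zipfile.ZipFile(io.BytesIO(data)) as zf:
    names = zf.namelist()
    content = zf.read("hello.txt")

print(names, content)
```

This is the same property that self-extracting archives rely on. Other ZIP tools may be stricter about offsets than Python's reader, so treat it as a sketch of the idea and not a guarantee across implementations.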

pugworthy · 2026-02-04T06:36:01 1770186961

The problem is once you zip them to full compression, you really can't use them ever again. That is unless you get the good ones that let you technically unzip without requiring destruction.

ohadron · 2025-11-13T11:09:42 1763032182

But why

ohadron · 2025-10-23T19:54:54 1761249294

> Each step moves further from "how do we build better models?" toward "how do we monetize the models we have?"

I don't think OpenAI launching ChatGPT Apps and Atlas signals they're pivoting.

It's just that when you raise that much money you must deploy it in any possible direction.

ohadron · 2025-08-31T14:37:29 1756651049

LLMs would be amazing for this

recursive · 2025-08-31T22:00:15 1756677615

I wouldn't put an LLM in the loop for anything that has security implications.

protocolture · 2025-09-01T09:25:42 1756718742

Enigma 2.0 getting cracked due to the prevalence of the em dash.

ohadron · 2025-07-01T13:30:00 1751376600

This is a terrific idea and could also have a lot of value with regards to accessibility.

taco_emoji · 2025-07-01T20:46:00 1751402760

The problem, as always, is that LLMs are not deterministic. Accessibility needs to be reliable and predictable above all else.

ohadron · on May 27, 2025

Took a while but Wix / Webflow / SquareSpace / Wordpress did end up automating a bunch of work.

JimDabell · on May 27, 2025

They did, but do you think there are more or fewer web development jobs now compared with the 90s?

tonyedgecombe · on May 27, 2025

There is a whole lot of brochure type web work that has disappeared, either to these site builders or Facebook. I don't know what happened to the people doing that sort of work but I would assume most weren't ready to write large React apps.

JimDabell · on May 27, 2025

Why are you assuming that? How do you think all the new React jobs were filled? React developers don’t magically spring into existence with a full understanding of React out of nowhere, they grow into the job.

ohadron · on May 27, 2025

The web developer / web page ratio in 2025 is for sure way lower than it was in 1998.

JimDabell · on May 27, 2025

Why should anybody care about that metric? People care about jobs.

usersouzana · on May 27, 2025

More, but that doesn't say anything about the future.

JimDabell · on May 28, 2025

Sure it does. It’s not a guarantee, but presuming that a pattern is likely to continue is not nothing. When a pattern is observed, the onus is on the “This time is different!” side to make their case.

usersouzana · on May 28, 2025

The aim of AI is to automate almost everything. That sounds like a future quite different from any past.

JimDabell · on May 28, 2025

If you don’t accept that an observed trend says anything about the future, you shouldn’t make unsupported assertions in the opposite direction. They say less.

usersouzana · on May 28, 2025

AI aiming to automate everything is something new. That's the point. There was no AI in the past similar to what is slowly unfolding now. Not even close. If you disagree with the word "anything" I used, then yes, I understand I shouldn't have used this word.

ohadron · on May 7, 2025

For one thing, it's way faster than the OpenAI equivalent in a way that might unlock additional use cases.

freedomben · on May 7, 2025

Speed has been the consistent thing I've noticed with Gemini too, even going back to the earlier days when Gemini was a bit of a laughing stock. Gemini is fast.

julianeon · on May 7, 2025

I don't know exactly the speed/quality tradeoff but I'll tell you this: Google may be erring too much on the speed side. It's fast but junk. I suspect a lot of people try it then bounce off back to Midjourney, like I did.

ohadron · on April 10, 2025

That’s actually a good prompt

bsimpson · on April 10, 2025

"Now that you've been promoted, you don't build CRUD tools anymore. Those are below your level. Instead, you build AI agents that build the CRUD tools."

ohadron · on July 14, 2024

Took a bit to run but now my iPhone feels much faster. Thanks!