Having the productivity "drop through the floor" is a bit hyperbolic, no? Humans are still reviewing the PRs before code merge at least at my company (for the most part, for now).
I don't know that it's likely but it's certainly a plausible outcome. If tooling keeps getting built for this and the financial music stops it's going to take a while for everybody to get back up to speed
Remember this famously happened before, in the 1970s
There's an actual working product now, albeit one which is currently loss leading. In software world at least there is definitely enough value for it to be used even if it's just better search engine. I'm not sure why it would disappear if the financial music stops as opposed to being commoditised.
Because there's cheaper ways to get an equally good search engine? But yes I imagine some amount of inference will continue even in an AI Winter 3.0 scenario.
They are killing off their last and best generation of men, so yes the economy will suffer. I'm not questioning that part -- it's the repeated "russia will collapse any minute" propaganda, going on for 12+ years, that is very easy to see through.
Yeah seriously. Don't people understand the fact that society is not good at mopping up messes like this—there has been a K shaped economy for several decades now and most Americans have something like $400 in their bank accounts. The bottom had already fallen out for them, and help still hasn't arrived. I think it's more likely that what really happens is that white collar workers, especially the ones on the margin, join this pool—and there is a lot of suffering for a long time.
Personally, rather than devolving into nihilism, I'd rather try to hedge against suffering that fate. Now is the time to invest and save money. (or yesterday)
If white collar workers as a whole suffer severe economic setback over a short term timespan, your savings and investments won’t help you.
Unless you’re investing in guns, ammo, food, and a bunker. We’re talking worse unemployment than depression era Germany. And structurally more significant unemployment because the people losing their jobs were formerly very high earners.
That’s the cataclysmic outcome, though. Although I deem that certainly possible and would put a double digit percentage probability on it, another very likely outcome is a very severe recession, or a recession where a lot of, but not all, white collar work is wiped out. Maybe there’s a significant restructuring in the economy. In a scenario like that, which also seems to be in the realm of possibility, I think having resources still matters. Speech to text, sorry for the poor grammar.
It’s definitely possible that there’s an impact that is bad but not cataclysmic. I figure in that case though my regular savings is enough to switch to something else. I could retire now if I was willing to move somewhere cheap and live on $60k a year. There’s a lot of things that could cause that level of recession though without the need for AI.
I do also think the mid level bad outcome isn’t super likely because if AI is good enough to replace a lot of white collar jobs, I think it could replace almost all of them.
> Our latest frontier models have shown particular strengths in their ability to do long-running tasks, working autonomously for hours, days or weeks without intervention.
I have yet to see this (produce anything actually useful).
I've been finding that the Opus 4.5/4.6 and GPT-5.2/5.3 models really have represented a step-change in how good they are at running long tasks.
I can one-shot prompt all sorts of useful coding challenges now that previously I would have expected to need multiple follow-ups to fix mistakes the agents made.
No, not for days - but it churned away on that one for about ten minutes.
I don't think I've got any examples of multi-hour or multi-day sessions that ran completely uninterrupted - this one back in December took 4.5 hours but I had to prompt it to keep going a few times along the way: https://simonwillison.net/2025/Dec/15/porting-justhtml/
Maybe so, but I did once spend 12 hours straight debugging an Emscripten C++ compiler bug! (After spending the first day of the jam setting up Emscripten, and the second day getting Raylib to compile in it. Had like an hour left to make the actual game, hahah.)
I am a bit thick with such things, but just wanted to provide the context that Emscripten can be a fickle beast :)
I sure am glad I can now deploy Infinite Mechanized Autistic Persistence to such soul-crushing tasks, and go make a sandwich or something.
(The bug turned out to be that if I included a boolean in a class member, the whole game crashed, but only the Emscripten version. Sad. Ended up switching back to JS, which you basically need anyway for most serious web game dev.)
How do you deal with the cost associated with a long running opus session? I asked it to validate some JSON configs against the spec yesterday and it burned $10 worth of tokens for what would have been a 1 millisecond linter task.
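For comparison, the deterministic version of that check can be a few lines of script. A minimal sketch, assuming a hypothetical spec that is nothing more than required keys and their types (`SPEC` and `validate_config` are made-up names; a real spec would more likely be a JSON Schema checked by a dedicated validator):

```python
import json

# Hypothetical spec: required keys and the Python type each must have.
SPEC = {"name": str, "port": int, "debug": bool}


def validate_config(text, spec=SPEC):
    """Return a list of human-readable problems; an empty list means valid."""
    try:
        config = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg} at line {exc.lineno}"]
    if not isinstance(config, dict):
        return ["top-level value must be an object"]
    problems = []
    for key, expected in spec.items():
        if key not in config:
            problems.append(f"missing key: {key}")
            continue
        value = config[key]
        # bool is a subclass of int in Python, so reject it explicitly for ints.
        wrong_type = not isinstance(value, expected) or (
            expected is int and isinstance(value, bool)
        )
        if wrong_type:
            problems.append(f"{key}: expected {expected.__name__}")
    return problems
```

Something like this costs nothing per run, which is the point of the complaint: the agent is better spent writing the checker once than acting as the checker every time.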
If you look through the commit logs on simonw/research and simonw/tools on GitHub most commits should either list the prompt, link to a PR with the prompt or link to a session transcript.
I routinely leave codex running for a few hours overnight to debug stuff
If you have a deterministic unit test that can reproduce the bug through your app front door, but you have no idea how the bug is actually happening, having a coding agent just grind through the slog of sticking debug prints everywhere, testing hypotheses, etc — it's an ideal usecase
I have a hard time understanding how that would work — for me, I typically interface with coding agents through cursor. The flow is like this: ask it something -> it works for a min or two -> I have to verify and fix by asking it again; etc. until we're at a happy place with the code. How do you get it to stop from going down a bad path and never pulling itself out of it?
The important role for me, as a SWE, in the process, is to verify that the code does what we actually want it to do. If you remove yourself from the process by letting it run on its own overnight, how does it know it's doing what you actually want it to do?
Or is it more like with your usecase—you can say "here's a failing test—do whatever you can to fix it and don't stop until you do". I could see that limited case working.
For some reason setting up agents in a loop with a solid prompt and new context each iteration seems to result in higher quality work for larger or more difficult tasks than the chat interface. It's like the agent doesn't have to spend half its time trying to guess what you want
It's constantly restarting itself, looking at the current state of things, re-reading what was the request, what it did and failed at in the past (at a higher level), and trying again and again.
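The loop described above can be sketched roughly like this. It is a minimal sketch, not any particular tool's implementation: `fresh_context_loop`, `run_agent` and `verify` are hypothetical names, and the two callables stand in for whatever agent CLI and test runner you actually use.

```python
from typing import Callable, List, Optional, Tuple


def fresh_context_loop(
    task: str,
    run_agent: Callable[[str], None],
    verify: Callable[[], Tuple[bool, str]],
    max_iters: int = 20,
) -> Optional[int]:
    """Re-run an agent with a fresh context until verify() passes.

    Returns the attempt number that passed, or None if none did.
    """
    notes: List[str] = []  # high-level record of what failed so far
    for attempt in range(1, max_iters + 1):
        # Rebuild the prompt from scratch: the original request plus short
        # notes on earlier failures, never the previous transcript.
        prompt = task
        if notes:
            prompt += "\n\nEarlier attempts failed:\n" + "\n".join(notes)
        run_agent(prompt)
        ok, log = verify()
        if ok:
            return attempt
        # Keep only the tail of the log so the notes stay small.
        notes.append(f"attempt {attempt}: {log[-200:]}")
    return None
```

In practice `run_agent` would shell out to the agent and `verify` would run the failing test, so the only thing carried between iterations is the repo state and the short notes.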
I don't even necessarily ask it to fix the bug — just identify the bug
Like if I've made a change that is causing some unit test to fail, it can just run off and figure out where I made an off-by-one error or whatever in my change.
I've heard this said a lot but never had this problem. Claude has been decent at debugging tests since 4.0 in my experience (and much better since 4.5)
it's more like "this function is crashing with an inconsistent file format error. can you figure out how a file with the wrong format got this far into the pipeline?". in cases like that the fix is usually pretty easy once you have the one code path out of several thousands nailed down.
Or, they have freed up time for more useful endeavours, that may otherwise have been spent on drudgery.
I don't discount the value of blood, sweat and tears spent on debugging those hard issues, and the lessons learned from doing so, but there is a certain point where it's OK to take a pass and just let the robots figure it out.
It's easy to say that these increasingly popular tools are only able to produce useless junk. You haven't tried, or you haven't "closed the loop" so that the agent can evaluate its own progress toward acceptance criteria, or you are monitoring incompetent feeds of other users.
I'm definitely bullish on LLMs for coding. It sounds to me as though getting it to run on its own for hours and produce something usable requires more careful thought and setup than just throwing a prompt at it and wishing for the best, but I haven't seen many examples in the wild yet.
Strategy -> [ Plan -> [Execute -> FastVerify -> SlowVerify] -> Benchmark -> Learn lessons] -> back to strategy for next big step.
Claude teams and a Ralph Wiggum loop can do it - or really any reasonable agent. But usually it all falls apart on either brittle Verify or Benchmark steps. What is important is to learn positive lessons into a store that survives git resets, machine blowups, etc… Any Telegram bot channel will do :)
The entire setup is usually a pain to set up - docker for verification, docker for benchmark, etc… Ability to run the thing quickly, ability for the loop itself to add things, ability to do this in worktree simultaneously for faster exploration - and god help you if you need hardware to do this - for example, such a loop is used to tune and custom-fuse CUDA kernels - which means a model evaluator, big box, etc…
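The inner part of that pipeline (Execute -> FastVerify -> SlowVerify, then Benchmark and Learn) might look something like the sketch below. Everything here is an assumption for illustration: `run_step` and its callables are hypothetical, and a JSONL file stands in for the lessons store, which only survives git resets if you keep it outside the repo.

```python
import json
from pathlib import Path


def run_step(plan, execute, fast_verify, slow_verify, benchmark,
             lessons_path, max_tries=5):
    """Try a plan until a candidate passes both verifiers, then benchmark it.

    Returns (candidate, score), or (None, None) if every try failed.
    """
    lessons = Path(lessons_path)  # assumed to live outside the working tree
    for attempt in range(1, max_tries + 1):
        candidate = execute(plan)
        # Cheap checks run first so the expensive ones only see
        # candidates that are at least plausible.
        if not fast_verify(candidate):
            continue
        if not slow_verify(candidate):
            continue
        score = benchmark(candidate)
        # Record only what worked: a positive lesson for later planning.
        with lessons.open("a") as fh:
            record = {"plan": plan, "attempt": attempt, "score": score}
            fh.write(json.dumps(record) + "\n")
        return candidate, score
    return None, None
```

The outer Strategy and Plan stages would read the lessons file back before choosing the next plan. The brittle part in practice is making `slow_verify` and `benchmark` trustworthy, not the loop itself.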
I am currently porting pyte to Go through a similar approach (feeding the LLM with a core SPEC and two VT100/VT220 test suites). It's chugging along quite nicely.
Anthropic is actually sort of concerned with not burning through cash and charging people a reasonable price. Open AI doesn’t care. I can use Codex CLI all day and not approach any quotas with just my $20 a month ChatGPT subscription.
I treat coding agents like junior developers and never take my hand off the wheel except for boilerplate refactoring.
The other day I got Codex to one-shot an upgrade to Vite 8 at my day job (a real website with revenue). It worked on this for over 3 hours without intervention (I went to sleep). This is now in production.
(but honestly for a lot of websites and web apps you really can just send it, the stakes are very low for a lot of what most people do, if they're honest with themselves)
I find this absolutely wild. From my experience Codex code quality is still not as good as a human's, so letting Codex do something and not verifying / cleaning up behind it will most likely result in lower code quality and possibly subtle bugs.
For upgrading frameworks and such there are usually not that many architectural decisions to be made, where you care about how exactly something is implemented. Here the OP could probably verify the build works, with all the expected artifacts quite easily.
Agreed. Optimistically let it resolve merge conflicts in an old complex branch. Looked fine at first but was utter slop upon further review. Duplication, wildly unnecessary complexity and all.
I understand why HN doesn't want to devolve into a political forum, but in its spirit, HN is supposed to cover topics that "...are of interest to those working in the tech community". The upvotes on a thread like this demonstrate that these are topics that are indeed of interest, so I wish that there was more of an appetite to allow these discussions to play out. Maybe having a limit on the number of posts per day or per week that could make it to the frontpage could give everyone a bit more of what they want.
Personally, the political threads on HN are the ones in which I learn the most by and large. There simply isn't another community on the web that elicits such thought provoking discussion around these types of issues—reddit doesn't even come close. I hope the policy will change in the future; especially during these tumultuous times, but I wouldn't hold my breath.
...Highly disagree. China can manipulate (and has manipulated) the hearts and minds of the American public, skewing their biases in a way that creates internal chaos and dissent, disrupting institutional order, and sowing distrust of thy neighbor. They've been doing this for at least a decade now, and have played a silent hand in reshaping American politics. If (when) a conflict arises, trust that they will use this tool to manipulate the electorate in a way that benefits them in a zero sum way.
>China can manipulate (and has manipulated) the hearts and minds of the American public, skewing their biases in a way that creates internal chaos and dissent, disrupting institutional order, and sowing distrust of thy neighbor
Nothing a tin-foil hat can't prevent
As if the public needed any manipulation. You can just read what actual public figures, journalists, and such have been openly saying for the last 15-20 years...
When a long-time political player, wife of a President, and presidential candidate calls a big chunk of the population "deplorables", when opposing candidates call for the jailing or even shooting of their opponent, or when the current President is saying what he says and doing what he does, you need more to get "chaos" and "distrust of the neighbor"?
No tin-foil hat needed. There is published research documenting that this is happening [0] on certain topics and there is a lot of reason to believe it is happening in others. Yes, I'm not saying China is the only source of the state of our domestic discontent; but it's fuel to the fire and will be used against us at times in the future when we need national cohesion. See also [1] a 60 minutes episode on a related thread of China infiltrating the US in other ways.
Two things can be true at the same time. It's not just China. Russia does it. So does Israel, many countries, as well as many institutions in the US, both public and private.
It's no secret they do this, they openly discuss it. The things they want can differ but they want to convince you of things. That's obvious.
What makes the game easy for political adversaries (both foreign and domestic) is they don't need to convince the public of a certain thing, they just drive contention. What many people call "engagement". You can see this in 2016 with Russia doing things like forming Facebook groups to spur on protests along with groups to organize counter protests to the protest they helped create. They're not trying to make you pro Russia or pro communist so much as just cause America to be chaotic, ensuring people care less when they do things like invade Ukraine. You also see it in the current administration, which develops the belief in a deep state and says crazy things left and right so that nothing is to be believed and you're constantly distracted. While we're all talking about Greenland we're not talking about Epstein. Every week it's something new. Even Bannon discussed this strategy early on: throw a million things at them and they'll only be able to focus on a few. It's no surprise this creates chaos and confusion. We argue about the things not being discussed as if it's hidden information rather than logistic overload, but there's also not a meaningful difference.
The point isn't to convince you, it's to make you exhausted and apathetic
Damn, imagine if an Australian or a South African billionaire did that with big media companies, oh well, that's just a weird thought, nothing to take from that.
> skewing their biases in a way that creates internal chaos and dissent, disrupting institutional order, and sowing distrust of thy neighbor.
I don't really have respect for this idea; we do this to ourselves far more effectively than people who frankly have a pretty hamfisted cultural understanding- just as we have of china or russia.
IMO influence over real concrete choices is much more alarming. Someone with household-level information has an insane amount of advantage in an election. You can target political messaging street by street to play up the worst aspects of your opposed candidate and the least repulsive aspects of your own candidate.
But if you're in china, the most you can do is try to push towards whatever of the two candidates is least bad for you. And spoiler, zero american politicians are pro-china.
This is the difficulty with propaganda- you have to tailor it to a foreign audience but then the message is changed.
America has been trying to spread its way of life for a hundred years. People liked the fridges and cars but never cared much for the Christianity and crony capitalism.
> And spoiler, zero american politicians are pro-china.
..Other than, well possibly, Trump. Maybe not directly, but the Tiktok deal, withdrawing from the TPP, the eventual outcome of the trade war, the praise for Xi—all stands to benefit China at the expense of the US.
> I don't really have respect for this idea; we do this to ourselves far more effectively than people who frankly have a pretty hamfisted cultural understanding- just as we have of china or russia.
Outside competition allows progress because we have been shown time and time again that the US will just not solve its problems without outside pressure. I'd also argue that any other country in its position would act the same. For example when the USSR was actively competing with the US, they could easily lob a major criticism of the US in capturing 'hearts and minds' of other nations: "Look at how they treat their minorities. Do you really want to work with those people?"
Yes there were very active causes and groups in the US to correct this issue, but that outside pressure forced leadership to be nudged towards corrective action and I wonder if the USSR hadn't been there would we have gotten Civil Rights legislation passed when we did?
Maybe the same will happen with China showing the US how fast they can get stuff done and what they provide as benefits to their citizens vs a declining US. Already TikTok has helped Gen-Z realize how Israel gets so many benefits (universal healthcare, college tuition, benefits for birthing kids etc.) while the US is in massive debt and continues to send money to Israel. That continued propaganda may lead to an eventual backlash and subsequent reform.
US aid to Israel is about 0.05% of the federal budget, and around 3% of Israel’s state budget. That’s nowhere near enough to fund healthcare in either country.
I never said it did but the point is that it is money that does not need to go there when we are trillions in debt already. For years whenever any thought of providing some relief to the middle class in the US is brought up, the response is always "How are you going to pay for that?". How come that question is never asked for outlays to israel but only when it involves the American people?
> "...Passive activities such as watching television have been linked to worse memory and cognitive skills, while ‘active sitting’ like playing cards or reading correlate with better brain health, researchers have found."
...Do these researchers even read this to themselves aloud before hitting publish? It's confounding that they would find "sitting" to be the active ingredient pushing the outcome differential. Obviously, if you remove the bodily posture from the action that the user is engaging in, you would observe the same outcome the researchers did—meaning sitting was not operative here (..duh).
Breaking news at 11: the brain works best when it’s actually used.
Good luck. Amazon banned my seller account 7 years ago because my wife, who was also an amazon seller, used our shared CC (which has my name on it, although she's an authorized user) to pay her $45/month seller fee. The account had $48,000 in it at the time I was banned, and I was never able to get the money back; after an endless number of hours of pleads with their teams, mails to jeff@amazon, working on it from the inside, etc. etc. Be happy that your financial loss was limited.
edit: I posted about it on HN at the time [1]. Apparently looks like at that time I thought I was delisted for a bad review. To be honest, I still don't know why I was delisted, because at least at that time, Amazon would refuse to tell you why you were delisted. You just had to come up with reasons why you may have been, submit an appeal, and then they would come back to you with "sorry, that's not a sufficient appeal". So then you'd have to come up with another reason why you may have been delisted and try to submit another appeal (which itself was a grueling process, for which you would have to wait days/weeks for a response). It was beyond baffling as to why they would operate in that way; it was as if they were trying their absolute hardest to immiserate sellers in the most draconian and malevolent way possible. It was that bad. It was unbelievable to me at the time, and still today, that they could treat their sellers that badly. Yeah, fuck amazon. Seriously.
Amazon has some terms in their TOS that you have to go through mediation on their terms—I tried to get the process started but could never get them to respond when I/we contacted their legal teams. I probably should have pressed harder, I'm sure there was some way to do it, but I wasn't able to figure it out at the time.
I feel like if you've made a good-faith effort to start the arbitration process they require you to do, and they ignore you, that is grounds for a lawsuit. And I doubt a judge would look favorably upon Amazon in that case.
Good luck for the lawsuit. I read your story and I read some other horror stories in here as well.
How is it even legal that they can withhold your 40_000$ over something like 45$, like it's your money? It feels so Black Mirror and sad :< I hope you are doing okay right now man.
I never understand what balls these companies have in making the customer's life hell when the bills are so low. I remember a guy from HN some time ago whom Azure left unable to pay because of an unpaid bill; they literally tried so many things, wanting to pay but unable to. The bill was 20$, and the frustrated user, who I think worked at a large company, started migrating multi-million $ worth of yearly deals from Azure to AWS. (Personally I feel like AWS is ass too, but in that case better than Azure. I personally prefer Hetzner, though it's not a 1:1 comparison.)
One of the reasons why I love companies with good support system (preferably small). So that such stupidity can be stopped & they can have common sense unlike Amazon in this case.
The TOS is worded in such a way that it's almost impossible for them to lose. The best legal minds who are paid 7 figures write these things to be impenetrable. Arbitration is slow, time consuming, and unlikely to lead to the desired outcome. Retaining a lawyer just means more money and time down the drain, and these companies laugh at legal threats, knowing that if it ever got that far they would still win, either by getting the case dismissed or by attrition.
That was basically my sentiment and the sentiment of a few lawyers I consulted for it. I lost so much money in the entire process overall it hurts to think about.
I feel this is def solvable these days -- but the 7 years thing is gonna be tough to overcome at this point.
What I've done on some of these "need to escalate to a human" issues is to buy a ticket to Amazon Accelerate (in Seattle every September), book a Seller Cafe appointment to talk to a leadership team person (I think recently got moved to the captive escalations department), and get someone to talk to face to face.
I know it sounds dumb but I've solved issues that were costing my company 7fig/year sales like this.
I live in Seattle, and in WA state that would put it beyond the limit for a small claims case. Idk, I tried to contact lawyers to take the case and they sent letters to Amazon which Amazon never responded to. I was also going through a pretty serious health battle at the time and so couldn't really devote full attention to it. Now that things are a bit better, I feel like it's so far in the past that I don't know if I would have a reasonable claim so I sort of just let it be.
Maybe not for very broad definitions of OS state, but for specific files/folders/filesystems, this is trivial with FS-level snapshots and copy-on-write.
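As a rough illustration of the per-folder case, here is a minimal sketch. It uses a plain recursive copy as a stand-in, so it is not copy-on-write and costs time and space proportional to the folder; a snapshotting filesystem would do the same job far more cheaply. `snapshot` and `rollback` are hypothetical helper names.

```python
import shutil
import tempfile
from pathlib import Path


def snapshot(folder):
    """Copy folder into a fresh temp directory and return the snapshot path."""
    dest = Path(tempfile.mkdtemp(prefix="snap-")) / "data"
    shutil.copytree(folder, dest)
    return dest


def rollback(folder, snap):
    """Discard folder's current contents and restore the snapshot."""
    shutil.rmtree(folder)
    shutil.copytree(snap, folder)
```

Usage would be `snap = snapshot("project/")` before letting the agent loose and `rollback("project/", snap)` if it makes a mess. Note this only covers the folder itself, not processes, databases or anything else the agent touched.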
Let's assume that you can. For disaster recovery, this is probably acceptable, but it's unacceptable for basically any other purpose. Reverting the whole state of the machine because the AI agent (a single tenant in what is effectively a multi-tenant system) did something incorrect is unacceptable. Managing undo/redo in a multiplayer environment is horrific.
I wonder if in the long run this will lead to the ascent of NixOS. They seem perfect for each other: if you have git and/or a snapshotting filesystem, together with the entire system state being downstream of your .nix file, then go ahead and let the LLM make changes willy-nilly, you can always roll back to a known good version.
NixOS still isn't ready for this world, but if it becomes the natural counterpart to LLM OS tooling, maybe that will speed up development.
Well, there is CRIU on Linux, for what it's worth, which can at least snapshot the state of an application, and I suppose something similar must be available for filesystems as well
Also one can simply run a virtual machine which can do that, but then the issue becomes how apps from outside connect to the VM inside
Ok, you can "easily", but how quickly can you revert to a snapshot? I would guess creating a snapshot for each turn change with an LLM would become too burdensome to allow you to iterate quickly.
You're all assuming a strange world in which no open-weight coding models would be trained in the future. Even in a world where VC spending vanished completely, coding models are such a valuable utility that I'm sure at the very least companies/individuals would crowdsource them on a recurring basis, keeping them up to date.
The value of this technology has been established, it's not leaving anytime soon.
I think FAANG and the like would probably crowdsource it, given that they would, according to the hypothesis presented, only have to do it every few years, and ostensibly are realizing improved developer productivity from them.
I don’t think the incentive to open source is there for $200 million LLM models the same way it is for frameworks like React.
And for closed source LLMs, I’ve yet to see any verifiable metrics that indicate that “productivity” increases are having any external impact—looking at new products released, new games on Steam, new startups founded etc…
Certainly not enough to justify bearing the full cost of training and infrastructure.