Because humans also make stupid random mistakes, and if your test suite and defensive practices don't catch them, the only difference is the rate of errors.
It may be that you've done the risk management and deemed the risk acceptable ("accepting the risk", in risk management terms) with human developers, and that vibe coding changes the maths.
But that is still an admission that your test suite has gaping holes. If that's been allowed to happen consciously, recorded in your risk register, and you all understand the consequences, that can be entirely fine.
But then the problem isn't vibe coding itself, but a risk management choice you made to paper over test suite holes with an assumed level of human diligence.
If the failure mode is invisible, that is a huge risk with human developers too.
Where vibe coding is a risk, it's generally because it exposes a systemic risk that was always there but has so far been successfully hidden, and reveals failing risk management.
I agree, and it's strange that this failure mode continually gets lumped onto AI. The whole point of longer-term software engineering was to make it so that the context in a particular person's head shouldn't affect the ability of a new employee to contribute to a codebase. Turns out everything we do to make sure that's the case for a human also works for an agent.
As far as I can tell, the only reason AI agents currently fail is that they don't have access to the undocumented context inside people's heads, and if we can just properly put that in text somewhere, there will be no problems.
Even if the models stopped getting better today, we'd still see many years of improvements from better harnesses and a better understanding of how to use them. Most people just talk to their agent, and don't, for example, use sub-agents to make the agent iterate and cross-check outcomes. Most people who use AI would see a drastic improvement in outcomes just by experimenting with the "/agents" command in Claude Code (and equivalents elsewhere). Much more so with a well-thought-out agent framework.
A simple plan -> task breakdown + test plan -> execute -> review -> revise (w/optional loops) pipeline of agents will drastically cut down on the amount of manual intervention needed, but most people jump straight to the execute step, and do that step manually, task by task while babysitting their agent.
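To make the shape of that pipeline concrete, here's a minimal sketch, assuming the Anthropic Python SDK with ANTHROPIC_API_KEY set; the model name, the prompts, and the plain-text "execute" step are placeholders (a real harness such as Claude Code sub-agents would give that step tool access and actually run the tests):

```python
# Minimal sketch of a plan -> execute -> review -> revise pipeline.
# Assumes the Anthropic Python SDK; model name and prompts are placeholders.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
MODEL = "claude-sonnet-4-20250514"  # placeholder; use whatever model you have access to

def ask(prompt: str) -> str:
    """One stage = one fresh context, so later stages don't inherit earlier fatigue."""
    resp = client.messages.create(
        model=MODEL,
        max_tokens=4096,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.content[0].text

def pipeline(requirements: str, max_revisions: int = 3) -> str:
    # 1. Plan + task breakdown + test plan, kept as an artefact you can read and amend.
    plan = ask(
        "Write a markdown implementation plan for the following requirements, "
        "broken into small tasks, each with the tests that should prove it works:\n\n"
        + requirements
    )

    # 2. Execute against the plan (in a real harness this is the agent with tool access).
    result = ask("Implement the following plan. Output the full changes:\n\n" + plan)

    # 3/4. Review in a fresh context, then loop on revisions until it passes or we give up.
    for _ in range(max_revisions):
        review = ask(
            "Review this implementation against the plan. Reply APPROVED if it satisfies "
            "the plan and its test plan, otherwise list the concrete problems.\n\n"
            f"PLAN:\n{plan}\n\nIMPLEMENTATION:\n{result}"
        )
        if review.strip().startswith("APPROVED"):
            break
        result = ask(
            "Revise the implementation to address this review feedback:\n\n"
            f"REVIEW:\n{review}\n\nCURRENT IMPLEMENTATION:\n{result}"
        )
    return result
```

The exact code isn't the point; the point is that each stage gets a fresh context and the review/revise loop runs without you babysitting every task.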
This is definitely the case. I have a project that while not wildly profitable yet, is producing real revenue, but that I will not give details of because the moat is so small. The main moat is that I know the potential is real, and hopefully not enough other people do, yet. I know it will disappear quickly, so I'm trying to make what I can of it while it's there. I may talk about it once the opportunity is gone.
It involves a whole raft of complex agents + code they've written, but that code and the agents were written by AI over a very short span of time. And as much as I'd like to stroke my own ego and assume it's one of a kind, realistically if I can do it, someone else can too.
Because very few people know how to use AI. I teach AI courses on the side. I've done auditing of supervised fine-tuning and RLHF projects for a major provider. From seeing real prompts, many specifically from people who work with agents every day, people do not yet have the faintest clue how to productively prompt AI. A lot of people prompt them in ways that are barely coherent.
Even if models stopped improving today, it'd take years before we see the full effects of people slowly gaining the skills needed to leverage them.
You'd be surprised how low the bar is. What I'm seeing is down to the level of people not writing complete sentences.
There doesn't need to be any "magic" there. Just clearly state your requirements. And start by asking the model to plan out the changes and write a markdown file with a plan first (I prefer this over e.g. Claude Code's plan mode, because I like to keep that artefact), including planning out tests.
If a colleague of yours who's not intimately familiar with the project could get the plan without needing to ask follow-up questions (but is able to spend time digging through the code), you've done pretty well.
You can go overboard with agents to assist in reviewing the code, running tests, etc. as well, but that's the second 90%. The first 90% is just writing a coherent request for a plan, reading the plan, asking for revisions until it makes sense, and telling it to implement it.
You're right, they should know better, but I think a lot of them have gotten away with it because most of them are not expected to produce written material setting out missing assumptions etc. and breaking down the task into more detail before proceeding to work, so a lot have never gotten the practice.
Once people have had the experience of being a lead and having to pass tasks to other developers a few times, most seem to develop this skill at least to a basic level, but even then it's often informal and they don't get enough practice documenting the details in one go, say by improving a ticket.
Not surprising. Many folks struggle with writing (hence why ChatGPT is so popular for writing stuff), so people struggling to coherently express what they want and how makes sense.
But the big models have come a long way in this regard. Claude + Opus especially. You can build something with a super small prompt and keep hammering it with fix prompts until you get what you want. It's not efficient, but it's doable, and it's much better than half a year ago, when you'd have had to write a full spec.
> Claude + Opus especially. You can build something with a super small prompt and keep hammering it with fix prompts until you get what you want.
LOL: especially with Claude, that only works in maybe 1 out of 10 cases?
Claude's output is usually (near) production-ready on the first prompt if you precisely describe where you are, what you want, how to get it, and what the result should be.
This is exactly it. A lot of people use it that way. And it's still a vast improvement, but they could also generally do a lot better with some training. I think this is one of the areas where you'll unfortunately see a big gap developing between developers who do this well, and have the models work undisturbed for longer and longer while they do other stuff, and those who end up needing a lot more rework than necessary.
The irony is that, while far from perfect, an LLM-based fact-checking agent is likely to be far more diligent (though it still needs human review as well), because it's trivial to ensure it has no memory of having already worked through a long list of checks (if you pass e.g. Claude a long list directly in the same context, it is prone to deciding the task is "tedious" and starting to take shortcuts).
But at the same time, doing that makes it even more likely the human in the loop will get sloppy, because there'll be even fewer cases where their input is actually needed.
I'm wondering if you need to start inserting intentional canaries to check whether humans are actually doing sufficiently thorough reviews.
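As a rough illustration (again assuming the Anthropic Python SDK; the prompts, model name, and helper names are made up for the example), both the "no memory between checks" property and the canary idea are only a few lines:

```python
# Sketch: fresh-context fact checking plus canaries for auditing the human reviewer.
# Assumes the Anthropic Python SDK; names and prompts are illustrative only.
import random
import anthropic

client = anthropic.Anthropic()
MODEL = "claude-sonnet-4-20250514"  # placeholder model name

def check_claim(claim: str, source_text: str) -> str:
    """One claim per call: no shared context, so no shortcut-taking on item 37 of 200."""
    resp = client.messages.create(
        model=MODEL,
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": "Does the source text support this claim? Answer SUPPORTED, "
                       "UNSUPPORTED or CONTRADICTED, then explain briefly.\n\n"
                       f"CLAIM: {claim}\n\nSOURCE:\n{source_text}",
        }],
    )
    return resp.content[0].text

def inject_canaries(claims: list[str], canaries: list[str]) -> tuple[list[str], set[str]]:
    """Mix deliberately wrong claims into the batch; if the human reviewer never
    flags any of them, the review step has quietly turned into a rubber stamp."""
    mixed = claims + canaries
    random.shuffle(mixed)
    return mixed, set(canaries)
```

You'd then compare which canaries the reviewer actually flagged against the known set, which gives you a cheap, ongoing measure of whether the human in the loop is still really in the loop.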
One of the effects of communicating this way is that people who are not operating in good faith will tend to quickly out themselves, and often getting them to do that is enough.