(Disclaimer: Am on the Codex team.)
We're basically trying to build a teammate that can do short, iterative work with you, and then, as you build trust (and configuration), take on the longer tasks you delegate to it.
I really wish model performance messaging and benchmarks were more focused on perfecting short, iterative tasks instead of long-running work.
As a startup founder and engineer, I'm not constrained by the number of 10,000+ line, 0->1 demos I can ship. I'm constrained by the quality of the 100->101 work: the tight, 150-line feature additions and code cleanups I can write.
It feels like the demos, funding, and hype all want to sell me entire PR rewrites, but what I need is the best possible model for iterative work, one that keeps me in the loop.
I still use codex - but I use it incredibly iteratively (I give it very narrowly scoped tasks and watch it like a hawk, giving tons of feedback). I don't use it because of its ability to code for 24 hours. I use it because when I give it those narrowly scoped tasks, it writes better code than any other model. (Because of its latency, I have 2-4 of these conversations going at the same time.)
But the codex product + model adds a lot of friction to this process. I have to prompt aggressively to override whatever "be extremely precise" prompting the model gets natively, so that it doesn't send me 20+ bullet points of extraordinarily dense prose on every message. I also have to carefully manage how it handles testing: it will widen any DI and keep massive amounts of legacy code to make sure functionality changes don't break old tests (rather than updating them), and to make sure any difficult test can have its hardest parts mocked away.
In general, codex doesn't feel like an amazing tool that I have sitting at my right hand. It feels like a teenage genius who has been designed to do tasks autonomously, and who I constantly have to monitor and rein in.
Codex(-cli) is an outsourced consultant who refuses to say "I can't do that" and will go to extreme lengths to complete a task fully before reporting anything. It's not a "teammate".
It also doesn't communicate much while it's working compared to Claude. So it's really hard to interrupt it while it's making a mistake.
Also, as a Go programmer, the sandbox is completely crazy. Codex can't access any of the Go module caches (in my home directory), so it has to resort to crazy tricks to bring them INSIDE the project directory - which it keeps forgetting to do (since the commands have to run with specific env vars), and then it just ... doesn't run tests, for example, because it couldn't.
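For context, the trick it reaches for looks roughly like this. This is only a sketch of the workaround, assuming a sandbox that can only write under the project root; the .cache paths are illustrative, but GOMODCACHE and GOCACHE are the actual Go env vars that control where the caches live:

```sh
# Relocate Go's caches inside the repo so `go test` can run in a write-restricted sandbox.
export GOMODCACHE="$PWD/.cache/go-mod"    # default is $GOPATH/pkg/mod, i.e. in the home directory
export GOCACHE="$PWD/.cache/go-build"     # default is ~/.cache/go-build
go test ./...
```

Every command has to carry those variables, which is exactly the step codex keeps forgetting.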
The only way I've found to make that problem go away is to run it with the --omg-super-dangerous-give-every-permission-ever switch, just so that it can do the basic work I need it to do.
Maybe give us something between the ultra-safe sandbox (which just refused to run "ps" 15 minutes ago to check whether a process was running) and the "let me do anything anywhere" option. Some sane defaults, please.
The "# of model-generated tokens per response" chart in [the blog introducing gpt-5-codex](https://openai.com/index/introducing-upgrades-to-codex/) shows an example of how we're improving the model good at both.