amluto's comments | Hacker News

$200? Does this use reasoning? Does it involve forgetting to use KV caching?

This should cost well under $1. Process the prompt. Then, for each word, input that word and then the end-of-prompt token, get your one token of output (maybe two if your favorite model wants to start with a start-of-reply token), and that’s it.


I'm hoping someone makes an agent that fixes the container situation better:

> If you're uncomfortable with full access, run pi inside a container or use a different tool if you need (faux) guardrails.

I'm sick of doing this. I also don't want faux guardrails. What I do want is an agent front-end that is trustworthy in the sense that it will not, even when instructed by the LLM inside, do anything to my local machine. So it should have tools that run in a container. And it should have really nice features like tools that can control a container and create and start containers within appropriate constraints.

In other words, the 'edit' tool is scoped to whatever I've told the front-end that it can access. So is 'bash' and therefore anything bash does. This isn't a heuristic like everyone running in non-YOLO-mode does today -- it’s more like a traditional capability system. If I want to use gVisor instead of Docker, that should be a very small adaptation. Or Firecracker or really anything else. Or even some random UART connection to some embedded device, where I want to control it with an agent but the device is neither capable of running the front-end nor of connecting to the internet (and may not even have enough RAM to store a conversation!).
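Roughly the shape I have in mind, as a hypothetical sketch (none of these names are a real tool API, and the docker invocation is just a stand-in for whatever backend):

    // Hypothetical sketch only. Swap the "docker" command for gVisor,
    // Firecracker, or anything else with the same shape.
    #include <string>
    #include <sys/wait.h>
    #include <unistd.h>
    #include <vector>

    struct ContainerScope {
        std::string container_id;  // created up front by the front-end
    };

    // The "bash" tool: runs a command inside the scoped container and nowhere
    // else. The argv vector goes straight to exec, so the LLM-supplied command
    // is never re-parsed by a shell on the host.
    int bash_tool(const ContainerScope& scope, const std::string& cmd) {
        std::vector<std::string> args = {"docker", "exec", scope.container_id,
                                         "sh", "-c", cmd};
        std::vector<char*> argv;
        for (auto& a : args) argv.push_back(a.data());
        argv.push_back(nullptr);

        pid_t pid = fork();
        if (pid == 0) {
            execvp(argv[0], argv.data());
            _exit(127);  // exec failed
        }
        int status = 0;
        waitpid(pid, &status, 0);
        return status;
    }

The point is that there is simply no tool that takes a host path or runs a host command; the scope is the capability.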

I think this would be both easier to use and more secure than what's around right now. Instead of making a container for a project and then dealing with installing the agent into the container, I want to run the agent front-end and then say "Please make a container based on such-and-such image and build me this app inside." Or "Please make three containers as follows".

As a side bonus, this would make designing a container sandbox sooooo much easier, since the agent front-end would not itself need to be compatible with the sandbox. So I could run a container with --network none and still access the inference API.

Contrast with today, where I wanted to make a silly Node app. Step 1: Ask ChatGPT (the web app) to make me a Dockerfile that sets up the right tools including codex-rs and then curse at it because GPT-5.2 is really remarkably bad at this. This sucks, and the agent tool should be able to do this for me, but that would currently require a completely unacceptable degree of YOLO.

(I want an IDE that works like this too. vscode's security model is comically poor. Hmm, an IDE is kind of like an agent front-end except the tools are stronger and there's no AI involved. These things could share code.)


This is actually something I've been playing with. Containers/VMs managed by a daemon, whose lifecycles an agent can drive: it can open sessions on them and execute commands in them, with policy enforced via OPA/Rego over gRPC. The cherry on top is Envoy for egress, with whitelists and credential injection.

One cool thing is that you can run a vscode service on these containers and open the port up to the outside world, then code in and watch a project come to life.


> The phrase "Chinese Mainland" when used in English comes loaded with the suggestion that Taiwan is rightfully part of China

For better or for worse, many people on both sides of the strait have, for decades and probably even since a bit before 1949 (I was not alive at the time), used language along these lines suggesting that Taiwan is part of China. I think that, at this point, the term “mainland China” is just the default.

That being said, a person from China could just say they’re from China and no one would be confused. This is in contrast to someone saying they’re Chinese, which can be ambiguous.


If they were taking that approach, they would have absolutely first-class integration between AI tools and user data, complete with proper isolation for security and privacy and convenient ways for users to give agents access to the right things. And they would bide their time for the right models to show up at the right price with the right privacy guarantees.

I see no evidence of this happening.


As an outsider, the only thing the two of you disagree on is timing. I probably side with the ‘time is running out’ team at the current juncture.

It’s probably not really related, but this bug and the saga of OpenAI trying and failing to fix it for two weeks are not indicative of a functional company:

https://github.com/openai/codex/issues/9253

OTOH, if Anthropic did that to Claude Code, there were no moderately straightforward workaround, and Anthropic didn’t revert it quickly, it might actually be a risk-the-whole-business issue. Nothing makes people jump ship quite like the ship refusing to go anywhere for weeks while the skipper fumbles around and keeps claiming to have fixed the engines.

Also, the fact that it’s not major news that most business users cannot log in to the agent CLI for two weeks running suggests that OpenAI has rather less developer traction than they would like. (Personal users are fine. Users who are running locally on an X11-compatible distro and thus have DISPLAY set are okay because the new behavior doesn’t trigger. It kind of seems like everyone else gets nonsense errors out of the login flow, with the precise failures changing every couple of days while OpenAI fixes yet another bug.)


I don't know what you're so surprised about. The ticket reads like any other typical [Big] enterprise ticket. UI works, headless doesn't (headless is what only hackers use, so not a priority, etc.). Oh, they found the support guy who knows what headless is, and the doc page with a number of workarounds. There's even an ssh tunnel (how did that make it into enterprise docs?!) and the classic: copy logged-in credentials from the UI machine once you've logged in there. Bla-bla-bla, and again the classic:

"Root Cause

The backend enforces an Enterprise-only entitlement for codex_device_code_auth on POST /backend-api/accounts/{account_id}/beta_features. Your account is on the Team plan, so the server rejects the toggle with {"detail":"Enterprise plan required."} "

and so on and so forth. On any given day I have several such long-term tickets that ultimately get escalated to me (I'm in dev, and usually the guy who would pull up the page with the ssh tunnel or the credential copying :)


Sort of?

The backstory here is that codex-rs (OpenAI’s CLI agent harness) launched an actual headless login mechanism, just like Claude Code has had forever. And it didn’t work, from day one. And they can’t be bothered to revert it for some reason.

Sure, big enterprises are inept. But this tool is fundamentally a command line tool. It runs in a terminal. It’s their answer to one of their top two competitors’ flagship product. For a company that is in some kind of code red, the fact that they cannot get their ducks in a row to fix it is not a good sign.

Keep in mind that OpenAI is a young company. They shouldn’t have a thicket of ancient garbage to wade through to fix this — it’s not as if this is some complex Active Directory issue that no one knows how to fix because the design is 30-40 years old and supports layers and layers of legacy garbage.


Funny that they can't just get the "AI" to fix it.

I expect the “AI” created it in the first place.

You still need to get engineers to actually dispatch that work, test it, and possibly update the backend. Each of those can already be done via AI, but actually doing all of that in a large environment - we're not there yet.

This issue has one thumbs up, nobody cares about it.

Because approximately zero smallish businesses use Codex, perhaps?

It’s also possible that the majority of people hitting it are using the actual website support (which is utterly and completely useless), since the bug is only a bug in codex-rs to the extent that codex-rs should have either reverted or deployed a workaround already.


Waymo’s performance, once the pedestrian was revealed, sounds pretty good. But is 17mph a safe speed at an active school dropoff area? I admit that I don’t think I ever personally pay attention to the speedometer in such a place, but 17mph seems excessive even for an ordinary parking lot.

I wonder whether Waymo’s model notices that small children are present or likely to be present and that it should leave extra margin for error.

(My general impression observing Waymo vehicles is that they’ve gone from being obnoxiously cautious to often rather aggressive.)


I bet most drivers plow through that area at 30mph (since it's a 25mph limit) instead of driving as slow as 16.

Even people being all indignant on HN.


True,

But that's not people being rational. That's people being dumb and impatient -- most of us will admit we've done impatient things in a car.

But shouldn't an AV drive like we wish we would drive on our best behavior?


> But is 17mph a safe speed at an active school dropoff area?

Now you're asking interesting questions... Technically, in CA, the speed limit in school zones is 25 mph (which local authorities can lower to 15 mph, as needed). In this case, that would be something the investigation would check, of course. But regardless of that, 17 mph per se is not a very fast speed (my gut check: turning at intersections at > 10-11 mph feels fast, but going straight at 15-20 mph doesn't feel fast; YMMV). But more generally, in the presence of child VRUs (vulnerable road users), it is prudent to drive slowly just because of the randomness factor (children being the most unaware of critters). Did the Waymo see the kids around in the area? If so, how many and where? And how/where were they running or moving to? All of that is investigation data...

My 2c is that Waymo already took all of that into account and concluded that 17 mph was indeed a good speed to move at...

...which leads to your observation below:

> (My general impression observing Waymo vehicles is that they’ve gone from being obnoxiously cautious to often rather aggressive.)

Yes, I have indeed made that same observation. The Waymos of 2 years ago were very cautious; now they seem much more assertive, even a bit aggressive (though that would be tough to define). That is a driving policy decision (cautious vs assertive vs aggressive).

One could argue whether 17 mph was indeed the "right" decision. My gut feel is that Waymo will argue it was (but they might well make the driving policy more cautious, especially in the presence of VRUs, and child VRUs in particular).


> Technically, in CA, the speed limit in school zones are 25 mph

Legally a speed limit is a 'limit' on speed, not a suggested or safe speed. So it's never a valid legal argument that you were merely driving under the limit; the standard is that you slow down or give more room in places like a school drop-off while kids are being dropped off or picked up.


Yep, if I plow into stationary vehicles on the highway while going the "limit" that's not a very solid defense is it?

> Yep, if I plow into stationary vehicles on the highway while going the "limit" that's not a very solid defense is it?

Well, people are doing a lot of what-about-ism in this situation. Some of that is warranted, but I'd posit that analyzing one "part" of this scenario in isolation is not helpful, nor is this the way Waymo will go about analyzing this scenario with their tech teams.

Let's consider, for argument's sake, that the Waymo bot had indeed slammed on the brakes with max decel and had come to a complete (and sudden) stop barely 5cm in front of the kid. Would THAT be considered a safe response??

If I were a regulator, I'd still ding the bot with an "unsafe response" ticket and send that report to Waymo. If YOU were that pedestrian, you'd feel unsafe too. (I definitely have seen such responses in my AV testing experience.) One could argue, again, that that would have been legally not-at-fault, but socially it would be completely unacceptable (as one would rightly guess).

And so it is.

The full behavior sequence is in question: when did Waymo see the kid(s), where and how were they moving, how did it predict (or fail to predict) where they would move in the next 2s, etc. The entire sequence -- from perception to motion prediction to planning to control -- will be evaluated to understand where the failure to produce a proper response may have occurred.

As I mentioned earlier, the proper response is, under ideal conditions, one that would have caused the vehicle to stop at a safe distance from the VRU (0.5m-1m, ideally). Failing that, to reduce the kinetic energy to the minimum possible ("min expected response")... which may still imply a "contact" (= collision) but at reduced momentum, to minimize the chance of damage.

I suspect (though I dont know for sure) that Waymo executed the minimum expected response, and that likely was due to the driving policy.

We won't know until we see the full sequence from inside the Waymo. Everything else is speculation.

[Disclaimer: I dont work for Waymo; no affiliation, etc etc]


Even some much simpler things are extremely half-baked. For example, here’s one I encountered recently:

    alignas(16) char buf[128];
What type is buf? What alignment does that type have? What alignment does buf have? Does the standard even say that alignof(buf) is a valid expression? The answers barely make sense.

Given that this is the recommended replacement for aligned_storage, it’s kind of embarrassing that it works so poorly. My solution is to wrap it in a struct so that at least one aligned type is involved and so that static_assert can query it.
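Roughly what the wrapper looks like, as a sketch (the type name is mine):

    // Wrap the buffer in a struct so that a *type* carries the alignment,
    // not just one object, and static_assert can then query it portably.
    struct alignas(16) Buf128 {
        char bytes[128];
    };

    static_assert(alignof(Buf128) == 16);  // the type itself is 16-byte aligned
    static_assert(sizeof(Buf128) == 128);  // and no padding sneaks in

    Buf128 buf;  // every instance gets the alignment, and code can name the type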


The only people who write code like that have plenty of time to understand those questions - and why the correct answer is what it is is critically important to that line of code working correctly. The vast majority of us would never write a line like that - we let the compiler care about those details. The vast majority of the time, 'just use vector' is the right answer, with zero real-world exceptions.

But in the rare case you need code like that, be glad C++ has you covered.


> and why the correct answer is what it is is critically important to that line of code working correctly.

> but in the rare case you need code like that be glad C++ has you covered

I strongly disagree. alignof(buf) works correctly but is a GCC extension. alignof(decltype(buf)) is 1, because alignas is a giant kludge instead of a reasonable feature. C++ only barely has me covered here.
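To spell out the problem (the last line is the GCC extension, left commented out so the snippet stays standard C++):

    alignas(16) char buf[128];

    static_assert(alignof(decltype(buf)) == 1);  // decltype(buf) is plain char[128]
    static_assert(alignof(char[128]) == 1);      // alignas never touched the type
    // static_assert(alignof(buf) == 16);        // non-standard, but GCC accepts it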


I think that SFINAE and, to a lesser extent, concepts are fundamentally a bit odd when multiple translation units are involved, but otherwise I don’t see the problem.

It’s regrettable that the question of whether a type meets the requirements to call some overload or to branch in a particular if constexpr expression, etc, can depend on what else is in scope.
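A contrived sketch of the kind of thing I mean (all names made up; the second translation unit exists only in the comments):

    // tu_a.cpp -- no to_text(Widget) is declared anywhere in this TU.
    #include <string>

    struct Widget {};

    template <typename T>
    std::string describe(T t) {
        // Which branch is taken depends on what overloads are visible (or
        // reachable via ADL) when this template is instantiated.
        if constexpr (requires { to_text(t); })
            return to_text(t);
        else
            return "<no to_text in scope>";
    }

    std::string f(Widget w) { return describe(w); }  // else branch here

    // In another TU that declares `std::string to_text(Widget);` before
    // instantiating describe<Widget>, the same call takes the first branch,
    // so the two TUs disagree about the "same" specialization (IFNDR).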


This is one of those wicked language design problems that comes up again and again across languages, and they solve it in different ways.

In Haskell, you can't ever check that a type doesn't implement a type class.

In Golang, a type can only implement an interface if the implementation is defined in the same module as the type.

In C++, in typical C++ style, it's the wild west: the compiler doesn't put guardrails on, and it does what you would expect it to do if you think about how the compiler works, which probably isn't what you want.

I don't know what Rust does.


Rust's generics are entirely type-based, not syntax-based. They must declare all the traits (concepts) they need. The type system has restrictions that prevent violating ODR. It's very reliable, but some use-cases that would be basic in C++ (numeric code) can be tedious to define.

Generic code is stored in libraries as MIR, which is halfway between an AST and LLVM IR. It still gets monomorphized and is slow to optimize, but at least it doesn't pay the reparsing cost.


Rust gets around the shortcomings of its generics by providing an absurdly powerful macro engine.

It's a great idea when not abused too much for creating weird little DSLs that no one is able to read.


How does it handle an implementation of a trait being in scope in one compilation unit and out of scope in another? That's the wicked problem.

It’s impossible (?) due to the “coherence” rule. A type A can implement a trait B in two places: the crate where A is defined or the crate where B is defined. So if you can see A and B, you know definitively whether A implements B.

The actual rule is more complex due to generics:

https://github.com/rust-lang/rfcs/blob/master/text/2451-re-r...

and that document doesn’t actually seem to think that this particular property is critical.


Whoa, did Tesla pull an Apple? Siri used to work okay on the iPhone, but once it got LLMed it frequently sits there indefinitely while failing to make any progress on even the simplest commands.

Apple did an even worse job than you think: they didn't even LLM Siri so I guess it just broke.

I have precisely one Windows thing I use regularly, and it has a giant window that needs lots of pixels, and I use it over Remote Desktop. The results are erratic and frequently awful.
