
> GPT‑5 also excels at long-running agentic tasks—achieving SOTA results on τ2-bench telecom (96.7%), a tool-calling benchmark released just 2 months ago.

Yes, but it does worse than o3 on the airline version of that benchmark. The prose is totally cherry-picked.



I wrote that section and made the graphs, so you can blame me. We no doubt highlight the evals that make us look good, but in this particular case I think the emphasis on telecom isn't unprincipled cherry picking.

Telecom was made after retail & airline, and fixes some of their problems. In retail and airline, the model is graded against a single ground-truth reference solution. But in reality there can be multiple valid ways to complete the task, and perfectly good answers can receive scores of 0 from the automatic grading. This, along with some user-model issues, is partly why airline and retail scores haven't climbed with the latest generations of models and are stuck around 60% / 80%. Even a literal superintelligence would probably plateau here.

In telecom, the authors (Barres et al.) made the grading less brittle by grading against outcome states, which can be reached via multiple solutions, rather than by matching against a single specific solution. They also improved the user modeling, among other things. So telecom is the much better eval, with a much cleaner signal, which is partly why models can score as high as 97% instead of getting mired at 60%/80% by brittle grading and other issues.
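To make the grading difference concrete, here's a toy sketch (my own illustration, not the actual tau2-bench code):

    # Brittle grading: compare the agent's actions to a single reference
    # solution. A different-but-valid sequence of tool calls scores 0.
    def grade_by_reference(agent_actions, reference_actions):
        return 1.0 if agent_actions == reference_actions else 0.0

    # Outcome-state grading: replay the agent's actions against the
    # environment and check the final state, however the agent got there.
    def grade_by_outcome(initial_state, agent_actions, expected_state):
        state = dict(initial_state)
        for apply_action in agent_actions:
            state = apply_action(state)  # each action returns the new state
        return 1.0 if state == expected_state else 0.0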

Even if I had never seen GPT-5's numbers, I like to think I would have said ahead of time that telecom is much better than airline/retail for measuring tool use.

Incidentally, another thing to keep in mind when looking critically at the scores OpenAI and others report on these evals is that the evals give no partial credit - so sometimes a very good model does all but one thing perfectly and gets a very poor score. If you tried generalizing to tasks that don't trigger that quirk, you might get much better performance than the eval scores suggest (or vice versa, if they trigger a quirk not present in the eval).
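A toy example of what no partial credit means in practice (again my illustration, not the real harness):

    # All-or-nothing grading: nine of ten subgoals correct still scores 0.
    def grade_all_or_nothing(subgoal_results):
        return 1.0 if all(subgoal_results) else 0.0

    print(grade_all_or_nothing([True] * 9 + [False]))  # 0.0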

Here's the tau2-bench paper if anyone wants to read more: https://arxiv.org/abs/2506.07982


Thanks for your input!


OpenAI hiring BCG alumni is all we need to know


No need to be like that mate.


How does the cost compare, though? From my understanding o3 is pretty expensive to run. Is GPT-5 less costly? If so, and the performance is close to o3 but cheaper, it may still be a good improvement.


I find it strange that GPT-5 is cheaper than GPT-4.1 on input tokens and only slightly more expensive on output tokens. Is that marketing, or does it actually reflect the underlying compute cost?


Very likely an actual reflection. That's probably their real achievement here, and the key reason they're publishing it as GPT-5: at or near the best on everything while being one model, and substantially cheaper than the competition.


But it can’t do audio in/out or image out. Feels like an architectural step back.


My understanding is that image output is pretty separate and if it doesn’t seem that way, they’re just abstracting several models into one name


Maybe with the router mechanism (to mini or standard) they estimate the average cost will be a lot lower for ChatGPT, because the capable model won't be answering dumb questions, and they pass that saving on to devs?


I think the router applies to the ChatGPT app. The developer APIs expose manual control to select the specific model and the level of reasoning.
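Roughly like this with the OpenAI Python SDK (a sketch from memory of the docs; parameter names may have drifted, so double-check before relying on it):

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    # In the API you pick the exact model and reasoning effort yourself;
    # there is no automatic router like in the ChatGPT app.
    response = client.responses.create(
        model="gpt-5",                    # or e.g. gpt-5-mini
        reasoning={"effort": "minimal"},  # minimal / low / medium / high
        input="Summarize tau2-bench in one sentence.",
    )
    print(response.output_text)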


I mean... they themselves included that information in the post. It's not exactly a gotcha.



