
GPQA Diamond: gpt-oss-120b: 80.1%, Qwen3-235B-A22B-Thinking-2507: 81.1%

Humanity’s Last Exam: gpt-oss-120b (tools): 19.0%, gpt-oss-120b (no tools): 14.9%, Qwen3-235B-A22B-Thinking-2507: 18.2%



Wow - I will give it a try then. I'm cynical about OpenAI min-maxing benchmarks, but I'm still trying to be optimistic, since this at 8-bit would be such a nice fit for Apple Silicon.


Even better, it's 4-bit.
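
Rough math on why that matters for Apple Silicon (a sketch; the ~117B total-parameter count and ~4.25 effective bits/weight for MXFP4 are assumptions from public model-card figures, not measurements):

    # Back-of-envelope weight memory for a quantized checkpoint.
    # Parameter count and bits/weight below are assumptions.
    def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
        return n_params * bits_per_weight / 8 / 1e9

    print(weight_memory_gb(117e9, 4.25))  # ~62 GB: fits 64-96 GB unified memory
    print(weight_memory_gb(117e9, 8.0))   # ~117 GB: too big for most Macs at 8-bit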


GLM-4.5 seems on par as well.


GLM-4.5 seems to outperform it on TauBench, too. And it's suspicious that OpenAI isn't sharing numbers for quite a few useful benchmarks (nothing related to coding, for example).

One positive thing I see is the parameter count and size: it should provide more economical inference than the current open-source SOTA.
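
To put rough numbers on that, a back-of-envelope sketch: per-token decode cost in an MoE model scales with *active* parameters, not total. The active-parameter figures (~5.1B for gpt-oss-120b, ~22B for Qwen3-235B-A22B) are assumptions taken from the published model cards, and 2 FLOPs per active weight per token is the usual approximation:

    # Decode-time FLOPs per token scale with active (routed) params.
    # Active-param figures below are assumptions from model cards.
    def decode_flops_per_token(active_params: float) -> float:
        return 2 * active_params  # ~2 FLOPs per active weight per token

    models = {
        "gpt-oss-120b (~5.1B active)": 5.1e9,
        "Qwen3-235B-A22B (~22B active)": 22e9,
    }
    for name, active in models.items():
        print(f"{name}: ~{decode_flops_per_token(active) / 1e9:.0f} GFLOPs/token")

By this estimate gpt-oss-120b does roughly a quarter of the per-token compute, which is where the cheaper inference would come from.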


Was the Qwen model using tools for Humanity's Last Exam?



