Seedream 5 Lite is honestly extremely disappointing: its text-to-image is way worse than 4.5, and its image editing is fine, but that's it. It's way, way behind NB2.
The OP's comment on the post is clearly Markdown-formatted; real humans don't write like that on HN.
The readme is very obviously Claude-written (or a similar model - certainly not GPT); if you check enough vibecoded projects, you'll easily spot those readmes.
The style of the HTML page, as noted by others.
Useless comments in the source code, which humans also write, but which LLMs produce more often:
I did not. The HTML was generated by DeepSeek. Claude is far too expensive for that. This is only experimental code. I don't think it is worth paying Claude to test code that has already been peer-reviewed theoretically.
I swear, I am starting to feel like these complaints about how "obviously" something is AI-written are the human equivalent of "you are absolutely right" -- it's like some kind of automatic response now.
I don't know how to explain it, but I've interacted with LLMs for multiple years now, and especially a lot of time with the recent-ish frontier models, so I can detect most AI writing quite reliably. Sure, you might disagree, but I'm fairly certain this entire post is an LLM output.
I highly doubt some of those results. GPT 5.2/Codex is incredible for cybersecurity and CTFs, and 5.3 Codex (not on the API yet) even more so. There is absolutely no way it's below DeepSeek or Haiku. Seems like a harness issue, or maybe they tested those models at no/low reasoning effort?
Just for fun, I ran dnsmasq-backdoor-detect-printf (which has a 0% pass rate on your leaderboard with GPT models) using --agent codex instead of terminus-2 with gpt-5.2-codex, and it identified the backdoor successfully on the first try. I honestly think it's a harness issue; could you re-run the benchmarks with Codex for gpt-5.2-codex and gpt-5.2?
Overall, it matches my experience, and it is actually good (as good as the best models at localization, with a still-impressive 0% false positive rate):
https://quesma.com/benchmarks/binaryaudit/
Will rerun it on GPT-5.3-Codex shortly, now that the API is out (though the effort setting does not work correctly yet, and for "medium" the actual effort is very low).
To be honest, it surprised us as well. I used GPT 5.2 Codex in Cursor for decompiling an old game and it worked well (way better than Claude Code with Opus 4.5).
We tested Opus 4.6, but we are waiting for the public API to test GPT 5.3 Codex.
At the same time, tasks can vary, and not everything that works best end-to-end is the same as what works well in a typical, interactive workflow.
We used the Terminus 2 agent, as it is the default used by Harbor (https://harborframework.com/) and we wanted to stay unbiased. Very likely other frameworks would change the results.
Yeah, they all do sometimes, but the agent decides what to allow, and you can choose not to use it. This gives the user full control over the sandbox, and you can run the agent in YOLO mode.
To be honest, while KolibriOS is open-source, I wouldn't call it all that "active". MenuetOS has progressed much further than KolibriOS over the years, both in performance (it has SMP support!) and in being 64-bit.