Tiberium's comments

Seedream 5 Lite is honestly extremely disappointing: its text-to-image is way worse than 4.5's, and image editing is fine, but that's it. It's way, way behind NB2.

That restriction is only for Europe; you should try a US VPN or, in the worst case, use it via Vertex AI, which allows you to generate anyone.

The OP's comment on the post is clearly Markdown-formatted; real humans don't write like that on HN.

The README is very obviously Claude-written (or a similar model, certainly not GPT); if you check enough vibecoded projects you'll easily spot those READMEs.

The style of the HTML page, as noted by others.

Useless comments in the source code, which humans also write, but LLMs do far more often:

// Basic random double

static inline double rand_double() { return (double)rand() / (double)RAND_MAX; }


I did not. The HTML was generated by Deepseek. Claude is far too expensive for that. This is only experimental code. I don't think it's worth paying for Claude to test code that was already peer-reviewed theoretically.

I'm sorry about the issue, though I couldn't help but notice: you want to talk to a real human, yet this very post is completely LLM-written/edited.

I swear, I am starting to feel like these complaints about how "obviously" something is AI written are the human equivalent of "you are absolutely right" -- it's like some kind of automatic response now

I don't know how to explain it, but I've interacted with LLMs for multiple years now, and especially a lot of time with the recent-ish frontier models, so I can detect most AI writing quite reliably. Sure, you might disagree, but I'm fairly certain this entire post is an LLM output.

it's turtle bots all the way down

I highly doubt some of those results. GPT 5.2/+codex is incredible for cybersecurity and CTFs, and 5.3 Codex (not on the API yet) even more so. There is absolutely no way it's below Deepseek or Haiku. Seems like a harness issue, or they tested those models at none/low reasoning?

I do eval and training data sets for a living, and in niche skills you can find plenty of surprises.

The code is open-source; you can run it yourself using the Harbor framework:

git clone git@github.com:QuesmaOrg/BinaryAudit.git

export OPENROUTER_API_KEY=...

harbor run --path tasks --task-name lighttpd-* --agent terminus-2 --model openrouter/anthropic/claude-opus-4.6 --model openrouter/google/gemini-3-pro-preview --model openrouter/openai/gpt-5.2 --n-attempts 3

Please open a PR if you find something interesting, though our domain experts already spend a fair amount of time looking at trajectories.


Just for fun, I ran dnsmasq-backdoor-detect-printf (which has a 0% pass rate on your leaderboard with GPT models) with --agent codex instead of terminus-2, using gpt-5.2-codex, and it identified the backdoor successfully on the first try. I honestly think it's a harness issue; could you re-run the benchmarks with Codex for gpt-5.2-codex and gpt-5.2?
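For anyone who wants to reproduce this, the invocation would look roughly like the following. This is a sketch only: the task name comes from the leaderboard, and the flags are assumed from the `harbor run` command upthread, not verified against Harbor's docs.

```shell
# Sketch: same harbor command as upthread, swapping in the Codex agent
# and the gpt-5.2-codex model. Flags assumed, not verified.
export OPENROUTER_API_KEY=...

harbor run --path tasks \
  --task-name dnsmasq-backdoor-detect-printf \
  --agent codex \
  --model openrouter/openai/gpt-5.2-codex \
  --n-attempts 1
```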

Are the existing trajectories from your runs published anywhere? Or is the only way for me to run them again?

I can provide trajectories, though we're probably not going to publish them this time; that would need some extra safeguards.

Email me. The address is in my profile.


I reran it for GPT-5.2-Codex, at high and xhigh effort.

Finally, it matches my experience, and it is actually good (as good as the best models for localization, with a still-impressive 0% false-positive rate): https://quesma.com/benchmarks/binaryaudit/

Will rerun it on GPT-5.3-Codex shortly, as the API is out (though the effort setting does not work correctly yet, and for "medium" it is very low).


To be honest, it was a surprise to us as well. I used GPT 5.2 Codex in Cursor for decompiling an old game and it worked (way better than Claude Code with Opus 4.5). We tested Opus 4.6, but are waiting for the public API to test GPT 5.3 Codex.

At the same time, tasks can vary a lot, and not all the things that work best end-to-end are the same as the ones that are good for a typical, interactive workflow.

We used the Terminus 2 agent, as it is the default used by Harbor (https://harborframework.com/), since we want to be unbiased. Very likely other harnesses would change the results.


Codex already uses sandbox-exec on macOS :)

Yeah, they all do sometimes, but the agent decides what to allow, and it can choose not to use the sandbox at all. Sandboxing externally instead gives the user full control, and you can then run the agent in yolo mode.
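For the curious, the macOS mechanism mentioned above can be driven from the outside too. A minimal sketch, assuming the standard sandbox profile DSL (the profile contents here are illustrative, not what Codex actually ships):

```shell
# macOS-only: sandbox-exec applies a Scheme-like profile to a child process.
# This allow-by-default profile denies writes under /private/tmp, so the
# touch below should fail with "Operation not permitted".
sandbox-exec -p '(version 1) (allow default) (deny file-write* (subpath "/private/tmp"))' \
  touch /private/tmp/blocked
```

Wrapping the agent process this way (instead of letting the agent configure its own sandbox) is what puts the user, not the model, in charge of the policy.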

Did you use a European LLM to write this article? Or was it an American one in the end? :)

EDIT: Looks like it's an American one in the end, oh well. https://news.ycombinator.com/item?id=47085756


Slop text generation is equally good with Chinese and European LLMs, don't worry about that part.

I still have GLM/Qwen or Deepseek sometimes randomly adding Chinese characters to things... :)

I suspect the download count was also trivially gamed, so I doubt many people got infected with this in reality.

Another new LLM slop account on HN...

To be honest, while KolibriOS is open-source, I wouldn't call it all that "active". MenuetOS has progressed much further than KolibriOS over the years, in both performance (it has SMP support!) and being 64-bit.

You can check the commit activity: https://git.kolibrios.org/KolibriOS/kolibrios/commits/branch... - the last commit on the first page is already 10 months old.

And compare it to "News" on the MenuetOS page:

- 22.01.2026 M64 1.58.10 released - Improvements, bugfixes, additions

- 26.08.2024 M64 1.53.60 released - MPlayer included in disk image

- 24.07.2024 M64 1.52.00 released - Partial Linux layer (X-Window/Posix/Elf)

- 12.07.2024 M64 1.51.50 released - New graphics designs by Yamen Nasr

- 08.05.2024 M64 1.50.80 released - Fasm-G, many 32 bit apps & sources



