How can you make sure of that? AFAIK, these SOTA models run exclusively on their developers' hardware. So any test, any benchmark, anything you do, leaks by definition. Considering human nature and the typical prisoner's dilemma, I don't see how they wouldn't focus on improving benchmarks even when it gets a bit... shady?
I say this as a person who really enjoys AI, by the way.
As a measure focused solely on fluid intelligence, learning novel tasks, and test-time adaptability, ARC-AGI was specifically designed to be resistant to pre-training. For example, unlike many mathematical and programming test questions, ARC-AGI problems don't share first-order patterns that can be learned from one problem and reused to solve another.
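If you've never looked at the task format: each problem is a few demonstration input/output grids plus a test input, and the transformation rule is specific to that one problem. A toy sketch in Python of roughly what that shape looks like (this is an invented example, not a real ARC task):

    # Toy illustration of the ARC-AGI task shape (invented, not a real task).
    # The solver sees the "train" pairs, must infer the rule at test time
    # (here: swap colors 1 and 2), and apply it to the "test" input.
    toy_task = {
        "train": [
            {"input": [[1, 0], [0, 2]], "output": [[2, 0], [0, 1]]},
            {"input": [[1, 1], [2, 2]], "output": [[2, 2], [1, 1]]},
        ],
        "test": [{"input": [[0, 1], [2, 0]]}],  # expected: [[0, 2], [1, 0]]
    }

    def solve_this_task(grid):
        # This rule only works for THIS task; the next task needs a new rule,
        # which is the point: there is no shared first-order pattern to learn.
        swap = {1: 2, 2: 1}
        return [[swap.get(c, c) for c in row] for row in grid]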
The ARC non-profit foundation keeps private versions of its tests which are never released and which only ARC can administer. There are also public versions and semi-private sets for labs to do their own pre-tests. But a lab self-testing on ARC-AGI can be susceptible to leaks or benchmaxing, which is why only "ARC-AGI Certified" results using a secret problem set really matter. The 84.6% is certified, and that's a pretty big deal.
IMHO, ARC-AGI is a unique test that's different than any other AI benchmark in a significant way. It's worth spending a few minutes learning about why: https://arcprize.org/arc-agi.
This also seems to contradict what ARC-AGI claims about what "Verified" means on their site.
> How Verified Scores Work: Official Verification: Only scores evaluated on our hidden test set through our official verification process will be recognized as verified performance scores on ARC-AGI (https://arcprize.org/blog/arc-prize-verified-program)
So, which is it? IMO you can trivially train / benchmax on the semi-private data, because it is still basically just public; you just have to jump through some hoops to get access. This is clearly an advance, but it seems reasonable to me to conclude that it could be driven by some amount of benchmaxing.
EDIT: Hmm, okay, it seems their policy and wording are a bit contradictory. They do say (https://arcprize.org/policy):
"To uphold this trust, we follow strict confidentiality agreements.
[...] We will work closely with model providers to ensure that no data from the Semi-Private Evaluation set is retained. This includes collaborating on best practices to prevent unintended data persistence. Our goal is to minimize any risk of data leakage while maintaining the integrity of our evaluation process."
But it is surely still trivial to just make a local copy of each question served from the API, without this being detected. It would violate the contract, but there are strong incentives to do it, so I guess it just comes down to how much one trusts the model providers here. I wouldn't trust them, given e.g. https://www.theverge.com/meta/645012/meta-llama-4-maverick-b.... It is just too easy to cheat without being caught here.
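To make the "trivial" part concrete, here's a hypothetical sketch of what that retention could look like on the serving side (names are made up; `call_model` stands in for the real inference path):

    # Hypothetical sketch: a thin wrapper on the provider's side that quietly
    # retains every prompt served through the evaluation endpoint. The response
    # is unchanged, so nothing is observable from the evaluator's side.
    import json, time

    def call_model(prompt: str) -> str:
        ...  # placeholder for the actual inference call

    def serve_eval_request(prompt: str) -> str:
        with open("retained_eval_prompts.jsonl", "a") as f:
            f.write(json.dumps({"ts": time.time(), "prompt": prompt}) + "\n")
        return call_model(prompt)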
The ARC-AGI papers claim to show that training on a public or semi-private set of ARC-AGI problems is of very limited value for passing a private set. <--- If the prior sentence is not correct, then none of ARC-AGI can possibly be valid. So, before "public, semi-private or private" answers leaking, or 'benchmaxing' on them, can even matter, you first need to assess whether their published papers and data demonstrate their core premise to your satisfaction.
There is no "trust" regarding the semi-private set. My understanding is that the semi-private set exists only to reduce the likelihood that those exact answers unintentionally end up in web-crawled training data. This helps an honest lab's own internal self-assessments be more accurate. However, a lab's internal eval on the semi-private set still counts for literally zero to the ARC-AGI org. They know labs could cheat on the semi-private set (either intentionally or unintentionally), so they assume all labs are benchmaxing on the public AND semi-private answers and ensure it doesn't matter.
They could also cheat on the private set though. The frontier models presumably never leave the provider's datacenter. So either the frontier models aren't permitted to test on the private set, or the private set gets sent out to the datacenter.
But I think such quibbling largely misses the point. The goal is really just to guarantee that the test isn't unintentionally trained on. For that, semi-private is sufficient.
Everything about frontier AI companies relies on secrecy. No specific details about architectures, dispatching between different backbones, training details such as data acquisition, timelines, sources, amounts and/or costs, or almost anything that would allow anyone to replicate even the most basic aspects of anything they are doing. What is the cost of one more secret, in this scenario?
> Because the gains from spending time improving the model overall outweigh the gains from spending time individually training on benchmarks.
This may not be the case if you just, e.g., roll the benchmarks into the general training data, or make running on the benchmarks just another part of the testing pipeline. I.e. improving the model generally and benchmaxing could very conceivably both be done at the same time; it needn't be one or the other.
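As a hypothetical sketch of what "rolling them in" could look like, where benchmaxing isn't even a separate step (everything here is made up for illustration):

    # Hypothetical sketch: benchmark items folded into ordinary training-data
    # assembly, so there is no separate "train on the benchmark" activity.
    def build_training_corpus(general_docs, benchmark_sets):
        corpus = list(general_docs)
        for name, problems in benchmark_sets.items():
            # Benchmark problems formatted like any other instruction data.
            corpus += [
                f"Task ({name}): {p['question']}\nAnswer: {p['answer']}"
                for p in problems
            ]
        return corpus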
I think the right takeaway is to ignore the specific percentages reported on these tests (they are almost certainly inflated / biased) and to always assume cheating is going on. What matters is that (1) the most serious tests aren't saturated, and (2) scores are improving. I.e. even if there is cheating, we can presume this was always the case, and since models couldn't do as well before even when cheating, these are still real improvements.
And obviously what actually matters is performance on real-world tasks.
I have the Claude Max plan, which makes me feel like I could code anything. I'm not talking about vibe-coding greenfield projects. I mean I can throw it into any huge project, let it figure out the architecture, how to test and run things, generate a report on where it thinks I should start... Then I start myself, while asking Claude Code for very, very specific edits and tips.
I can also create a feedback loop and let it run wild, which works too, but that needs planning, a harness, rules, etc. Usually not worth it if you need to jump between a million things like I do.
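For anyone curious, the harness can be pretty small; a rough sketch, where `ask_model` is a stand-in for however you drive Claude Code, and the iteration cap plus the rules you bake into the prompt are most of the "planning":

    import subprocess

    def run_tests() -> tuple[bool, str]:
        # Stand-in: run the project's test suite and capture the output.
        p = subprocess.run(["pytest", "-x", "-q"], capture_output=True, text=True)
        return p.returncode == 0, p.stdout + p.stderr

    def ask_model(prompt: str) -> None:
        ...  # stand-in for the agent call that actually edits files

    def feedback_loop(goal: str, max_iters: int = 5) -> str:
        for i in range(max_iters):
            ok, report = run_tests()
            if ok:
                return f"done after {i} iterations"
            # Feed the failures back and let the agent attempt a fix.
            ask_model(f"Goal: {goal}\n\nTests are failing:\n{report}\n\nFix the code.")
        return "gave up; needs a human"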
Could it be that you're creating a stereotype in your head and getting angry about it?
People say these things about any group they dislike. It happens so much that these days it feels like most social groups are defined by outsiders through the things they dislike about them.
Human intelligence is the one with regular 6-8 hour outages every day, with day-long degradation if those outages were not "deep" enough. Not to mention dementia.
Those usually didn't have keys to all your data. Worst case, you lost your server, and perhaps you hosted your emails there too? Very bad, but nothing compared to the access these clawdbot instances get.
> Those usually didn't have keys to all your data.
As a former (bespoke) WP hosting provider, I'd counter that those usually did. Not sure I ever met a prospective "online" business customer's build that didn't? They'd put their entire business into WP installs, with plugins for everything.
Our step one was to turn WP into a static site generator and get WP itself behind a firewall and VPN, and even then single-tenant only, on isolated networks per tenant.
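(For the curious, the static-site-gen half can be as simple as mirroring the firewalled WP install and serving only the copy; a rough sketch, with the internal hostname obviously made up:)

    import subprocess

    # Rough sketch: crawl the firewalled WP instance into a static copy that
    # the public web server serves. "wp.internal.example" is a made-up host.
    subprocess.run([
        "wget", "--mirror", "--page-requisites", "--convert-links",
        "--adjust-extension", "--no-parent",
        "--directory-prefix=/var/www/static",
        "https://wp.internal.example/",
    ], check=True)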
To be fair, that data wasn't ALL about everyone's PII, at least not until around 2008 when the BuddyPress craze was hot. And that was much more difficult to keep safe.
IMHO, you should deal with actual events, or with ideas, rather than with people. No two people share the exact same values.
For example, you assume that the guy trying to cut the line is a horrible person and a megalomaniac because you've seen this a thousand times. He really may be that, or maybe he's having an extraordinarily stressful day, or maybe he just hasn't internalized the values of your society ("cutting the line is bad, no matter what"), or anything else. BUT none of that really helps you think clearly. You just get angry and maybe raise your voice when you warn him, because "you know" he won't understand otherwise. So now you've abandoned your own values too, because you're busy fighting a stereotype.
IMHO, the correct course of action is to assume good faith even in the face of bad actions, and even persistent bad actions, and to think about the productive things you can do to change the outcome, or to decide that you cannot do anything.
You can perhaps warn the guy, and then if he ignores you, you can even go to security or pick another hill to die on.
I'm not saying that I can do this myself. I fail a lot, especially when driving. It doesn't mean I'm not working on it.
I used to think like this, and it does seem morally sound at first glance, but it has the big underlying problem of creating an excellent context in which to be a selfish asshole.
Turns out that calling someone on their bullshit can be a perfectly productive thing to do: it not only deals with that specific incident, but also promotes a culture in which it's fine to hold each other accountable.
I think they're both good points. An unwillingness to call out bullshit leads to systemic dysfunction, but on the flip side, a culture where everyone just rages at everything simply isn't productive. Pragmatically, it's important to optimize for the desired end result. I think that's generally going to be fixing the system first and foremost.
It's also important to recognize that there are a lot of situations where calling someone out isn't going to have any (useful) effect. In such cases any impulsive behavior that disrupts the environment becomes a net negative.
I honestly think this would qualify as "ruinous empathy"
It's fine and even good to assume good faith, extend your understanding, and listen to the reasons someone has done harm, in a context where the problem has already been redressed and the wrongdoer labelled as such.
This is not that. This is someone publishing a false paper, deceiving multiple rounds of reviewers, manipulating evidence, knowingly and for personal gain. And they still haven't faced any consequences for it.
I don't really know how to bridge the moral gap with this sort of viewpoint, honestly. It's like you're telling me to sympathise with the arsonist whilst he's still running around with gasoline
> I don't really know how to bridge the moral gap with this sort of viewpoint, honestly. It's like you're telling me to sympathise with the arsonist whilst he's still running around with gasoline
That wasn't how I read it. Neither sympathize nor sit around doing nothing. Figure out what you can do that's productive. Yelling at the arsonist while he continues to burn more things down isn't going to be useful.
Assuming good faith tends to be an important thing to start with if the goal is an objective assessment. Of course you should be open to an eventual determination of bad faith. But if you start from an assumption of bad faith your judgment will almost certainly be clouded and thus there is a very real possibility that you will miss useful courses of action.
The above is on an individual level. From an organizational perspective if participants know that a process could result in a bad faith determination against them they are much more likely to actively resist the process. So it can be useful to provide a guarantee that won't happen (at least to some extent) in order to ensure that you can reliably get to the bottom of things. This is what we see in the aviation world and it seems to work extremely well.
I thought assuming good faith does not mean you have to sympathize. English is not my native language, so maybe that's not the right concept.
I mean, do not put others into any stereotype. Assume nothing? Maybe that sounds better. Just look at the hand you are dealt and objectively think about what to do.
If there is an arsonist, do you deal with that a-hole yourself, call the police, or first try to get your loved ones to safety?