Hacker News | rakejake's comments

>> Test it yourself, GPT 120B OSS is cheap and available. BTW, this is why with this bug, the stronger the model you pick (but not enough to discover the true bug), the less likely it is it will claim there is a bug.

I guess this is the crux of the debate. All the claims are comparing models that are available freely with a model that is available only to limited customers (Mythos). The problem here is with the phrase "better model". Better how? Is it trained specifically on cybersecurity? Is it simply a large model with a higher token/thinking budget? Is it a better harness/scaffold? Is it simply a better prompt?

I don't doubt that some models are stronger than others (a Gemini Pro or a Claude Opus has more parameters and a larger context window, and was probably trained for longer and on more data than its smaller counterpart, Flash or Sonnet respectively).

Unless we know the exact experimental setup (which in this case is impossible because Mythos is completely closed off and not even accessible via API), all of this is hand wavy. Anthropic is definitely not going to reveal their setup because whether or not there is any secret sauce, there is more value to letting people's imaginations fly and the marketing machine work. Anthropic must be jumping with joy at all the free publicity they are getting.


In the Anthropic Mythos model cards they explicitly remarked that they didn't want Mythos to be specifically good at security. They trained it to be good at coding, and as a side effect the model is (obviously) good at security. This is what happens with flesh hackers too, mostly. Hackers are very good programmers; as a side effect they understand systems well enough that their understanding has security implications.

Model cards are just marketing material. I wouldn’t trust them one bit.

You don't need to trust anyone. GPT 5.4 xhigh is available and you can test it for $20, to verify it is actually able to find complex bugs in old codebases. Do the work instead of denying AI can do certain things. It's a matter of an afternoon. Or, trust the people that did this work. See my YouTube video where I find tons of Redis bugs with GPT 5.4.

I did not claim or deny anything. You cited the model card, I just pointed out that this is no reliable source. If you have better sources, like your YT video, you should cite those instead.

You are claiming something: that the model card is not reliable, therefore it's as useful as nothing. Sowing doubt without a possible solution adds little value to the conversation. Moreover, your rebuttal is unsubstantiated.

Guys, think about all the security vulnerabilities you're aware of; now, think about how many of those you know how to technically reproduce. Now imagine that you actually don't know how to reproduce most things and you'll never actually be able to judge the result.

Well, just because these are all AI people doesn't mean they verified enough of the output of these models to actually support the significant security implications they're advertising.


The whole discussion started out as an attempt to disprove/verify Anthropic's (model card) claims.

He also transfers the logic of their claims to the actual real world. You can say that model cards are marketing garbage. You have to prove that experienced programmers are not significantly better at security.


> You have to prove that experienced programmers are not significantly better at security.

That has not been my experience. It's true that they are "better at security" in the sense that they know to avoid common security pitfalls like unparameterized SQL, but essentially none of them have the ability to apply their knowledge to identify vulnerabilities in arbitrary systems.
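To make the "common pitfall" concrete, here's a minimal sketch of the unparameterized-SQL problem using Python's stdlib sqlite3 and a made-up `users` table (the schema and input are illustrative, not from any system discussed above):

```python
import sqlite3

# Hypothetical in-memory database just for demonstration
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, role TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [("alice", "admin"), ("bob", "user")])

attacker_input = "nobody' OR '1'='1"

# Vulnerable: string interpolation lets the input rewrite the query itself
unsafe = conn.execute(
    f"SELECT name FROM users WHERE name = '{attacker_input}'"
).fetchall()
print(unsafe)  # the injected OR '1'='1' clause matches every row

# Safe: a bound parameter is treated strictly as data, never as SQL
safe = conn.execute(
    "SELECT name FROM users WHERE name = ?", (attacker_input,)
).fetchall()
print(safe)  # [] -- no user is literally named that string
```

Knowing to always take the second form is exactly the kind of checklist knowledge most experienced developers have, which is distinct from being able to hunt for such flaws in an unfamiliar codebase.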


An expert-level human doesn't have to be an expert at every programming category. A webdev wouldn't spot a use-after-free. A systems engineer wouldn't know about CSRF. That is, if neither researches security beyond their field. Requiring a programmer to apply their knowledge to an arbitrary system is asking too much. On the other hand, an LLM can be expert level in every programming field, able to spot and combine vulnerabilities creatively. That is all pretty hard, and I don't think a security expert with vast knowledge would say "that's easy".

My point is that more experienced programmers are better at security on average, not that they are security experts.


I would think pwn2own competitions would signal the opposite. I'm consistently amazed at how a unique combination of exploits can yield a larger exploit, often in ways that most wouldn't even consider. I think it takes a level of knowledge, experience, creativity and paranoia to be really good with security issues all around as a person.

> essentially none of them have the ability to apply their knowledge to identify vulnerabilities in arbitrary systems.

I've found it to be the opposite. Many of them do have the ability to apply their knowledge in that fashion. They're just either not incentivised to do so, or incentivised to not do so.


And overfitted benchmarks can easily be gamed. Yet here we are with the top HN comment on the HN Mythos thread outlining its benchmarking performance gains.

I guess we'll never learn.


But they are treated as holy scripture ...

> Hackers are very good programmers

This does not match my experience.


The missing part of their intended meaning is "skilled hackers". Unskilled hackers are everywhere, and they're bad at programming, but so are unskilled programmers.

>>> the model is (obviously) good at security

Out of curiosity, are you one of the people who has access to the model? If yes, could you write about your experimental setup in more detail?


If it's really more expensive per token, it might have more parameters and is then able to hold more context/scope of code.

Rumors say it has 10 trillion parameters vs. 1 trillion.


Yes, that does track with my personal experience. More context, more params and no quantization is probably it. But my hunch is that all the training data they've been getting in the past year also plays a part here. More than any other lab, Anthropic's focus on coding right from the beginning gives them access to the best training data (several GitHubs' worth). Most of this code comes with human feedback, and Anthropic even has data on how many changes went to production, got reverted, etc. No need to pay for human labeling when your customers are doing it for you. This is their secret sauce.

Mythos isn't restricted for marketing purposes - that would be incredibly dumb because Anthropic would be giving up first mover advantage for next gen models.

It's restricted because it's genuinely good at finding vulnerabilities, and employees felt that it's not a good idea to give this capability to everyone without letting defenders front-run.

That's it. That's all there is to it. It is not some grand marketing play.


>It's restricted because it's genuinely good at finding vulnerabilities, and employees felt that it's not a good idea to give this capability to everyone without letting defenders front-run.

It's a possibility, but it doesn't eliminate the possibility that it's hype. If these claims were indeed serious, they would submit it for independent analysis somewhere.

This isn't some crazy process. Defense contractors are required to submit their systems (secret sauce and all) for operational test and evaluation before they're fielded.


> If these claims were indeed serious, they would submit it for independent analysis somewhere.

They have. 40 different companies that have all committed resources to patching their systems based on vulnerabilities found by Mythos. One of them, Google, is a frontier AI lab that pointedly did not say that their own models have found similar vulnerabilities.

> Defense contractors are required to submit their systems (secret sauce and all) for operational test and evaluation before they're fielded.

Does this look something like having 40 separate companies look at the outputs of the system, deciding that it’s real and they should do something about it, and committing resources to it?

At some point, “cynicism” is another word for “lalala can’t hear you”.


Another cross-check I've run is, are the claims Anthropic is making for Mythos that out of line with the current status of AI coding assistants?

To which my answer is clearly, no, not even remotely. If Anthropic is outright lying about what Mythos can do, someone else will have it in a year.

In fact the security world would have to seriously consider the possibility that even if Mythos didn't exist, nation states have the equivalent in hand already. And of course, if Mythos does exist, nation states have it now. The odds that Anthropic (and every other AI vendor) isn't penetrated enough by every major intelligence agency such that they have access to their choice of model approach zero.

I wonder about the overlap between people being skeptical of Mythos' capabilities, and those who are too skeptical of AI to have spent any time with it because they assume it can't be any good. If you are not aware of what frontier models routinely do, you may not realize that Mythos is just an evolution of existing capabilities, not a revolution. Even just taking a publicly-available frontier model, pointing it at a code base and telling it to "find the vulnerabilities and write exploits" produces disturbingly good results. I can see the weaknesses referenced by the Mythos numbers, especially around the actual writing of the exploits, but it's not like the current frontier models fall on their face and hallucinate wildly for this task. Most everything they produce when I try this is at least a "yeah, that's worth thinking about" rather than an instant dismissal.


Sure, I am not precluding the possibility that they've trained a genuinely great model. All I am saying is that the "this model better than that model" is moot when on one side you have model weights, and on the other side a whitepaper and some accompanying comments on the danger.

I'm not that old but have been here long enough that I remember when GPT-3 was considered too dangerous to release. Now you have models 10x as good, 1/10th the size and run on 8GB VRAM.


That safety stuff is almost always quacks whose job it is to exaggerate LLMs at their nonprofits, or marketing hype that "our models are so powerful you should fear them". Then they release them and the world moves on and adapts.

Mythos will benefit security in the long run more than hackers, if it can do what they claim. And there's nothing that will stop an LLM like it from being released in the near term so it's very likely just resource constraints or marketing


I don't think you can say this with confidence, outside-in. It's not just about safety. The additional unknown is cost - I don't just mean API cost, but fully loaded cost for a given task. Is the model cost effective for tasks such that it has product market fit?

We don't yet know if Mythos was a level shift in the capability/cost frontier, or a continued extension of the same logarithmic capability/cost curve.


Some people have access to the model for red team purposes as part of Glasswing and they came away quite spooked according to what I heard

I don't doubt it, I just mean the decision to release/not release generally may also be informed by the commercial/economic viability of the model for general usage patterns versus extremely high value patterns like vulnerability assessment

If it wasn't marketing it wouldn't have fancy branding... It wouldn't even be announced.

Or, they created the illusion that it's restricted for security reasons, but in reality they just lack the resources necessary for widespread use!

it seems likely it's both a better model to some unknown extent and doing this "we have to give it to the defenders first" thing is super great marketing material. it seems an entirely natural marketing campaign "announce that we can't even give the model to everyone at first, it's so great!", plus there's some truth to it, even better.

unless you are an employee at anthropic and shouldn't be talking about any of this at all, there's no way to know what the model's capabilities are.


How do you know? If you have access you are not unbiased, otherwise you cannot know by definition.

AI companies routinely claim that something is too dangerous to release (I think GPT-2 was the first case) for marketing reasons. There are at least 10 documented high profile cases.

They keep it secret because they now sell to the MIC with China and North Korea bullshit stories as well as to companies who are invested in the AI hype themselves.


I prefer a more cautious approach than the Musk style where stuff gets fixed after.

And with GPT-2 the worry was mass emails that were a lot better, more detailed and personal, social media campaigns, etc.

How many bots are deployed today on X, influencing democracy around the globe?

It's fair to say it had an impact, and LLMs still do.


> How do you know? If you have access you are not unbiased, otherwise you cannot know by definition.

The platonic ideal of how to dismiss any argument by anyone about anything.


GPT-2 was obviously too dangerous to release at the time! It's OK-ish now, when the knowledge that AI can produce arbitrary text is widely shared. It would have been a disaster for scammers and phishers to get GPT-2 at a time when almost everyone still assumed that large volumes of detailed text proved there's a real human being on the other end of the conversation.

And, as we all know, humans can't be scammers. They need the robots to lie.

Maybe they did use small models, but you couldn't make the front page of HN with something like this until Anthropic made a big fuss out of it. Or perhaps it is just a question of compute. Not everyone has $20k or the GPU arsenal to task models with finding vulnerabilities which may or may not be correct.

Unless Anthropic makes it known exactly what model + harness/scaffolding + prompt + other engineering they did, these comparisons are pointless. Given the AI labs' general rate of doomsday predictions, who really knows?


papers are always coming out saying smaller models can do these amazing and terrifying things if you give them highly constrained problems and tailored instructions to bias them toward a known solution. most of these don't make the front page because people are rightfully unimpressed

The word "profound" is a bit overused when it comes to movies. I agree that The Battle of Algiers is an excellent film, one of the best ever made even. One Battle After Another is also excellent but it is not really political in the way the TBoA is. It uses a political setting very effectively in a chase thriller. A movie like The Parallax View is a better comparison. That movie used the post-60s paranoia very effectively in a great suspense thriller.


Yeah, I'm watching a lot of Charlie Chaplin movies in preparation for my new role as a tramp.


I agree on principle. But there is going to be a painful transition where people are still reckoning with the new capabilities so I understand where all the fear/sadness is coming from.

Tbf I think the golden days of being a software dev are over even if the AI were to stagnate and never improve. The spectre of AGI is enough for higher ups to demand more output which will in turn require more hours to be put in by devs. A project that required 2 months will now be allotted 3 weeks because "Agentic coding increases productivity".


> I don't know if the market will have fully internalized that knowledge soon enough

This exactly. I am neither a boomer nor a doomer. It has helped a lot both at work and in accelerating my personal projects. But now that the C-suite and middle management has jumped on the agentic bandwagon, I'm unsure where this will go and what casualties ensue. At the very least, in the short term there's going to be a lot of "Now that we have agents, this project should be achievable in half the time".


This started decently enough and then the author went all over the place. I'm not sure why detailed explanations of neural nets and smart contracts were needed here. It really feels like trying to ram in a tech solution for what is effectively a market/social problem.

Using computers to aid in designing is not specific to Kanchipuram saris. While I realize people always approach it from the POV of saving a dying art, I'm unsure if K.saris can really fall under that umbrella. Clearly the demand is there and the issues here arise due to inefficient and possibly corrupt market practices rather than the art itself dying. A lot of space was used to explain the lopsided economics on the supply side but there's not enough attention paid to the demand side and the marketplace dynamics.


Excellent comment that really gets to the crux of the matter. Countries like China and India see themselves as civilizational; America sees itself as a perfect marketplace - it exists to feed its customers' wants and whims as efficiently as possible. I don't necessarily mean this in a demeaning way, it is what it is. In some sense, America is a state-level example of hedonic adaptation, with its positives being improvements in quality of life and development of new tech, negatives being a bully in world politics, endless wars and bloodshed.

In general, hedonic adaptation ends either with internal introspection (shifting from pleasure to purpose) or an external disruption. In America's case, the former is extremely unlikely IMHO - the American people will not put their money where their mouth is because they enjoy the wealth generated this way. It will be up to external disruptors to check Uncle Sam's endless thirst.


As long as people all over the world are using ChatGPT and GMail, they have all the intel needed to control the world, just like they won wars in the 1800s by having all telegrams go through them.

China is their only competitor, but so far people clearly prefer to chat with AI companies from USA.


I often wonder why Satya Nadella is so venerated on HN compared to say, Cook or Pichai. As innovators, MS lags way behind both Google and Apple. I can't think of one bleeding edge product released during Satya's tenure. Say what you will about Apple and Google, they still consistently put out products that make you sit up and pay attention. What has MS been doing other than squeezing the MS Office and Azure cash cows?


Nadella is obviously a very smart and successful business leader. He achieved his goals and transformed Microsoft into a very successful, healthy company. This is why I personally think he isn’t just a bland idiot like for example Steve Ballmer.

However, it’s clear that Nadella’s goals are everything but noble. He doesn’t care about the product, and he really doesn’t care about the customer. He only cares about number go up.


Ballmer doesn't strike me as an idiot and definitely not bland. He's one of the more colorful tech personalities. MS's almost unassailable lead in enterprise could be attributed to him, and the pivot to cloud could not have happened without this. But he definitely fumbled hard on mobile (Windows Phone), Surface (IIRC the initial ARM laptop was a major flop with close to a $1B write-off) and the disaster that was the Nokia acquisition. I'd say he left at the right time, just as it was becoming clear that MS's bets on Windows Phone and hardware in general weren't paying off.


That's a pretty massive fumble. He was effectively trying to convert Microsoft into a hardware company; Satya marked the return to pure software, and Microsoft is now the biggest SaaS company in the world.


Rather than think of it as a pivot to hardware, I looked at it as MS trying to corner their share in the consumer market. Mobile and Social were the hot things back then and mobile threatened MS's dominance of the OS market. MS ultimately failed but they still owned the enterprise market and continue to keep their lead in desktop market share.


Nadella plays on level easy.

Too many companies are dependent on MS products so it doesn’t matter how bad it gets.

MS rather disables features instead of fixing security issues and now puts AI in everything in a desperate attempt to force the users to use it.


I assumed he was a product guy until I heard him on Dwarkesh's podcast. He does seem to really only get fired up about numbers going up, and customers are a vehicle for that.


For example, he made the back-then very, very brave decision to completely get rid of Windows as the leading Microsoft brand. He had a very clear vision for Microsoft and the industry, even if the outcome is not super exciting products for you and me. He’s not squeezing Azure - he was the person who made Azure into what it is now.

So he changed Microsoft fundamentally - a very difficult thing for such a large company.

I don’t see Pichai changing Google so fundamentally. I admire Cook though.


> I don’t see Pichai changing Google so fundamentally. I admire Cook though.

Well he did change Google fundamentally. Imagine being so dense you're fumbling to a competitor built on a technology that you innovated.

That being said, I'm still long Google because they're the tortoise. And this is one of those races where slow and steady might actually win. And while I was a strong critic of Pichai on a lot of fronts (just check my past comments!), he still must be given due credit for his measured approach and for navigating Google through some of the roughest regulatory environments, and for leaving Google relatively unscathed.


My point was more that MS hasn't had an industry changing product in a while. Google became joint-SOTA in AI and seems poised to take the crown with the next Gemini, and also in self-driving cars and quantum computing. They've kept their cash cows going while also being up to date on the tech that might upend their business model, so in a way they've cracked the innovator's dilemma which is definitely not an easy thing to do. A lot of HNers even wrote them off after ChatGPT and the disastrous Bard. Apple has a successful mass product in Airpods, a moonshot in Vision Pro and the insane Apple Silicon which they executed over more than a decade.

Nadella did well in the last decade to consolidate the MS stack (Teams, Azure, Office) and to invest in OpenAI when he realized MS's internal efforts wouldn't yield the expected output. He has protected their turf and made some strategic acquisitions like LinkedIn and GitHub to keep their lead in enterprise software. From the POV of Wall Street performance and stock returns, he is definitely a great CEO, but so are Cook, Pichai, even Ellison.


Other commenters are raking Apple over the coals for bad experiences with MacOS. By the same token, Windows 11 is beyond awful. It's a complete buggy mess, never mind the secure boot restrictions.


I'm just incredibly disappointed with how Windows has ended up

As the CEO of Microsoft, he must use Windows, right? Unless he has a Mac

Like how can he use that ad-riddled mess every day and think it's fine, knowing he could make it so much better?


VSCode revolutionized the IDE


I guess Atlas is a good name for a web browser. But I'm surprised their first release is Mac only. Does it indicate they are targeting some kind of power user (programmers, creatives etc) or is it just the first platform they could ship by the deadline?

Will they be able to take any significant marketshare from Chrome? I suppose only time will tell but it will be a pretty hard slog especially since Chrome is pretty much synonymous with "browser" in most of the world. Still, I don't think anyone at Google is breathing easy.


They probably just wanted to get it out ASAP with whatever OS they targeted first. Their launch post says:

> Experiences for Windows, iOS, and Android are coming soon.

https://openai.com/index/introducing-chatgpt-atlas/

