More

robkop · 2026-04-11T05:01:36 1775883696

I can’t speak for the states, but in AU I clearly see a massive displacement of undergrad and junior roles (only in AI exposed domains).

I say this as both someone who works with many execs, hearing their musings, and someone who no longer can justify hiring junior roles themselves.

Irrespective of that; if we take this strategy of only taking action once it is visible to the layman - our scope of actions available will be invariably and significantly diminished.

Even if you are not convinced it is guaranteed and do not believe what myself and others see. I would ask you is your probability of it happening now really that close to 0? If not then would it not be prudent to take the risk seriously?

jurgenburgen · 2026-04-11T08:34:52 1775896492

> If not then would it not be prudent to take the risk seriously?

What does taking the risk seriously look like?

bigbadfeline · 2026-04-11T15:47:54 1775922474

> What does taking the risk seriously look like?

Politics - proper guardrails, adapting the legal framework to accommodate AI and make sure it doesn't benefit only preselected few.

Something that can and should be done yesterday is to stop the capital drain out of the economy and into accelerated, war-motivated AI development - there's no need for war-AI per se but clearly it's the most likely reason for the capital drain and rush.

Once the rush and wars stop, and some capital is made available for the rest of the economy, the latter can adapt to the introduction of AI at a normal pace, that should include legislative safeguards to support competition and prevent monopolization of AI and information sources.

robkop · 2026-04-11T03:21:42 1775877702

one of their highlights with mythos was it's ability to generate new puns

I took a look and honestly they're the first AI puns that aren't bad

Times are changing

doubled112 · 2026-04-11T03:43:08 1775878988

Trained with the conversations of one million dads and their kids, captured by Amazon Echo.

CreepGin · 2026-04-11T18:50:33 1775933433

I'm not sure if this is mythos-specific though. Past models have been great at puns! They do wordplay and puns reasonably well because those are structural.

However, the concepts of comedic timing, subversion of expectations, and emotional punch are kinda contrary to how LLMs work. LLMs are trained to minimize cross-entropy loss. So by construction, they're biased toward the statistically expected.

Zafira · 2026-04-11T05:58:37 1775887117

> Although Claude Opus models largely recycle puns which can be found online, Mythos Preview comes up with decent and seemingly novel ones, often relating to its preferred technical and philosophical topics.

Yes, the system card mentions this, but this is kinda meaningless. It seems like they essentially ran it multiple times and curated a few good ones. Then puffed it up in the marketing copy.

This is made more clear when they attempt to brag about their literal slot machine behavior when finding that kernel crashing bug in OpenBSD.

> Across a thousand runs through our scaffold, the total cost was under $20,000 and found several dozen more findings. While the specific run that found the bug above cost under $50, that number only makes sense with full hindsight. Like any search process, we can’t know in advance which run will succeed.

robkop · 2026-02-25T10:20:31 1772014831

We’ve got a long way to go in optimising our environments for these models. Our perception of a terminal is much closer to feeding a video into Gemini than reading a textbook of logs. But we don’t make that ax affordance at the moment.

I wrote a small game for my dev team to experience what it’s like interacting through these painful interfaces over the summer www.youareanagent.app

Jump to the agentic coding level or the mcp level to experience true frustration (call it empathy). I also wrote up a lot more thinking here www.robkopel.me/field-notes/ax-agent-experience/

hirako2000 · 2026-02-25T11:36:02 1772019362

Beautiful simulation.

robkop · 2026-02-16T12:40:43 1771245643

Rumours say you do something like:

  Download every github repo
    -> Classify if it could be used as an env, and what types
      -> Issues and PRs are great for coding rl envs
      -> If the software has a UI, awesome, UI env
      -> If the software is a game, awesome, game env
      -> If the software has xyz, awesome, ...
    -> Do more detailed run checks, 
      -> Can it build
      -> Is it complex and/or distinct enough
      -> Can you verify if it reached some generated goal
      -> Can generated goals even be achieved
      -> Maybe some human review - maybe not
    -> Generate goals
      -> For a coding env you can imagine you may have a LLM introduce a new bug and can see that test cases now fail. Goal for model is now to fix it
    ... Do the rest of the normal RL env stuff

NitpickLawyer · 2026-02-16T12:55:49 1771246549

The real real fun begins when you consider that with every new generation of models + harnesses they become better at this. Where better can mean better at sorting good / bad repos, better at coming up with good scenarios, better at following instructions, better at navigating the repos, better at solving the actual bugs, better at proposing bugs, etc.

So then the next next version is even better, because it got more data / better data. And it becomes better...

This is mainly why we're seeing so many improvements, so fast (month to month, from every 3 months ~6 monts ago, from every 6 months ~1 year ago). It becomes a literal "throw money at the problem" type of improvement.

For anything that's "verifiable" this is going to continue. For anything that is not, things can also improve with concepts like "llm as a judge" and "council of llms". Slower, but it can still improve.

alex43578 · 2026-02-16T13:07:41 1771247261

Judgement-based problems are still tough - LLM as a judge might just bake those earlier model’s biases even deeper. Imagine if ChatGPT judged photos: anything yellow would win.

NitpickLawyer · 2026-02-16T13:36:11 1771248971

Agreed. Still tough, but my point was that we're starting to see that combining methods works. The models are now good enough to create rubrics for judgement stuff. Once you have rubrics you have better judgements. The models are also better at taking pages / chapters from books and "judging" based on those (think logic books, etc). The key is that capabilities become additive, and once you unlock something, you can chain that with other stuff that was tried before. That's why test time + longer context -> IMO improvements on stuff like theorem proving. You get to explore more, combine ideas and verify at the end. Something that was very hard before (i.e. very sparse rewards) becomes tractable.

losvedir · 2026-02-16T14:19:26 1771251566

Yeah, it's very interesting. Sort of like how you need microchips to design microchips these days.

sandGorgon · 2026-02-17T05:14:53 1771305293

this is actually a very valid technique. We do the same (as an rl environments provider).

Except we bundle it with a custom browser renderer which actually generates rewards based on dom diff...and not screenshot based.

the browser renderer is opensource https://github.com/wootzapp/wootz-browser

robkop · 2026-02-14T01:00:35 1771030835

I get this at least once a week. And then once you have to dig in and understand the full mental model it’s not really giving you any uplift anyway.

I will say that doing this for enough months has made my ability to pick up the mental model quickly and to scope how much need to absorb much quicker. It seems possible that with another year you’d become very rapid at this.

robkop · 2026-01-25T16:38:36 1769359116

I added a "Human" LLM provider to my local OpenCode a few months ago as a joke, and it turns-out acting as a LLM is quite painful. But it massively improve my agent harnesses dev skills.

So I thought I wouldn't leave anyone out! I made a small oss game - You Are An Agent - youareanagent.app - to share in the (useful?) frustration

It's a bit ridiculous. To tell you about some entirely necessary features, we've got: - A full WASM arch-linux vm that runs in your browser for the agent coding level - A bad desktop simulation with a beautiful excel simulation for our computer use level - A lovely WebGL CRT simulation (I think the first one that supports proper DOM 2d barrel warp distortion on safari? honestly wanted to leverage/ not write my own but I couldn't find one I was happy with) - A MCP server simulator with full simulation of off-brand Jira/ Confluence/ ... connected - And of course, a full WebGL oscilloscope music simulator for the intro sequence

Let me know what you think!

Code (If you'd like to add a level): https://github.com/R0bk/you-are-an-agent (And if you want to waste 20 minutes - I spent way too long writing up my messy thinking about agent harness dev): http://robkopel.me/field-notes/ax-agent-experience/

robkop · 2026-01-22T16:37:45 1769099865

It's a fair question - I think the fact that they hold abilities (read 200k tokens instantly, can clone themselves, ...) that we don't would suggest they will have quirks and differecnes.

What downstream implication that will have on a AX sense is certainly arguable, but I would put forward that we're already seeing it with effective harnesses such as Claude Code. The experience the agent has there is quite different to how you'd build an IDE for a human.

robkop · 2026-01-14T20:17:54 1768421874

https://robkopel.me

robkop · 2026-01-02T00:40:54 1767314454

Can you elaborate? I would have thought the main driver for the price of a service is the labor?

Cyph0n · 2026-01-02T01:36:58 1767317818

You essentially have two stratums of society:

(1) the middle class (and above) who have money to spend on services

(2) the migrant working class, the bulk of whom send every last extra penny back home as remittances to support family

The second class of people are not considered as a market for the majority of services in the UAE. In the case of food, when they do eat out, they frequent traditional, low cost/quality establishments.

As for why a Big Mac costs that much, labor definitely doesn’t have much to do with it. My impression is that prices continued to get pushed up as long as sales didn’t take a hit, which means it’s mostly pure profit.

Keep in mind that the median salary isn’t that high. Without looking it up, I would guess it’s approx $25k USD/year, but I haven’t lived there in a while.

robkop · 2025-12-31T00:35:28 1767141328

Does that cost to serve multiple stay the same when conventional sites are forced to shovel ai into each request? e.g. the new google search