Some valid points, but I wish the authors had developed them more.
On the semantic gap between the original software and its representation in the ITP, program extraction as in Rocq probably deserves some discussion: there the software is written natively in the ITP, and you have to prove the extraction itself sound. For example, MetaRocq did this for Rocq.
For the how-far-down-the-stack problem, there are some efforts from https://deepspec.org/, but it's inherently a difficult problem and often gets less love than the lab-environment projects.
This specific example seems to me less a consequence of model collapse than of a "personality" adjustment in how aggressively the model should read into the user's intention.
From time to time, I enjoy the model guessing what I meant rather than what I wrote. For example, "Find the backend.py" can be auto-corrected into "find the app.py".
> But let's hit the random button on wikipedia and pick a sentence, see if you can draw a picture to convey it, mm?
The inverse is also difficult. Pick a random 15-second movie clip: how do you describe it in text without losing much of its essence? Can one really port a random game to a text version? Can a pilot fly a plane with a text-based instrument panel?
Text is not a superset of all communication media. They are just different.
Most of the time, commercial aviation involves textual interaction[1] to determine what the aircraft does. Aviation is rife with plain text, usually upper case for better legibility[2].
The program used to check the validity of a proof is called a kernel. It just needs to check one step at a time, and the possible steps it can take are just basic logic rules. People can gain more confidence in its validity by:
- Reading it very carefully (doable since it's very small)
- Having multiple independent implementations and comparing their results
- Proving it in some meta-theory. Here the result is not correctness per se, but relative consistency. (Although it can be argued all other points are about relative consistency as well.)
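To make the "check one step at a time" idea concrete, here is a toy kernel for a propositional Hilbert-style system with modus ponens as its only rule. This is an illustrative sketch, not any real ITP's kernel; the encoding (strings for atoms, `("->", A, B)` tuples for implications) is made up for the example.

```python
def check(proof, axioms):
    """Return True iff every proof line is an axiom or follows from two
    earlier lines by modus ponens: from A and A -> B, conclude B."""
    derived = []
    for step in proof:
        ok = step in axioms or any(
            imp == ("->", prem, step)      # some earlier line implies this step
            for prem in derived
            for imp in derived
        )
        if not ok:
            return False
        derived.append(step)
    return True

# Tiny example: from axioms P and P -> Q, derive Q.
axioms = {"P", ("->", "P", "Q")}
print(check(["P", ("->", "P", "Q"), "Q"], axioms))  # True
print(check(["Q"], axioms))                          # False: Q isn't justified
```

The point is how little the checker has to do: each step is verified against earlier steps by a fixed rule, which is why a real kernel can stay small enough to audit by hand.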
Checking the validity of a given proof is deterministic, but filling in the proof in the first place is hard.
It's like chess: checking who has won for a given board state is easy, but coming up with the next move is hard.
Of course, one can try all possible moves and see what happens. Similar to chess AIs based on search methods (e.g. minimax), there are proof search methods. See the related work section of the paper.
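The "try all possible moves" idea can be sketched as a brute-force search: repeatedly apply the inference rule to everything known and see whether the goal ever appears. Again a toy encoding (strings for atoms, `("->", A, B)` for implications), not a real prover.

```python
from itertools import product

def search(axioms, goal, max_rounds=10):
    """Brute-force proof search: saturate the known facts with modus
    ponens, like exhaustively exploring moves in a game tree."""
    known = set(axioms)
    for _ in range(max_rounds):
        new = {
            imp[2]                              # conclusion B of A -> B
            for prem, imp in product(known, known)
            if isinstance(imp, tuple) and imp[:2] == ("->", prem)
        }
        if goal in known | new:
            return True
        if new <= known:
            return False  # fixed point reached, goal unreachable
        known |= new
    return False

axioms = {"P", ("->", "P", "Q"), ("->", "Q", "R")}
print(search(axioms, "R"))  # True: P => Q => R
print(search(axioms, "S"))  # False
```

Real proof search is vastly harder because the space of possible steps explodes, which is exactly why heuristics (and now LLMs) are interesting here.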
> imagine a folder full of skills that covers tasks like the following:
> Where to get US census data from and how to understand its structure
Reminds me of my first time using Wolfram Alpha and getting blown away by its ability to use actual structured tools to solve the problem, compared to a normal search engine.
tbh wolfram alpha was the craziest thing ever. haven't done much research on how this was implemented back in the day but to achieve what they did for such complex mathematical problems without AI was kind of nuts
I doubt that, if the underlying parts changed, anyone outside the industry or enthusiast circles would know what that is. How many people know what kind of engine is in their car? I stomp on the floor of my Corolla and away we go! Others might know that their Dodge Challenger has a Hemi. What even is that? Thankfully we have the Internet these days, and someone who's interested can just select the word and right-click to Google for the Wikipedia article on it. AI is just such an entirely undefined term colloquially that any attempt to define it will be wrong.
I think the difference now is that traditional software ultimately comes down to a long series of if/then statements (also the old AI's like Wolfram), whereas the new AI (mainly LLM's) have a fundamentally different approach.
Look into something like Prolog (~50 years old) to see how systems can be built from rules rather than if/else statements. It wasn't all imperative programming before LLMs.
If you mean that it all breaks down to if/else at some level then, yeah, but that goes for LLMs too. LLMs aren't the quantum leap people seem to think they are.
Yeah, the result is pretty cool. It's probably how it felt to eat pizza for the first time. People had been grinding grass seeds into flour, mixing with water and putting it on hot stones for millennia. Meanwhile others had been boiling fruits into pulp and figuring out how to make milk curdle in just the right way. Bring all of that together and, boom, you have the most popular food in the world.
We're still at the stage of eating pizza for the first time. It'll take a little while to remember that you can do other things with bread and wheat, or even other foods entirely.
Would really like something self-hosted that does the basic Wolfram Alpha math things.
Doesn't need the craziest math capability, but standard symbolic stuff: expression simplification, differentiation and integration of common expressions, plotting, unit wrangling.
All with an easy-to-use text interface that doesn't require learning a special syntax.
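SymPy covers a lot of this wishlist and is trivially self-hostable (assumes `pip install sympy`); `sympify` even parses plain-text input, though not quite as forgivingly as Wolfram Alpha's natural-language front end.

```python
import sympy as sp

x = sp.symbols("x")

# Parse a plain-text expression and simplify it.
expr = sp.sympify("sin(x)**2 + cos(x)**2 + x**2")
print(sp.simplify(expr))           # x**2 + 1

# Differentiation and integration of common expressions.
print(sp.diff(x * sp.sin(x), x))   # x*cos(x) + sin(x)
print(sp.integrate(sp.exp(x), x))  # exp(x)
```

Plotting (`sympy.plotting`) and unit wrangling (`sympy.physics.units`) exist as well, so a small REPL wrapper around this gets surprisingly close to the "basic Wolfram Alpha" experience.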
TI-89 has surprisingly good symbolics tools and solvers for something that runs all year on a single set of AAA batteries. Feels like magic alien tech.
I used it a lot for calc, as it would show you how it got the answer, if I remember right. I also liked how it understands symbols, which is obvious in hindsight but still cool: you can paste an integral sign right in there.
> Some Chinese language source claims that it's a reaction to the Pakistan-US rare earth deal.
Maybe they approached India for a deal that was too lopsided in favour of the US for the former to accept, so the US did the show-and-tell cozying up to Pakistan to get a better deal while publicly shitting on India? Just follow the money?
Xi is getting China ready to attack Taiwan in 2026 or 2027, and the now mutual unwinding of economic relations between the US and China is underway. Still frenemies at this point, but Trump is aiming for more enemy status sooner because it causes media drama and draws attention to him. The US will be screwed because domestic production takes years to happen and it has lost most of its machine tool suppliers, knowledge, and workers. Manufacturing productivity is essential for any sort of war, as evidenced by the history of the American Civil War and WW II.
If the US doesn't impeach and remove Trump and Vance, and get a real, war-time leader who isn't a celebrity reality star ASAP, it will be doomed as China will rapidly seize Taiwan, disrupt Western chip production and plunge the West into an economic armageddon, and likely widen to a war with Japan who would definitely intervene militarily to defend economic technological resources in Taiwan. No more incompetent, self-destructive, corrupt, ideologue chaos can be tolerated.
From the title I thought they had solved math! Turns out to be a framework for using SMT solvers in decision-procedure-based proofs. For additional types, you still need to write the bridging part yourself. Interesting nonetheless.
Nice. When using OpenAI Codex CLI, I find the /compact command very useful for large tasks. In a way it's similar to the context editing tool. Maybe I can ask it to use a dedicated directory to simulate the memory tool.
Claude Code has had this /compact command for a long time; you can even specify your preferences for compaction after the slash command. But this is quite limited, and to get the best results out of your agent you need to do more than rely on how the tool decides to prune your context. I ask it explicitly to write down the important parts of our conversation into an md file, and I review and iterate over the doc until I'm happy with it. Then I /clear the context and give it instructions to continue based on the md doc.
Duolingo is useful, but not efficient. When people say they want to learn a language, they often mean they want to learn it efficiently, e.g. to be able to write an essay, like the post says, after a realistic period of time.
I personally don't believe its pedagogical deficiency is mere incompetence. The whole business model is to keep you on the platform as long as possible, so why would they make you learn faster rather than just fast enough to keep you there?
As a former long-time user, I observed a lot of mechanism changes that bear out this observation.