what bothers me is not that this issue will certainly disappear now that it has been identified, but that we have yet to identify the category of these "stupid" bugs ...
We already know exactly what causes these bugs. They are not a fundamental problem of LLMs; they are a problem of tokenizers. The actual model simply doesn't get to see the same text that you see. It can only infer this stuff from related info it was trained on. It's as if someone asked you how many 1s there are in the binary representation of this text. You'd also need to convert it first to think it through, or use some external tool, even though your computer has never seen anything but the binary form.
> It's as if someone asked you how many 1s there are in the binary representation of this text.
I'm actually kinda pleased with how close I guessed! I estimated 4 set bits per character, which with 491 characters in your post (including spaces) comes to 1964.
Then I ran your message through a program to get the actual number, and turns out it has 1800 exactly.
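For reference, that check is a couple of lines of Python. This is a generic sketch; the `msg` string below is just a stand-in for illustration, not the original post:

```python
# Count the set bits (1s) in the UTF-8 encoding of a piece of text.
def set_bits(text: str) -> int:
    return sum(bin(byte).count("1") for byte in text.encode("utf-8"))

msg = "how many 1s are there in the binary representation of this text?"
estimate = 4 * len(msg)  # the "~4 set bits per character" heuristic
actual = set_bits(msg)
# Printable ASCII averages a bit under 4 set bits per byte, which is
# why the heuristic lands close but slightly high (1964 vs. 1800 above).
```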
>I estimated 4 set bits per character, which with 491 characters in your post (including spaces) comes to 1964
And that's exactly the kind of reasoning an LLM does when you ask it about characters in a word. It doesn't come from the word, it comes from other heuristics it picked up during training.
Okay, genuinely not an expert on the latest with LLMs, but isn't tokenization an inherent part of LLM construction? Kind of like support vectors in SVMs, or nodes in neural networks? Once we remove tokenization from the equation, aren't we no longer talking about LLMs?
It's not a side effect of tokenization per se, but of the tokenizers people use in actual practice. If somebody really wanted an LLM that can flawlessly count letters in words, they could train one with a naive tokenizer (like just ascii characters). But the resulting model would be very bad (for its size) at language or reasoning tasks.
Basically it's an engineering tradeoff. There is more demand for LLMs that can solve open math problems, but can't count the Rs in strawberry, than there is for models that can count letters but are bad at everything else.
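To make the tradeoff concrete, here's a toy sketch. The subword vocabulary below is entirely made up for illustration (real tokenizers like BPE learn theirs from data): a character-level tokenizer exposes every letter to the model at the cost of long sequences, while a subword tokenizer compresses the sequence and hides spelling inside opaque tokens.

```python
# Character-level tokenizer: the model sees every letter.
def char_tokenize(text):
    return list(text)

# Hypothetical subword vocabulary, greedy longest-match tokenizer.
SUBWORD_VOCAB = ["straw", "berry", "count", " the", " in"]

def subword_tokenize(text):
    tokens, i = [], 0
    while i < len(text):
        for piece in sorted(SUBWORD_VOCAB, key=len, reverse=True):
            if text.startswith(piece, i):
                tokens.append(piece)
                i += len(piece)
                break
        else:  # no vocab entry matched; fall back to a single character
            tokens.append(text[i])
            i += 1
    return tokens

# "strawberry" is 10 tokens character-level but only 2 subword tokens,
# so the subword model never sees the three r's as separate symbols.
```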
Imagine if you had an auto cake making machine that decides on its own the best time to make cake. It adds the ingredients, stirs, turns the oven on, and leaves the finished cake on the counter for you.
People start opening bakeries consisting entirely of cakes baked by the automatic machines. The owners of these machines have no idea whether the cakes have a bit too much flour or were slightly over-stirred. In some cases, they haven't even tried the cakes.
Who gets to claim they made the cake?
By contrast, there are others who carefully tune their machines to make sure everything is perfect. They adjust the mixing settings and ingredient proportions. They experiment and iterate. They taste test throughout the process. And what they give to the public tastes every bit as good as a homemade cake.
The first group is creating slop. The second group, I think, is baking. And OP is in the second group.
Replace "oven" with a dish washer or a washing machine for your clothes. Those things do exactly all of this. Yet we still complain about washing clothes and doing the dishes, even though it is far less effort than anything our parents did, or their parents before them.
If you commission a baker, another person, with wants and desires of their own, is involved.
If you use an AI, there isn't.
Either way, it's clear that the author (yes, the author) put a lot of work into this by iterating and shaping it to what he wanted, and that's a lot more than sprinkles.
> If you commission a baker, another person, with wants and desires of their own, is involved.
> If you use an AI, there isn't.
What is the functional difference here? You are commissioning (see: prompting) someone (see: an AI) for a piece of work, or artwork, or whatever. The output is out of your control, and I don't think the presence or absence of a human on the other end materially matters.
If we had hyper-advanced ovens from The Jetsons where we could type a prompt using a fold-out keyboard and it would magically generate whatever cake we ask of it: did we or did we not bake that cake? And I do not think it is clear the author put a lot of work iterating and shaping it into what he wanted; we have zero insight into that.
I didn't say the difference was functional. If you don't think the presence of a human on the other end matters (materially or not), feel free to continue this conversation with an LLM simulation of me. You can even prompt it so that you logically triumph and convince "me".
I'm asking you to explain what the actual difference is and you're avoiding the question.
If we had a complete black box where you submitted Prompt and out came Thing, and you had zero clue what said black box actually did, could you claim creation over Thing? What does knowing that it's a human vs LLM make materially different in terms of whether or not you created it?
Why would I give him the same credit I would give a writer?
Or why would I give a writer the same credit I would give someone who created the AI prompts and scaffolding to generate this?
Being unhappy about not being able to call oneself an author ends up betraying a lack of confidence in the work or process.
In the end writer, dancer, actor, whatever - these titles come from their impact.
There will be a different name for this, and eventually there will be something made that is good enough that people will be spellbound. At which point it's going to be named something else.
Ironically, the story can be read as gesturing in that direction, as it's ostensibly about giving a new title to a particular job.
In general, though, I think part of the mistake people keep making is that they try to imitate what would be valuable to engage with if a human wrote it, in an attempt to claim the role of an author of a book or whatever. There are likely art forms that are unique to what an LLM can facilitate, but trying to imitate human art forms is going to give you stunted results. The AI is very good at imitating the form but not the substance.
Once we stop trying to generate and pass off AI essays, novels, choose your own adventure stories, and all the other human genres as being human writing, we'll have a chance to figure out actually interesting artistic forms.
> Creating something without the effort previous works involved can and does affect the context and understanding of it
Not really. Unless you place value on _effort_ itself, rather than judging objectively by outcome. Someone digging a hole with a spoon doesn't make it a better hole than one dug with a jackhammer.
I maintain that the work itself - that is, the content of what is being expressed - is the sole basis for judging how good the work is. Not the authorship, LLM usage or otherwise.
The context exists whether it's LLM generated or not, because the context sits broadly in society, culture, and manifests in the mind of the reader.
> how would LLMs fare when the content of the work itself is about “Something made by a human”.
It would fare just as well as if the same words had been written by a human, provided the content is sound and meaningful - conversely, slop is slop, regardless of whether it was written by an LLM or a human.
My point in the grandparent post is that there's a lot of blind discrimination based on the origin of a work - if it was written by or with the help of an LLM, then it automatically deserves less attention, and/or its content's worth is diminished. All without actually discussing the content.
Losers, clueless, never had to be productive, just scapegoats. But now losers don't get that buffer window to try and become sociopaths; they just don't get hired at all.
Based on Karpathy's writeup, the autoresearch agent would not have found this. He tells the agent to improve the model and training loop with a five-minute time limit, but honestly this "hack" is so far out of distribution that it seems really unlikely an agent would find it.
Adding, swapping, or duplicating layers has a long history (e.g. StyleGAN, upcycling), and it was pointed out at least as far back as He et al 2015 (ResNets) that you could ablate or add more layers, because they functioned more as doing some incremental compute iteratively, and many of them were optional. (Or consider Universal Transformers, or heck, just how BPTT works.) So this idea is not far out of distribution, if at all, especially if you're an LLM who knows the literature and past approaches (which most humans would not, because they only just got into this area post-ChatGPT).
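As a toy illustration of why duplicating residual layers is relatively benign: the numbers and the residual form here are contrived for the sketch, not taken from any of the papers mentioned above.

```python
# Toy residual "network": each layer computes x + g(x) with a small g,
# i.e. it is close to the identity function.
def make_layer(w):
    return lambda x: x + w * x  # residual update: x -> (1 + w) * x

layers = [make_layer(0.01) for _ in range(4)]

def forward(layers, x):
    for f in layers:
        x = f(x)
    return x

base = forward(layers, 1.0)

# "Depth upcycling": duplicate each layer in place, doubling depth.
doubled = forward([f for f in layers for _ in (0, 1)], 1.0)

# Because each layer is near the identity, the deepened network's output
# drifts only slightly from the original instead of blowing up, which is
# the intuition behind ablating or duplicating ResNet-style layers.
```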
My opinion is you'd have to go pretty far down the x axis to get to anything beyond tinkering with batch size, learning rate, or positional encodings. There are so many hyperparameter knobs already exposed that duplicating layers is unlikely to be proposed for a long time.
I also just noticed that the last change it applied was changing the random seed. Lol.
My understanding was that Autoresearch was defined as training from scratch (since it's based on the nanogpt speedrun), not using any pretrained models. So it couldn't do anything like upcycling a pretrained model or the Frankenmerge, because it's not given any access to such a thing in the first place. (If it could, the speedrun would be pointless as it would mostly benchmark what is the fastest fileserver you can download a highly compressed pretrained model checkpoint from...) It can increase the number of layers for a new architecture+run, but that's not the same thing.
If you need to do the latter to be able to make money on the former, then you're not making money. Because if the latter requirement would disappear, inference margins would also drop.
At the end of the day, they're still burning cash. Even if inference is cheap, it's also not hard to compete on. They aren't going to be a trillion-dollar inference company.
Eventually there will be a race to the bottom on inference price to the customer by companies that aren't trying to subsidize their GPU investments.
OpenAI is spending money because they think they need to for their business to survive. They're hoping that the next big breakthrough just requires more compute and, somehow, that'll build them a moat.
OpenAI and quite honestly the others think they are in a race to AGI not the bottom. That's why they aren't concerning themselves with moats or cost. This is quite simply a massive bet that we've already cracked AGI and the rest is just funding the engineering to make it happen.
I personally think we haven't cracked AGI yet but it doesn't change their calculus.
Your goal.md examples are all features for the existing codebase. Any largish goal.md examples where your system is able to one-shot a pretty large app?
The goal.md is what makes this thing either amazing or terrible for the user, so any guidelines or clear examples on writing a good one would go a long way.
Author here! Good suggestion; we should probably come up with some GOAL.md examples. That said, one-shotting a pretty large app is a somewhat doable task, and that's one of the reasons we introduced the interview step: exactly to let the model pull from you what it needs to know to work autonomously (instead of you pushing a spec document into the model).
I think it would be technically possible, although with videos a tiny change in pixels at 1000x zoom causes the whole screen to flash different sub-videos rapidly. An infinite photo-zooming effect would remain more consistent.