The problem is that this is completely false. LLMs are actually deterministic. There are a lot more input parameters than just the prompt. If you're using a piece of shit corpo cloud model, you're locked out of managing your inputs because of UX or whatever.
Ah, we've hit the rock bottom of arguments: there's some unspecified ideal LLM model that is 100% deterministic that will definitely 100% do the same thing every time.
We've hit rock bottom of rebuttals, where not only is domain knowledge completely vacant, but you can't even be bothered to read and comprehend what you're replying to. There is no non-deterministic LLM. Period. You're already starting off from an incoherent position.
Now, if you'd like to stop acting like a smug ass and be inquisitive as per the commenting guidelines, I'd be happy to tell you more. But really, if you actually comprehended the post you're replying to, there would be no need since it contains the piece of the puzzle you aren't quite grasping.
Strange then that the vast majority of LLMs that people use produce non-deterministic output.
Funnily enough I had literally the same argument with someone a few months back in a friend group. I ran the "non-shitty non-corpo completely deterministic model" through ollama... And immediately got two different answers for the same input.
> Now, if you'd like to stop acting like a smug ass and be inquisitive as per the commenting guidelines,
Ah. Commenting guidelines. The ones that tell you not to post vague allusions to something, not to be dismissive of what others are saying, to respond to the strongest plausible interpretation of what someone says, etc.? Those ones?
> Strange then that the vast majority of LLMs that people use produce non-deterministic output.
> I ran the "non-shitty non-corpo completely deterministic model" through ollama... And immediately got two different answers for the same input.
With deterministic hardware in the same configuration, using the same binaries, providing the same seed, the same input sequence to the same model weights will produce bit-identical outputs. Where you can get into trouble is if you aren't actually specifying your seed, or with non-deterministic hardware in varying configurations, or if your OS mixes entropy with the standard pRNG mechanisms.
Inference is otherwise fundamentally deterministic. In implementation, certain things like thread-scheduling and floating-point math can be contingent on the entire machine state as an input itself. Since replicating that input can be very hard on some systems, you can effectively get rid of it like so:
ollama run [whatever] --seed 123 --temperature 0 --num-thread 1
A note that "--temperature 0" may not strictly be necessary. Depending on your system, setting the seed and restricting to a single thread will be sufficient.
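If you want to check this yourself, here's a minimal sketch (assuming a local Ollama server on the default port and some already-pulled model; the model name below is a placeholder) that sends the same request a few times with a pinned seed, temperature 0 and a single thread, and compares the outputs byte-for-byte:

    import json, urllib.request

    URL = "http://localhost:11434/api/generate"  # default Ollama endpoint

    def generate(prompt):
        # seed, temperature and num_thread pin down sampling and reduction order
        payload = {
            "model": "llama3",  # placeholder: any locally pulled model
            "prompt": prompt,
            "stream": False,
            "options": {"seed": 123, "temperature": 0, "num_thread": 1},
        }
        req = urllib.request.Request(URL, data=json.dumps(payload).encode(),
                                     headers={"Content-Type": "application/json"})
        with urllib.request.urlopen(req) as resp:
            return json.loads(resp.read())["response"]

    runs = {generate("Explain what a seed does, in one sentence.") for _ in range(5)}
    print("identical across runs:", len(runs) == 1)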
These flags don't magically change LLM formalisms. You can read more about how floating point operations produce non-determinism here:
In this context, forcing single-threading bypasses FP-hardware's non-associativity issues that crop up with multi-threaded reduction. If you still don't have bit-replicated outputs for the same input sequence, either something is seriously wrong with your computer or you should get in touch with a reputable metatheoretician because you've just discovered something very significant.
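To see the non-associativity point in isolation, here's a tiny sketch with no LLM involved: summing the same floats in two different orders, which is exactly what differently scheduled parallel reductions do, generally gives results that differ in the low bits.

    import random

    random.seed(0)
    xs = [random.uniform(-1e10, 1e10) for _ in range(100_000)]

    a = sum(xs)        # one reduction order (left to right)
    ys = xs[:]
    random.shuffle(ys)
    b = sum(ys)        # same numbers, different reduction order

    print(a == b)      # typically False
    print(abs(a - b))  # small, but nonzero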
> Those ones?
Yes those ones. Perhaps in the future you can learn from this experience and start with a post like the first part of this, rather than a condescending non-sequitur, and you'll find it's a more constructive way to engage with others. That's why the guidelines exist, after all.
> These flags don't magically change LLM formalisms. You can read more about how floating point operations produce non-determinism here:
Basically what you're saying is "for 99.9% of use cases and how people use them they are non-deterministic, and you have to very carefully work around that non-determinism to the point of having workarounds for your GPU and making them even more unusable"
> In this context, forcing single-threading bypasses FP-hardware's non-associativity issues that crop up with multi-threaded reduction.
Translation: yup, they are non-deterministic under normal conditions. Which the paper explicitly states:
--- start quote ---
existing LLM serving frameworks exhibit non-deterministic behavior: identical inputs can yield different outputs when system configurations (e.g., tensor parallel (TP) size, batch size) vary, even under greedy decoding. This arises from the non-associativity of floating-point arithmetic and inconsistent reduction orders across GPUs.
--- end quote ---
> If you still don't have bit-replicated outputs for the same input sequence, either something is seriously wrong with your computer or you should get in touch with a reputable metatheoretician because you've just discovered something very significant.
Basically what you're saying is: If you do all of the following, then the output will be deterministic:
- workaround for GPUs with num_thread 1
- temperature set to 0
- top_k to 0
- top_p to 0
- context window to 0 (or always do a single run from a new session)
> The problem is that this is completely false. LLMs are actually deterministic. There are a lot more input parameters than just the prompt. If you're using a piece of shit corpo cloud model, you're locked out of managing your inputs because of UX or whatever.
When you decide to make up your own definition of determinism, you can win any argument. Good job.
It's important to note that not every research area ends up producing a surface language, and oftentimes research projects remain in progress for a long time. There does exist a freely available research implementation of a 1ML interpreter (though slightly behind the language's formalization) offered by the author:
The thing is that this is a research prototype, not a real compiler. It's not usable to the same degree as a language like SML or Haskell. There is a lot more work beyond a grammar that goes into creating a compiler for a high-level language.
I'm someone who has used Prolog in the past, but this is the first time I'm learning of Futamura's work[1]. I knew Prolog was great for building executable grammars, but I hadn't ever really tried to do so, and thus have no knowledge of the usual techniques. What an absolutely fascinating methodology; I can see exactly how it maps to Prolog.
As a pragmatic type, I find it endlessly disappointing how many other pragmatic types have absolutely zero familiarity or grounding in even the surface-level theoretical work that academic types are doing.
Yes, that's an invasion! And the Japanese invaded the Aleutian Islands during WWII. Taking even a barely inhabited frozen rock by force is an invasion. Has nothing to do with the scale of destruction.
It's a confluence of various factors. Explosive population growth, for example. The modern economy (in which fiat currency plays a pivotal role) relies on that of course, as the lending system is a bet on future growth. If that fails, the whole thing can enter a state of catastrophic failure. But population growth came first. Fiat currency, bureaucratization, etc. were adopted as reactions to increasingly explosive populations and to unchecked rationalism developing the absolutely ridiculous modern state system.
If you want demons to point a finger at, you're going to have to look further back in time than the 20th century. Then and now we're just doing a frantic tap dance to keep what we inherited from catching on fire.
Huh, what? Population increased a lot in the 19th century, and many countries did not have fiat currencies back then; and the price level mostly went down slowly as the population grew.
(Modern-day 2%-ish stable inflation is mostly fine for the economy, even if it technically erodes the value of money in the long term. The classic pre-WW1 gold standard was also fine-ish. The Frankenstein gold-standard-ish system they had until the 1970s was bad. And so was the rampant inflation that followed for a while.)
I specifically mentioned that population growth precedes fiat currency. Where's your confusion? I'm explicitly telling you to broaden your perspective and look at overarching political currents across the centuries following the Renaissance. For instance, many countries were also not so extensively bureaucratized, particularly in how they interfaced with the public, until the late 19th and early 20th centuries.
Political evolution is spread over many years and is structurally anisotropic. Metallism's death was inevitable by the 18th century at best, but don't misunderstand that to mean it was going to happen immediately. It's also just a symptom. The Enlightenment's political revolution is a manifold spread across centuries. Don't just look at the symptoms; you won't understand anything, and it will lead you to half-baked conclusions.
Getting off to images of child abuse (simulated or not) is a deep violation of social mores. This itself does indeed constitute a type of crime, and the victim is taken to be society itself. If it seems unjust, it's because you have a narrow view of the justice system and what its job actually is (hint: it's not about exacting controlled vengeance).
It may shock you to learn that bigamy and sky-burials are also quite illegal.
I'm not sure I agree with this specific reasoning. Consider this: any given image viewer can display CSAM. Is it a CSAM viewer? Do you have a moral responsibility to make it refuse to display CSAM? We can extend this to anything from graphics APIs to data storage, etc.
There's a line we have to define that I don't think really exists yet, nor is it supported by our current mental frameworks. To that end, I think it's just more sensible to simply forbid it in this context without attempting to ground it. I don't think there's any reason to rationalize it at all.