Hacker News | saeranv's comments

> Why is the 'true process' changing here? I understand our best guess or model is changing with new observations, but the true process should not be changing. If it actually is, then the formulation should be changed to isolate the parameters that is feeding back to it.

He's not saying the true process is changing, just the functions that are being sampled from the GP. The true process refers to the true, underlying function so it's deterministic if you have correctly identified all its inputs.

> So is the shape of each function changing?

Yes, the function changes shape as you get more data because the parameters governing that function (that we define in the kernel) are updated with new observational data, so that over time it converges to the 'true' process/function we are trying to discover.

> What is the 'distribution' over the functions doing? Is that also changing? Is the said 'distribution' just flat mean of these functions?

I think you're confused because the example given with cheese is really confusing when we're trying to understand the functions as arising from a multivariate distribution. So, I'll try to clarify that part. GPs are typically used to represent some function where the input is time or distance. This is why it's called a 'process': the variables in a random process are indexed by space or time. So in this 1D example, in the X domain, [x1, x2, x3] represents something like fixed increments of increasing cheese. f(X) represents the gold amount. Now imagine gold can take any value from 0-100. Now plot all possible values of f(x1) on the x-axis of a grid, f(x2) on the y-axis of the grid, and f(x3) on the z-axis of the grid. Treating gold as roughly 100 discrete levels, we have about 100^3 points in this 3D grid. If we select one point, its x, y, z coordinates correspond to the f(x1), f(x2) and f(x3) gold amounts. The dimension index corresponds (typically) to something like time or distance. In this example it's cheese.

In a GP, we're modeling the sampled f(X) point as if it's from a 3D multivariate normal distribution. So sampling one point gives us the gold amount for cheese amount 1, 2, and 3. This is the 'function', and as we sample more points, we get more 'functions' that give us varying gold amounts for cheese amount 1, 2, and 3. And because it's a multivariate distribution, we can capture correlations between dimensions, so the amount of gold you get for cheese-1 should influence how much gold you get at cheese-2 because it's close by. This relationship is defined by the covariance function of the Gaussian.
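
If it helps, here's a minimal sketch of that sampling step. The zero mean, the RBF (squared-exponential) kernel, and the length scale are my assumptions for illustration, not something the article specifies, and the sampled "gold" values are centered on zero rather than scaled to 0-100.

```python
import numpy as np

def rbf_kernel(xa, xb, length_scale=1.0, variance=1.0):
    """Squared-exponential covariance between every pair of inputs."""
    diff = xa[:, None] - xb[None, :]
    return variance * np.exp(-0.5 * (diff / length_scale) ** 2)

# Three cheese amounts, standing in for x1, x2, x3.
X = np.array([1.0, 2.0, 3.0])
mean = np.zeros(3)          # m(x) = 0, an assumption for this sketch
cov = rbf_kernel(X, X)      # 3x3 covariance matrix built from the kernel

rng = np.random.default_rng(0)
# Each row is one sampled "function": gold amounts at cheese 1, 2 and 3.
samples = rng.multivariate_normal(mean, cov, size=5)

print(cov.round(3))
print(samples.round(2))
```

The covariance between cheese-1 and cheese-2 comes out larger than between cheese-1 and cheese-3, which is the "close by" effect: nearby inputs get more strongly correlated outputs.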

> GP(m(x), k(x, x')) What is 'x' here? (Sigh! We need to learn to define the variables before using.) I can infer that x' is not derivative of x.

x refers to some amount of cheese, and k(x, x') just means that the kernel consumes any two values in our X vector (e.g. [x1, x3] or [x1, x2]).

> "In the context of GPs, a kernel or covariance function k(x, x') = Cov(f(x), f(x')), encodes which function values should vary together." It does not seem the 'f' here is intended to be the specific 'f' introduced at the beginning of the article.

I believe it is the same f actually. He's saying the kernel function takes in two values of x (cheese), and outputs the covariance between their output gold amounts. This illustrates his previous point that the "closeness" between x values should be reflected in the gold amounts.

> The plots now have y and x, and x1 and x2. How are these related?

y is gold. x is cheese. x1, x2 correspond to the first two x-values in the linear plot.

> And with k(x, x') = Cov(f(x), f(x')), what is 'f' for the various kernel functions being plotted.

f(X) is the approximation of the "true" process we're trying to learn from observational data. The observations are tuples of cheese and gold amounts, so f(x) and f(x') are just the corresponding gold amounts; we don't actually model that function explicitly. The Gaussian distribution we are sampling functions from just models correlations between our variables, so it represents the function implicitly.


Thanks. I read several times, and along with another response, I think I have a better understanding now, though still not having a complete grasp.

>> So sampling one point gives us the gold amount for cheese amount 1, 2, and 3. This is the 'function', and ...

I get this part, so each point in this N-dimensional space yields a function f of the index, and this is the function.

>> Yes, the function changes shape as you get more data because the parameters governing that function

Getting more data should now get more such points (in N-dimensional space), but with each such point being the 'function', how is it changing shape?

Nevertheless, I think I have much better glimpses after reading your and other responses here than from the original article, which I still find confusing even on reading again.


I said before that the function shape changes as you're updating the parameters that govern the function, but that's actually very misleading (sorry), since the kernel parameters only indirectly govern the function. What the parameters directly govern is the joint probability distribution P(f(x1), f(x2), ..., f(xn)). So the function f is implicitly defined by how likely the entire sequence of f values is.

So how does it change shape? Well, this part is actually something I don't fully grasp myself yet. But I can sketch a crude Bayesian interpretation, which is how I think of it. Not completely correct, but it works as a placeholder until I fully work out the math of updating the parameters.

Basically, from a Bayesian perspective we can treat the joint distribution of function outputs as a likelihood conditioned on the kernel parameters theta: p(f(x1), f(x2), ... | theta).

Then we can derive the posterior distribution over theta, p(theta | f(x1), f(x2), ...), like so:

p(theta | f(x1), f(x2), ...) ∝ p(f(x1), f(x2), ... | theta) p(theta).

So we fit the theta parameters based on how well they explain the observed data we feed our Bayesian model.
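
Here's a crude numerical version of that update, just to make it concrete. Everything specific here is assumed for illustration: a zero-mean GP, an RBF kernel whose length scale plays the role of theta, a flat prior, made-up observations, and a small noise term so the matrix stays invertible. A grid over theta stands in for proper inference.

```python
import numpy as np

def rbf_kernel(xa, xb, length_scale):
    """Squared-exponential covariance between every pair of inputs."""
    diff = xa[:, None] - xb[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)

def log_likelihood(x, f, length_scale, noise=1e-4):
    """log p(f(x1), ..., f(xn) | theta) for a zero-mean GP."""
    n = len(x)
    K = rbf_kernel(x, x, length_scale) + noise * np.eye(n)
    L = np.linalg.cholesky(K)          # K = L @ L.T
    alpha = np.linalg.solve(L, f)      # alpha @ alpha == f.T @ inv(K) @ f
    return (-0.5 * alpha @ alpha
            - np.log(np.diag(L)).sum()  # equals -0.5 * log det K
            - 0.5 * n * np.log(2 * np.pi))

# Hypothetical observations (cheese amounts and gold amounts).
x = np.linspace(0.0, 5.0, 8)
f = np.sin(x)

thetas = np.linspace(0.2, 5.0, 50)   # candidate length scales
log_prior = np.zeros_like(thetas)    # flat prior p(theta), an assumption
log_post = np.array([log_likelihood(x, f, t) for t in thetas]) + log_prior

# Normalize over the grid: posterior is proportional to likelihood * prior.
post = np.exp(log_post - log_post.max())
post /= post.sum()
best_theta = thetas[np.argmax(post)]
print(best_theta)
```

Feeding in more (x, f) pairs changes `log_post`, which shifts the posterior over theta, which in turn changes which function shapes the GP considers likely.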

FWIW, I recommend chapter 14 of Richard McElreath's Statistical Rethinking for a better introduction to GPs. This article kind of glosses over a lot of the intuition and introductory concepts that you need to really grok it.


Greg Brockman honestly sounds like a psychopath:

> In 2017, Amodei hired Page Hedley, a former public-interest lawyer, to be OpenAI’s policy and ethics adviser. In an early PowerPoint presentation to executives, Hedley outlined how OpenAI might avert a “catastrophic” arms race—perhaps by building a coalition of A.I. labs that would eventually coördinate with an international body akin to NATO, to insure that the technology was deployed safely. As Hedley recalled it, Brockman didn’t understand how this would help the company beat its competitors. “No matter what I said,” Hedley told us, “Greg kept going back to ‘So how do we raise more money? How do we win?’ ” According to several interviews and contemporaneous records, Brockman offered a counterproposal: OpenAI could enrich itself by playing world powers—including China and Russia—against one another, perhaps by starting a bidding war among them. According to Hedley, the thinking seemed to be, It worked for nuclear weapons, why not for A.I.?


For me, it was when I found out about Greg Brockman's MAGA donations. From Wikipedia (https://en.wikipedia.org/wiki/Greg_Brockman#Personal_life):

Brockman and his wife were the biggest donors to Donald Trump's Super PAC, MAGA Inc., in 2025 with each of them donating US$12.5 million. Brockman and his wife also donated $50 million to Leading the Future, a super PAC dedicated to AI deregulation that he helped found with Andreessen Horowitz co-founders Marc Andreessen and Ben Horowitz.


This is awesome. How do you integrate morphology into the simulation? Does morphology affect movement (via area friction or mass impact on momentum) or metabolism (via area/volume ratio)?


Thanks! Morphology does affect both movement and metabolism, but by design it’s modeled as biological abstraction rather than full physics.

Traits like spikes, tentacles, lobes, adhesion, and size add drag/weight, so complex or armored bodies are slower and less agile. Bigger bodies and complex structures also have higher metabolic upkeep. It’s not true fluid dynamics or area/volume math, but it’s tuned to produce the same evolutionary tradeoffs.

So you get: big and complex = strong but slow and hungry. Small and simple = fast and cheap but fragile.



That presumes that languages didn't evolve independently across different communities. The fact that different ancient languages have completely different grammatical structures, for example, provides some evidence of this.


> The fact that different ancient languages have completely different grammatical structures, for example, provides some evidence of this

It really doesn't provide that evidence. Proto-Afroasiatic, the oldest agreed-upon hypothetical proto-language, probably only dates back 18,000 years. The modern brain, vocal, and tongue structures linked to complex speech were in place 100,000 years ago, and it's thought that complex speech was in place by the time Homo sapiens left Africa 50,000-70,000 years ago. That's a long time for grammar to diverge. Just in recorded history, plenty of languages have gained and lost very complex grammatical features. Old Chinese, for example, was not a tonal language, but evolved tones. Small isolated languages can change rapidly, and trade languages tend to simplify.


A simple counter example here is instinctual behaviour. A sea turtle is born, and with little to no guidance, experimentation, or exploration heads to the sea. That knowledge is embedded at birth.

I think the analogy of the brain as hardware devices ("neural processor", "I/O devices", etc.) is misleading. I think I understand the very strict mind-matter dualism you're alluding to here. But so far attempts at using actual computer hardware to reproduce human-like cognition have gotten nowhere close, despite consuming orders of magnitude more energy and data.


> The idea is that if you can produce an accurate probably distribution over the next bit/byte/token...

But how can you get credible probability distributions from the LLMs? My understanding is that the outputs specifically can't be interpreted as a probability distribution, even though superficially they resemble a PMF, due to the way the softmax function tends to predict close to 100% for the predicted token. You can still get an ordered list of most probable tokens (which I think beam search exploits), but they specifically aren't good representations of the output probability distribution since they don't model the variance well.


My understanding is that minimizing perplexity (what LLMs are generally optimized for) is equivalent to finding a good probability distribution over the next token.
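
For reference, perplexity is the exponential of the average negative log-probability the model assigned to the tokens that actually occurred. A minimal sketch, with made-up probabilities rather than output from any real model:

```python
import math

def perplexity(token_probs):
    """exp of the mean negative log-probability of the observed tokens."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# Probabilities the model gave to the token that really came next (hypothetical).
confident_and_right = [0.9, 0.8, 0.95]
hedging = [0.25, 0.25, 0.25, 0.25]

print(perplexity(confident_and_right))
print(perplexity(hedging))  # about 4: like guessing uniformly among 4 options
```

Lower perplexity means the predicted distribution put more mass on what actually happened, so driving it down pushes the model's distribution toward the data's.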


I think they are accounting for the entire context, they specifically write out:

>> P(next_word|previous_words)

So the "next_word" is conditioned on "previous_words" (plural), which I took to mean the joint distribution of all previous words.

But, I think even that's too reductive. The transformer is specifically not a function acting as some incredibly high-dimensional lookup table of token conditional probabilities. It's learning a (relatively) small number of parameters to compress those learned conditional probabilities into a radically lower-dimensional embedding.
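
To put rough numbers on "radically lower-dimensional": the sizes below are hypothetical round figures I picked to show the scale, not the specs of any particular model.

```python
import math

# Hypothetical sizes, chosen only to show the scale of the gap.
vocab_size = 50_000      # distinct tokens
context_len = 1_000      # tokens of "previous_words"
param_count = 10**11     # parameters in a large model

# A literal lookup table needs one row per possible context,
# i.e. vocab_size ** context_len rows. Work in log10 to keep it readable.
log10_contexts = context_len * math.log10(vocab_size)
log10_params = math.log10(param_count)

print(f"possible contexts ~ 10^{log10_contexts:.0f}")
print(f"parameters        ~ 10^{log10_params:.0f}")
```

Under those assumptions the table would need on the order of 10^4699 rows against roughly 10^11 parameters, so whatever the model is doing, it can't be storing the table.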

Maybe you could describe this as a discriminative model of conditional probability, but at some point, we start describing that kind of information compression as semantic understanding, right?


It's reductive because it obscures just how complicated that `P(next_word|previous_words)` is, and it obscures the fact that "previous_words" is itself a carefully-constructed (tokenized & vectorized) representation of a huge amount of text. One individual "state" in this Markov-esque chain is on the order of an entire book, in the bigger models.


It doesn't matter how big it is, its properties don't change. E.g., it never says, "I like what you're wearing" because it likes what I'm wearing.

It seems there's an entire generation of people taken in by this word, "complexity", and it's just magic sauce that gets sprinkled over ad-copy for big tech.

We know what it means to compute P(word|words), we know what it means that P("the sun is hot") > P("the sun is cold") ... and we know that by computing this, you aren't actually modelling the temperature of the sun.

It's just so disheartening how everyone becomes so anthropomorphically credulous here... can we not even get sun worship out of tech? Is it not possible for people to understand that conditional probability structures do not model mental states?

No model of conditional probabilities over text tokens, no matter how many text tokens it models, ever says, "the weather is nice in august" because it means the weather is nice in august. It has never been in an august; or in weather; nor does it have the mental states for preference, desire... nor has its text generation been caused by the august weather.

This is extremely obvious, as in, simply reflect on why the people who wrote those historical texts did so... and reflect on why an LLM generates this text... and you can see that even if an LLM produced word-for-word MLK's "I Have a Dream" speech, it does not have a dream. It has not suffered any oppression; nor organised any labour; nor made demands on the moral conscience of the public.

This shouldn't need to be said to a crowd who can presumably understand what it means to take a distribution of text tokens and subset them. It doesn't matter how complex the weight structure of an NN is: this tells you only how compressed the conditional probability distribution is over many TBs of all of text history.


You're tilting at windmills here. Where in this thread do you see anyone talking about the LLM as anything other than a next-token prediction model?

Literally all of the pushback you're getting is because you're trivializing the choice of model architecture, claiming that it's all so obvious and simple and it's all the same thing in the end.

Yes, of course, these models have to be well-suited to run on our computers, in this case GPUs. And sure, it's an interesting perspective that maybe they work well because they are well-suited for GPUs and not because they have some deep fundamental meaning. But you can't act like everyone who doesn't agree with your perspective is just an AI hypebeast con artist.


ah, well there's actually two classes of replies and maybe i'm confusing one for the other here.

My claim regarding architecture follows just formally: you can take any statistical model trained via gradient descent and phrase it as a kNN. The only difference is how hard it is to produce such a model from fitting to data, rather than from rephrasing.

The idea that there's something special about architecture is, really, a hardware illusion. Any empirical function approximation algorithm, designed to find the same conditional probability structure, will in the limit t->inf, approximate the same structure (ie., the actual conditional joint distribution of the data).


I think I see the crux of the disagreement.

> The idea that there's something special about architecture is, really, a hardware illusion. Any empirical function approximation algorithm, designed to find the same conditional probability structure, will in the limit t->inf, approximate the same structure (ie., the actual conditional joint distribution of the data).

But it's not just about hardware. Maybe it would be, if we had access to an infinite stream of perfectly noise-free training data for every conceivable ML task. But we also need to worry about actually getting useful information out of finite data, not just finite computing resources. That's the limit you should be thinking about: the information content of input data, not compute cycles.

And yes, when trying to learn something as tremendously complicated as a world-model of multiple languages and human reasoning, even a dataset as big as The Pile might not be big enough if our model is inefficient at extracting information from data. And even with the (relatively) data-efficient transformer architecture, even a huge dataset has an upper limit of usefulness if it contains a lot of junk noise or generally has a low information density.

I put together an example that should hopefully demonstrate what I mean: https://paste.sr.ht/~wintershadows/7fb412e1d05a600a0da5db2ba.... Obviously this case is very stylized, but the key point is that the right model architecture can make good use of finite and/or noisy data, and the wrong model architecture cannot, regardless of how much compute power you throw at the latter.
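
Separately from the linked paste (not reproduced here), the same point can be sketched in a few lines. This is my own stylized setup with assumed details: the true process is a sine curve, the training set is 30 noisy points, the "right" architecture is a two-parameter sinusoid fit by least squares, and the "wrong" one is 1-nearest-neighbour.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_fn(x):
    return np.sin(x)

# Small, noisy training set (finite data with a low signal-to-noise ratio).
x_train = rng.uniform(0.0, 2 * np.pi, 30)
y_train = true_fn(x_train) + rng.normal(0.0, 0.5, 30)
x_test = np.linspace(0.0, 2 * np.pi, 200)

# "Right" architecture: y = a*sin(x) + b*cos(x), fit by least squares.
A = np.column_stack([np.sin(x_train), np.cos(x_train)])
coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)
pred_sin = np.column_stack([np.sin(x_test), np.cos(x_test)]) @ coef

# "Wrong" architecture: 1-nearest-neighbour, which reproduces the noise.
nearest = np.abs(x_test[:, None] - x_train[None, :]).argmin(axis=1)
pred_nn = y_train[nearest]

# Error against the noise-free process, not against the noisy labels.
mse_sin = np.mean((pred_sin - true_fn(x_test)) ** 2)
mse_nn = np.mean((pred_nn - true_fn(x_test)) ** 2)
print(mse_sin, mse_nn)
```

The nearest-neighbour lookup is already exact on its training points, so extra compute doesn't help it; its error is pinned near the noise level until it gets more or cleaner data. The sinusoid model averages the noise away with the same 30 points.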

It's Shannon, not Turing, who will get you in the end.


text is not a valid measure of the world, so there is no "informative model", i.e., a model of the data generating process to fit it to. there is no sine curve, indeed there is no function from world->text -- there is an infinite family of functions, none of which is uniquely sampled by what happens to be written down

transformers, certainly, aren't "informative" in this sense: they start with no prior model of how text would be distributed given the structure of the world.

these arguments all make radical assumptions that we are in something like a physics experiment -- rather than scraping glyphs from books and replaying their patterns


Perhaps you have misunderstood what the people you are talking about mean?

Or, if not, perhaps you are conflating what they mean with something else?

Something doesn’t need to have had a subjective experience of the world in order to act as a model of some parts of the world.


I read through the whole LW post, and think there's enough troubling evidence here that she shouldn't be dismissed. It certainly shouldn't be flagged.

I initially was leaning to this being a high possibility of a delusion springing from a mentally unstable person, for all the reasons other commentators are mentioning. But, two things in particular struck me that changed my mind:

1. She apparently mentioned the abuse to her mother as a child.

2. She describes childhood behaviour consistent with someone who has experienced sexual abuse (i.e. thoughts of suicide, weird night behaviour like taking baths, body issues as she got older).

A small child doesn't have any incentive to make accusations, or to pretend to have been assaulted. If true, this should be taken seriously. Her mother is still alive, and there may be doctors, relatives or others that would be able to substantiate these points.

Finally, why has this post (and previous related posts) been repeatedly flagged? It's very troubling, I expect this from some HN users, but would have thought the HN moderators would have unflagged (or reposted) it upon consideration of the seriousness, importance of the subject matter, and undeniable relevance to the tech industry. At minimum, you would think someone would have unflagged them to avoid the appearance of bias and favorable treatment to the former YC president. At this point HN looks really sleazy.


I share your opinion on the likely veracity of the allegations and would also like an explanation for the flagging of this post and the repeated deletions of similar posts.

Why is HN taking a side here?

https://twitter.com/JOSourcing/status/1710390512455401888


Dang actually 'answered' this question yesterday on a six-week-old flagged and buried thread [0]:

> Users flagged it. (That's the usual answer to this question, as explained in the FAQ: https://news.ycombinator.com/newsfaq.html.)

> (Posting this belatedly because the question has been coming up today.)

I know that doesn't actually answer your question. So does dang.

> undeniable relevance to the tech industry

Absolutely. I really hope dang reconsiders their inaction. They've been remarkably consistent on this topic though.

0 - https://news.ycombinator.com/item?id=37785072


It’s also a thing that I know for sure happens in families and gets swept under the rug. The trauma it causes is deep and complex and I don’t think as a society we have any idea how often it happens. My bet is it’s far more common than we know.


Even discounting the SA bits, the financial stuff she alleges is really douchey.

But, I don’t even know if this person is actually his sister.

On the internet, no one knows you’re a dog.


Sam acknowledged her as his sister. It would have been called out far earlier if false.


I’m pretty sure he’s been on her podcast, and she’s posted photos of them together


> behaviour consistent with someone who has experienced sexual abuse (i.e. thoughts of suicide, weird night behaviour like taking baths, body issues as she got older)

Can you link to something about it? That behavior rings a bell

> Finally, why has this post (and previous related posts) been repeatedly flagged? It's very troubling, I expect this from some HN users

I'm not too sure if it's concerning YC in some way or just the techbro crowd being itself. Downvotes are typical here with child abuse related topics in general and especially if it concerns tech. But also there's a possibility that's just a random person and not really his sister.

