I care that it's within the ballpark I spent considerable detail explaining. I don't care where inside the ballpark it is.
You gave an exaggerated upper limit, so extreme there's no ambiguity, of "entire repo".
I gave my own exaggerated upper limit, so extreme there's no ambiguity. And mine has examples of it actually happening. Incidents so extreme they're clear violations.
Maybe an analogy will help: The point at which a collection of sand grains becomes a heap is ambiguous. But when we have documented incidents involving a kilogram or more of sand in a conical shape, we can skip refining the threshold and simply declare that yes heaps are real. Incidents of major LLMs copying code, in a way that is full-on memorization and not just recreating things via chance and general code knowledge, are real.
You're the only person I've seen ever imply that true copying incidents are a statistical illusion, akin to a random die. Normally the debate is over how often and impactful they are, who is going to be held responsible, and what to do about them.
To recap, the original statement was, "Llm's do not verbatim disgorge chunks of the code they were trained on." We obviously both disagree with it.
While you keep trying to drag this toward an upper bound, I'm trying to illustrate that a coin with "//" reproduces a chunk of code. Again. I don't see much of a disagreement on that point either. What I continue to fail to elicit from you is the salient difference between the two.
I'm trying to find a scissor that distills your vibes into a consistent rule and each time it's the rebutted like I'm trying to make an argument. If your system doesn't have consistency, just say so.
I have a consistent rule. The rule is that if an LLM meets the threshold I set then it definitely violated copyright, and if it doesn't meet the threshold then we need more investigation.
We have proof of LLMs going over the threshold. So that answers the question.
Your illustrations are all in the "needs more investigation" area and they don't affect the conclusion.
We both agree that 1 token by itself is fine, and that some number is too many.
So why do you keep asking about that, as if it makes my argument inconsistent in some way? We both say the same thing!
We don't need to know the exact cutoff, or calculate how it varies. We only need to find violators that are over the cutoff.
How about you tell me what you want me to say? Do you want me to say my system is inconsistent? It's not. Having an area where the answer is unclear means the system is not able to answer every question, but it doesn't need to answer every question.
If you're accusing me of using "vibes" in a way that ruins things, then I counter that no I give nice specific and super-rare probabilities that are no more "vibes" based than your suggestion of an entire repo.
> What I continue to fail to elicit from you is the salient difference between the two.
Between what, "//" and the threshold I said?
The salient difference between the two is that one is too short to be copyright infringement and the other is so long and specific that it's definitely copyright infringement (when the source is an existing file under copyright without permission to copy). What more do you want?
Just like 1 grain of sand is definitely not a heap and 1kg of sand is definitely a heap.
If you ask me about 2, 3, 20 tokens my answer is I don't care and it doesn't matter and don't pretend it's relevant to the question of whether LLMs have been infringing copyright or not ("verbatim disgorge chunks").