It's a bit different because the AI is reading it with the intent of reproducing (certain aspects of) it for other people to later consume without visiting the original site. Fair use doctrine has long held that small pieces of copyrighted material can be reproduced, but the line is very blurry and generally has to be litigated if there's any ambiguity whatsoever. I'd bet many of the models we're using today will be pulled from public availability over copyright lawsuits in the coming years.
I don't think training on copyrighted stuff will ever be banned, but we need to figure out how much they can be allowed to generate based on it. Eventually new models will just pop up with more carefully curated data anyway.
> I don't think training on copyrighted stuff will ever be banned, but we need to figure out how much they can be allowed to generate based on it.
From a US copyright law point of view, this is most likely correct. Copyright law doesn't prevent you from ingesting copyrighted works, it prevents you from distributing them.
There is also a great deal of existing case law about how different a work has to be from another work before it's no longer infringing. There are existing rules of thumb judges go by when trying to determine whether infringement occurred. They include things like the degree of difference in expression, the quantity copied, whether or not the copying is incidental, etc.
And that's not even getting into the question of fair use -- which is a whole other kettle of fish.
I suspect that the courts will deal with these issues the way that they've always dealt with these issues: on a case-by-case basis.
Sure, but that would be illegal too. I'm saying it doesn't matter who reads your website, but everyone knows exactly what GPT and Bard are going to do with the information they're "learning" from it, so people are trying to block them from reading it in the first place.
Many LLMs will happily recite large segments of copyrighted material word-for-word, even though it can be difficult to tell what's happening "under the hood".
> It's a bit different because the AI is reading it with the intent of reproducing (certain aspects of) it for other people to later
It could be illegal if the AI reproduces vast portions of it. If you could, over the course of several prompts, get the LLM to generate a significant portion of the content (as copyright law defines it), then yes.
As long as the AI isn't reproducing it, then I am not sure if it would count.
Scale and position matter. Google is the conduit that connects most people to most websites, so in the EU they are considered a "gatekeeper" and need to be careful about conflicts of interest with the people and websites using their "gate". I hope American competition law catches up to the point where we can recognize that market makers simply should not be participating in the markets they make (and Google search is a market maker; it connects "buyers" [viewers or advertisers, depending on your perspective] to "sellers" [websites or viewers, respectively]), but I digress.
The point is that Google has a certain market position that makes it very different when they "recite the vague plot of a novel or a fact they learned". The point of competition law is to "distort" free market capitalism for the betterment of society. This is one of those cases where practical considerations trump information idealism. The quality of information on the internet will go down if we stop rewarding original publishers.