It's a bit different because the AI is reading it with the intent of reproducing (certain aspects of) it for other people to later consume without visiting the original site. Fair use doctrine has long held that small pieces of copyrighted material can be reproduced, but the line is very blurry and generally has to be litigated if there's any ambiguity whatsoever. I'd bet many of the models we're using today will be pulled from public availability over copyright lawsuits in the coming years.
I don't think training on copyrighted stuff will ever be banned, but we need to figure out how much they can be allowed to generate based on it. Eventually new models will just pop up with more carefully curated data anyway.
> I don't think training on copyrighted stuff will ever be banned, but we need to figure out how much they can be allowed to generate based on it.
From a US copyright law point of view, this is most likely correct. Copyright law doesn't prevent you from ingesting copyrighted works, it prevents you from distributing them.
There is also a great deal of existing case law about how different a work has to be from another work before it's no longer infringing. There are existing rules of thumb judges go by when trying to determine whether infringement occurred. They include things like the degree of difference in expression, the quantity copied, whether or not the copying is incidental, etc.
And that's not even getting into the question of fair use -- which is a whole other kettle of fish.
I suspect that the courts will deal with these issues the way that they've always dealt with these issues: on a case-by-case basis.
Sure, but that would be illegal too. I'm saying it doesn't matter who reads your website, but everyone knows exactly what GPT and Bard are going to do with the information they're "learning" from it, so people are trying to block them from reading it in the first place.
Many LLMs will happily recite large segments of copyrighted material word-for-word, even though it can be difficult to tell what's happening "under the hood".
> It's a bit different because the AI is reading it with the intent of reproducing (certain aspects of) it for other people to later
It could be illegal if the AI reproduces vast portions of it. If you could, over the course of several prompts, get the LLM to generate a significant portion of the content (as copyright law defines it), then yes.
As long as the AI isn't reproducing it, then I am not sure if it would count.
Scale and position matter. Google is the conduit that connects most people to most websites, so in the EU they are considered a "gatekeeper" and need to be careful about conflicts of interest with the people and websites using their "gate". I hope American competition law catches up to the point where we can recognize that market makers simply should not be participating in the markets they make (and Google search is a market maker; it connects "buyers" [viewers or advertisers, depending on your perspective] to "sellers" [websites or viewers, respectively]), but I digress.
The point is that Google has a certain market position that makes it very different when they "recite the vague plot of a novel or a fact they learned". The point of competition law is to "distort" free market capitalism for the betterment of society. This is one of those cases where practical considerations trump information idealism. The quality of information on the internet will go down if we stop rewarding original publishers.