I fucked with the title a bit. What I linked to was actually a Mastodon post linking to the actual thing. But in my defense, I found it because Cory Doctorow boosted it, so, in a way, I am providing the original source here.
please argue. please do not remove.
Google scanned millions of books and made them available online. Courts ruled that was fair use because the purpose and interface of Google Books didn’t lend themselves to actually reading the books, just to searching them for information. If that is fair use, then I don’t see how training an LLM (which, at least in the vast majority of cases, doesn’t retain an exact copy of the training data) isn’t fair use. You aren’t going to get an argument from me.
I think most people who will disagree are reflexively anti-AI, and that’s fine. But I just haven’t heard a good argument that AI training isn’t fair use.
here’s a side-channel attack on your position: every use, even an infringing one, is fair use until adjudicated, because what fair use means is that a court has agreed that your infringing use is allowed. so of course ai training (broadly) is always fair use. but particular instances of ai training may be found not to be fair use, and so we can’t be sure that you are always going to be right (for the specific ai models that may come into question legally).
“It’s perfectly legal unless you get caught!”
Here’s another good one: https://www.eff.org/deeplinks/2023/04/how-we-think-about-copyright-and-ai-art-0
What constitutes fair use?
17 U.S.C. § 107
Notwithstanding the provisions of sections 17 U.S.C. § 106 and 17 U.S.C. § 106A, the fair use of a copyrighted work, including such use by reproduction in copies or phonorecords or by any other means specified by that section, for purposes such as criticism, comment, news reporting, teaching (including multiple copies for classroom use), scholarship, or research, is not an infringement of copyright.
GenAI training, at least regarding art, is neither criticism, comment, news reporting, scholarship, nor research.
AI training is done not by scientists but by engineers of a corporate entity with a long-term profit goal.
So, by elimination, we can conclude that none of the purposes covered by the fair use doctrine apply to Generative AI training.
Q.E.D.
“Such as” means that these are examples and not an exhaustive list.
Can you explain how the 3 factors you listed rule out a scholarship or research purpose? Regarding the first factor, how do you determine that AI developers are all engineers and never computer scientists?
it is pretty obviously scholarship and research
Sure, that can be fair use, but only if using them can also be fair use
Agreed. I would also argue that trained model weights are not copyrightable.
They aren’t.
Courts have already ruled that copyright requires human creation, and weights are not decided by humans but by the training algorithms.
I didn’t know it was already settled law. But in that case, why are models like llama still released under licenses? If they are non-copyrightable, licenses should be unenforceable and therefore irrelevant.
The license is related to access.
Basically it’s gated and not publicly available, and the only way to open the gate is to say “I promise not to do anything outside what you are limiting me to do.”
A second person that gets access without agreeing to that can use the weights however they want (what copyright would relate to), but the person who gave them access to the weights would have been in breach of their agreement.
So separate things with different scopes.