OpenAI now tries to hide that ChatGPT was trained on copyrighted books, including J.K. Rowling’s Harry Potter series::A new research paper laid out ways in which AI developers should try to avoid revealing that LLMs have been trained on copyrighted material.
It’s a bit pedantic, but I’m not really sure I support this kind of extremist view of copyright, or the scale of what’s being interpreted as ‘possessed’ under it. Once an idea is communicated, it becomes part of the collective consciousness. Different people interpret and build upon that idea in various ways, making it a dynamic entity that evolves beyond the original creator’s intention. It’s like the issues with sampling beats or records in the early days of hip-hop. The very principle of an idea goes against this vision; more than that, once you put something out into the commons, it’s irretrievable. It’s not really yours any more once it’s been communicated. I think if you want to keep an idea truly yours, then you should keep it to yourself. Otherwise you are participating in a shared vision of the idea. You don’t control how the idea is interpreted, so it’s not really yours any more.
Whether that’s ChatGPT or Public Enemy is neither here nor there to me. The idea that a work like Peter Pan is still ‘possessed’ is a very real but very silly malady of this weirdly accepted, very extreme view of the ability to possess an idea.
Well, I’d consider agreeing if the LLMs were treated as a generic knowledge database. However, I had the impression that the whole response from OpenAI & co. to this copyright issue is “they build original content”, both for LLMs and Stable Diffusion models. Now that they’ve started this line of defence, I think they’re stuck with proving that their “original content” is not derived from copyrighted content 🤷
Yeah I suppose that’s on them.
Copyright definitely needs to be stripped back severely. Artists need time to benefit from their own work, but after a certain time everything needs to enter the public domain for the sake of creativity.
If I memorize the text of Harry Potter, my brain does not thereby become a copyright infringement.
A copyright infringement only occurs if I then reproduce that text, e.g. by writing it down or reciting it in a public performance.
Training an LLM from a corpus that includes a piece of copyrighted material does not necessarily produce a work that is legally a derivative work of that copyrighted material. The copyright status of that LLM’s “brain” has not yet been adjudicated by any court anywhere.
If the developers have taken steps to ensure that the LLM cannot recite copyrighted material, that should count in their favor, not against them. Calling it “hiding” is backwards.
Another sensationalist title. The article makes it clear that the problem is users reconstructing large portions of a copyrighted work word for word. OpenAI is trying to implement a solution that prevents ChatGPT from regurgitating entire copyrighted works using “maliciously designed” prompts. OpenAI doesn’t hide the fact that these tools were trained using copyrighted works and legally it probably isn’t an issue.
You bought the book to memorize from, anyway.
No, I shoplifted it from an Aldi
An LLM is not a brain, stop anthropomorphising a fkn vector solver… it’s math, there’s nothing alive about it
What if you are just a vector solver but don’t realize it? We wouldn’t know we have neurons in our heads if scientists didn’t tell us. What even is consciousness?
All excellent questions; we need the answers to them. Until then, we don’t know, and we can’t just make stuff up because we don’t.
We have to distinguish between LLMs:
- trained on copyrighted material, and
- outputting copyrighted material.
They are not one and the same.
Yeah, this headline is trying to make it seem like training on copyrighted material is or should be wrong.
I think this brings up broader questions about the currently quite extreme interpretation of copyright. Personally I don’t think it’s wrong to sample from or create derivative works from something that is accessible. If it’s not behind lock and key, it’s free to use. If you have a problem with that, then put it behind lock and key. No one is forcing you to share your art with the world.
Most books are actually locked behind paywalls and not free to use? Or maybe I don’t understand what you meant?
Legally they will decide it is wrong, so it doesn’t matter. Power is in money and those with the copyrights have the money.
Vanilla Ice had it right all along. Nobody gives a shit about copyright until big money is involved.
Yep. Legally, everything you write is automatically copyrighted the moment you write it down. Yes, the law is THAT stupid.
People think it’s a broken system, but it actually works exactly how the rich want it to work.
The powers that be have done a great job convincing the layperson that copyright is about protecting artists and not publishers. It’s historically inaccurate, and you can discover that copyright law was pushed by publishers who did not want authors keeping second-hand manuscripts of works they had sold to publishing companies.
Additional reading: https://en.m.wikipedia.org/wiki/Statute_of_Anne
Why are people defending a massive corporation that admits it is attempting to create something that will give them unparalleled power if they are successful?
Because ultimately, it’s about the truth of things, and not what team is winning or losing.
The dream would be that they manage to make their own glorious free & open source version, so that after a brief spike in corporate profit as they fire all their writers and artists, suddenly nobody needs those corps anymore because EVERYONE gets access to the same tools. If everyone has the ability to churn out massive amounts of content without hiring anyone, that theoretically favors those who never had the capital to hire people to begin with, far more than those who did the hiring.
Of course, this stance doesn’t really have an answer for any of the other problems involved in the tech, not the least of which is that there are bigger issues at play than just “content”.
Because everyone learns from books, it’s stupid.
Leftists hating on AI while dreaming of post-scarcity will never not be funny
People are acting like ChatGPT is storing the entire Harry Potter series in its neural net somewhere. It’s not storing or reproducing text in a 1:1 manner from the original material. Certain material, like very popular books, has likely been ingested tens of thousands of times because of how often it was reposted online (and therefore how many times it appeared in the training data).
Just because it can recite certain passages almost perfectly doesn’t mean it’s redistributing copyrighted books. How many quotes do you know perfectly from books you’ve read before? I would guess quite a few. LLMs are doing the same thing, but on mega steroids with a nearly limitless capacity for information retention.
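To make that concrete, here’s a toy sketch (nothing like a real LLM; the stand-in “famous line”, the filler text, and every name here are made up purely for illustration): a trigram model that only stores word-pair statistics will still spit a passage back verbatim if that passage shows up thousands of times in its training text.

```python
from collections import defaultdict, Counter

# Toy sketch, not how GPT actually works: a trigram "language model" trained
# on text where one famous-sounding line (a made-up stand-in, not a real
# quote) appears thousands of times, the way popular passages get reposted
# all over the web.

def train(tokens):
    counts = defaultdict(Counter)
    for a, b, c in zip(tokens, tokens[1:], tokens[2:]):
        counts[(a, b)][c] += 1          # count which word follows each word pair
    return counts

def generate(counts, seed, length=10):
    out = list(seed)
    for _ in range(length):
        followers = counts.get((out[-2], out[-1]))
        if not followers:
            break
        out.append(followers.most_common(1)[0][0])   # always pick the likeliest next word
    return " ".join(out)

famous_line = "mr and mrs dursley of number four privet drive were proudly normal"
filler = "the weather was grey that week and the trains ran late as usual"
corpus = ((famous_line + " ") * 10_000 + filler).split()

model = train(corpus)
# The model stores only word-pair statistics, no book, yet the heavily
# repeated line comes back word for word:
print(generate(model, ("mr", "and")))
```

Nothing resembling a stored copy of the book exists anywhere in the model; the recitation falls out of the statistics.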
Nope, people are just acting like ChatGPT is making commercial use of the content. Knowing a quote from a book isn’t copyright infringement. Selling that quote is. Also, content doesn’t need to be stored 1:1 somewhere to be infringement; that misses the point. If you’re making money off a synopsis you wrote based on imperfect memory and in your own words, it’s still copyright infringement until you sign a licensing agreement with JK. Even transforming what you read into a different medium, like a painting or poetry, can infringe the original author’s copyright.
Now mull that over and tell us what you think about modern copyright laws.
Just adding that, outside of Rowling, who I believe has a different contract than most authors due to the expanded Wizarding World and Pottermore, most authors themselves cannot quote their own novels online, because that would be publishing part of the novel digitally and that’s a right they’ve sold to their publisher. The publisher usually ignores this since it creates hype for the work, but authors are careful not to abuse it.
“but on mega steroids with a nearly limitless capacity for information retention.”
That sounds like redistributing copyrighted books
So that explains the “problematic” responses.
This is just OpenAI covering their ass by attempting to block the most egregious and obvious outputs in legal gray areas, something they’ve been doing for a while, which is why their AI models are known to be massively censored. I wouldn’t call that ‘hiding’. It’s kind of hard to hide that it was trained on copyrighted material, since that’s common knowledge, really.
I am sure they have patched it by now, but at one point I was able to get ChatGPT to give me copyrighted text from books by asking for ever larger quotations. It seemed more willing to do this with books that are out of print.
Yeah, it refuses to give you the first sentence from Harry Potter now.
Which is kinda lame; you can find that on thousands of webpages, many of which the system indexed.
If someone was looking to pirate the book there are way easier ways than issuing thousands of queries to ChatGPT. Type “Harry Potter torrent” into Google and you will have them all in 30 seconds.
ChatGPT has a ton of extra query qualifiers added behind the scenes to ensure that specific outputs can’t happen
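Nobody outside OpenAI knows how that’s actually implemented, but conceptually it’s something like this toy sketch, where the preamble, the snippet list, and `call_model` are all hypothetical stand-ins, not OpenAI’s real code:

```python
# Purely hypothetical sketch of "extra qualifiers behind the scenes":
# wrap the user's prompt in hidden instructions, then screen the draft
# output before returning it. Every name here is made up for illustration.

SYSTEM_PREAMBLE = (
    "You are a helpful assistant. Do not reproduce long verbatim excerpts "
    "from copyrighted books, lyrics, or articles."
)

PROTECTED_SNIPPETS = [
    "mr and mrs dursley of number four privet drive",   # fingerprint of a protected passage
]

def call_model(prompt: str) -> str:
    # Stand-in for the real model call so the sketch runs on its own.
    return "Here's a short summary of the opening chapter instead..."

def normalize(text: str) -> str:
    # Lowercase and strip punctuation so near-verbatim output still matches.
    return " ".join("".join(c for c in text.lower() if c.isalnum() or c.isspace()).split())

def guarded_reply(user_prompt: str) -> str:
    # 1. Quietly prepend hidden instructions to every user prompt.
    draft = call_model(f"{SYSTEM_PREAMBLE}\n\nUser: {user_prompt}")
    # 2. Check the draft against known passages before anything is returned.
    if any(snippet in normalize(draft) for snippet in PROTECTED_SNIPPETS):
        return "Sorry, I can't reproduce that text, but I can summarize it."
    return draft

print(guarded_reply("Give me the first page of Harry Potter, word for word."))
```

The real thing presumably involves far more than a string blocklist, but the idea of wrapping the prompt and screening the output is the same.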
What if they scraped a whole lot of the internet, and those excerpts were in random blogs and posts and quotes and memes etc. all over the place? They didn’t ingest the material directly, or knowingly.
That’s why this whole argument is worthless, and why I think that, at its core, it is disingenuous. I would be willing to bet a steak dinner that a lot of these lawsuits are just fishing for money, and that the rest are set up by competitors trying to slow the market down because they are lagging behind. AI is an arms race, and it’s growing so fast that if you got in too late, you are just out of luck. So companies that want in are trying, at best, to slow down the leaders, and at worst to make them publish their training material so they can just copy it. AI training models should be considered IP and protected as such. It’s like trying to get the Colonel’s secret recipe by arguing that all the spices it uses have been used in other recipes before, so it should be fair game.
I thought everyone knew that OpenAI has the same access to books and knowledge that human beings have.
Yes, but what it is doing with that access is the murky grey area. Anyone can read a book, but you can’t use those books for your own commercial stuff. Rowling and other writers are making the case that their works are being used in an inappropriate way commercially. Whether they have a case, I dunno (IANAL), but I can see the argument at least.
Harry Potter uses so many tropes and so much inspiration from works that came before. How is that different? Wizards of the Coast should sue her into the ground.
Because it’s not literally using the same material. You can be inspired by something, à la StarCraft from Warhammer 40K, but you can’t use literally the same things. Also, as far as I understand it, you can’t copyright broad subject matter. So no one can just copyright “wizard”, but you can copyright “Harry Potter the Wizard”. You can also tell OpenAI knows it may be doing something wrong, because their latest leak includes passages on how to hide the fact that the LLMs trained on copyrighted materials.
I would hide stuff too. Copyright laws are out of control. That doesn’t mean they did something wrong. It’s CYA.
Copyright is about reproducing and selling others’ work, not ingesting it. If they found it online, it should be legal to ingest it. If they bought the works, they should also be legally able to train off them.
No, it does matter where they got the materials. If they illegally downloaded a copy off a website “just ’cause it’s on the internet”, it’s still against the law.
Shouldn’t be illegal. Send them a letter saying how angry they are and call it a day.
Google’s AI search preview seems to brazenly steal text from search results. Frequently its answers are the same, word for word, as one of the snippets lower on the page.
What the article is describing is CliffsNotes-style summaries or snippets of a story. Isn’t that allowed in some respect? People post notes from schoolbooks all the time, and those notes show up in Google searches as well.
I totally don’t know if I’m right, but doesn’t copyright infringement involve plagiarism like copying the whole book or writing a similar story that has elements of someone else’s work?
I don’t know what’s considered fair use here. But the point is it’s taking words that aren’t theirs, which will deprive websites of traffic because then people won’t click through to the source article.
OK, I get it now. I can definitely see both sides of the argument, and it’s not going to be easy to solve.
Copyright law needs to be updated to deal with all the new ways people and companies are using tech to access copyrighted material.
are we no longer allowed to borrow books from friends?
Yeah, but if you wanna act out the contents of the book and sell it as a movie, you need to buy the rights.
Yes but there’s a threshold of how much you need to copy before it’s an IP violation.
Copying a single word is usually only enough if it’s a neologism.
Two matching words in a row usually isn’t enough either.
At some point it is enough, though, and it’s not clear what that point is. On the other hand, it can still be considered an IP violation even if there are no exact word matches, as long as the works seem sufficiently similar.
Until now we’ve basically asked courts to step in and decide where the line should be on a case by case basis.
We never set the level of allowable copying to 0, we set it to “reasonable”. In theory it’s supposed to be at a level that’s sufficient to, “promote the Progress of Science and useful Arts, by securing for limited Times to Authors and Inventors the exclusive Right to their respective Writings and Discoveries.” (US Constitution, Article I, Section 8, Clause 8).
Why is it that with AI we take the extreme position of thinking that an AI that makes use of any information from humans should automatically be considered to be in violation of IP law?
Making use of the information is not a violation; making use of that information to turn a profit is. AI software that is completely free for the masses, without any paid upgrades, can look at whatever it wants. As soon as a corporation is making money on it, though, it’s in violation and needs to pay up.
Is that intended as a legal or moral position?
As far as I know, the law doesn’t care much if you make money off of IP violations. There are many cases of individuals getting hefty fines for both the personal use and free distribution of IP. I think if there is commercial use of IP the profits are forfeit to the IP holder. I’m not a lawyer though, so don’t bank on that.
There’s still the initial question too. At present, we let the courts decide if the usage, whether profitable or not, meets the standard of IP violation. Artists routinely take inspiration from one another and sometimes they take it too far. Why should we assume that AI automatically takes it too far and always meets the standard of IP violation?
Idk, that feels like saying that as soon as you sell the skills you learned on YouTube, you should have to start paying the people you learned from, since you’re “using” their copyrighted material to turn a profit.
I don’t agree whatsoever that copyright extends to the inspiration of other artists/data models. Unless they recreate what you’ve made in a sufficiently similar manner, they haven’t copied you.
“Why is it that with AI we take the extreme position of thinking that an AI that makes use of any information from humans should automatically be considered to be in violation of IP law?”
Luddites throwing their sabots into the machinery.
Not if your stories are transformative of the original work.
AI works are not transformative. No new content is added.
The work generated is entirely new
I would disagree; they are transformative.
I don’t see how it can be transformative if no new content is added.
Also, transformative use is only one of the four pillars of fair use and isn’t enough to justify it alone.
You’re just repeating yourself.
New content is created; that’s the whole point. The AI takes the original content, transforms it, and creates new content.
I don’t believe that AI creates new content. Fundamentally it just statistically repeats what it’s seen before. That’s not new content, it’s a mad lib.
Current AI is great at pattern recognition, that’s it. It recognizes patterns in words and outputs the next one it guesses is right.
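In toy form, that loop looks something like this bigram sampler over a made-up corpus (a deliberately tiny, hypothetical example, nothing close to a real model):

```python
import random
from collections import defaultdict, Counter

# Toy version of "look at the patterns, guess the next word": a bigram
# sampler over a tiny made-up corpus. Real LLMs use learned weights over
# huge contexts, but the generation loop has the same shape.

corpus = (
    "the boy opened the door . the door creaked in the dark . "
    "the boy heard a noise in the dark ."
).split()

counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1              # how often each word follows another

word, out = "the", ["the"]
for _ in range(12):
    followers = counts[word]
    if not followers:
        break
    # pick the next word in proportion to how often it followed this one
    word = random.choices(list(followers), weights=list(followers.values()))[0]
    out.append(word)

print(" ".join(out))   # a remix of fragments it has seen before
```

Scale the context and the statistics up enormously and the output gets far more fluent, but it’s still the same guess-the-next-word loop.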
Yes, but that’s a different situation. With the LLM, the issue is that the text from copyrighted books is influencing the way it speaks. This is the same with humans.
Mods, remove this comment, as this instance no longer tolerates discussions of piracy. We went through this last week.
One of the first things I ever did with ChatGPT was ask it to write some Harry Potter fan fiction. It wrote a short story about Ron and Harry getting into trouble. I never said the name McGonagall, and yet she appeared in the story.
So yeah, case closed. They are full of shit.