I’d like to play devil’s advocate for a sec and ask this question: how is a company scraping information from publicly available sources to train AI models any different from companies scraping that same publicly available data and indexing it for search?
While the search model is helpful to us all, Google isn’t doing it out of the kindness of their hearts; they have a whole business model based on selling advertising using the information they have freely indexed. Yet very few people complain about search indexers crawling their data the way they complain about AI bots.
Again, just playing devil’s advocate for the sake of curiosity.
This is all true, with one key difference: search results (used to) point you to the actual source. LLMs answer you with that information as if they thought of it, with no attribution. So at least search results have a benefit for the source of indexed content.
I don’t know about all AI products, but I know that I use the Copilot sidebar built into Edge for work and school questions, and it always provides citations to the source information. In fact, if I ask a question for school and add to the prompt a request to cite all sources in APA format, it gives me everything I need in proper format.
Yeah, it’s useful, but double-check your sources and never hand in anything, even the citations, by just copying and pasting it without scrutiny. It can make up all kinds of bullshit, pretend cited works say something when they don’t, etc.
You don’t want it to hallucinate you in front of an academic ethics committee. Again, I’m not against using it, but never base anything on stuff it says; only base stuff on primary sources it helped you find.
Fully agree. Honestly, it’s why I like the Copilot branding Microsoft used. It is a Copilot, not the Captain. You still need to be in control and verify and scrutinize.
That’s not the same. In that case Copilot is also doing a search. They’re talking about the model itself.
You answered your own question. The search engine indexes your page to send traffic to you. The AI bot indexes your page to plagiarize your content.
Anecdotally, AI bots also routinely ignore sites’ robots.txt and spoof their user agents to try to hide what they’re doing. A lot of site owners are complaining about the costs of delivering content to web scrapers. Where search indexers might hit a site every day, some AI bots are running every hour and just wasting the site’s bandwidth.
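For anyone who hasn’t looked at how robots.txt works: it is purely advisory, so the crawler has to choose to check it. Here is a minimal sketch of what a well-behaved crawler might do before fetching a page, using Python’s standard `urllib.robotparser`. The robots.txt contents, the bot names (`ExampleAIBot`, `SomeSearchBot`), and the `may_fetch` helper are made up for illustration, not taken from any real site or crawler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one (made-up) AI crawler, allow everyone else.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""


def may_fetch(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines; a real crawler would download
    # the site's /robots.txt first instead of using a hard-coded string.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/cool-stuff"
    print("ExampleAIBot allowed:", may_fetch("ExampleAIBot", page))
    print("SomeSearchBot allowed:", may_fetch("SomeSearchBot", page))
```

Nothing enforces the answer. A scraper that skips this check, or sends a different user-agent string, gets the page anyway, which is exactly the behavior site owners are complaining about.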
There are credible allegations that the AI companies are not merely scraping publicly available resources, but are also consuming content in violation of the terms of use / copyright law. Like, a site has a robots.txt file that says “no scrapers” and they scrape it anyway. People would be mad about traditional search doing that as well.
Secondly, if a search service scrapes your site and then directs relevant users to it, that’s probably fine. Most websites want users to visit. A lot of AI stuff sucks up the content, and then the creators of that content get nothing. No users are sent there. The scraper hitting the site takes resources, and gives nothing back.
Google has also gotten some flak for putting stuff on their own site instead of sending users to the source. Like you do a search and get a snippet on the Google page, and you never click through to example.com/cool-stuff. Well, now the owner of example.com/cool-stuff doesn’t get the click. If they run ads, they get no credit. If they have metrics, they probably don’t see any visitors. If they have, like, forums, people are less likely to engage.
If the “AI Search” includes links back to the source, that’s not perfect either. One, it’s kind of excessive to use an LLM to parse text when the origin site is already there and readable. If I search for “population of london”, you can just send me to a census website or even Wikipedia. You don’t need to use a whole ass LLM. Two, as I touched on in the previous paragraph, users are less likely to click through if Google is putting the core of the information right there (even if it’s not always accurate). It’s still lessening traffic to the origin site, and traffic is often the lifeblood of websites.
Lastly, a lot of AI stuff is simply inaccurate or misleading. We’ve all laughed at the “use glue on your pizza” stuff or the “there are two Rs in ‘strawberry’” fuckups. If traditional search was really bad, like you typed in “cat food” and you got a webpage that was all jewelry and “buy gold” scams, you’d be annoyed, too. That’s more like how search was before old Google came about. There were a lot more low-effort “SEO” hacks like putting a bunch of keywords in tiny print to fool the search indexer. Now Google is the shitty old guard, but they have too much money and power to be easily replaced.
That’s just off the top of my head. Scraping for AI isn’t the same as scraping to make a searchable index.
You likely consented to search crawlers. You didn’t consent to having your site slammed by AI bots to regurgitate your site either privately or publicly.
If memory serves me correctly, nobody consented to the search indexes originally either; it took time for those guard rails to be put in place and respected. I would imagine that this new tech will undergo the same growing pains as guard rails get implemented.
Yeah but the difference is that search engines act in synergy while AI models usually extract value from the site. One is getting your woodworking shop in the phonebook without consent, the other is taking your lathe out the door.
+1 for a woodworking analogy. :)
Originally, you had to submit your website to search engines.
The plagiarism machine indexes for plagiarism.
I find people who “play devil’s advocate” just unnecessarily exhausting to the cause. If you have an opposing opinion, just say it; if not, then don’t. This is real life and not debate club.
Well, I was in a debate club so I suppose that is where it comes from.
Also, saying I have an opposing opinion is fine, if that is my actual stance. In this case it’s more that I can see both sides of the argument and would like to have a rounded discussion rather than a Reddit echo chamber.
I have to disagree. I often form opinions gradually over time as I learn about the issue, and playing devil’s advocate can help that process. If fewer people planted themselves in certain yay or nay camps, our conversations would be far more honest and productive. Devil’s advocate arguments can sometimes be like thought experiments to help us learn about and understand an issue.