Your Content and AI Training


It’s very late in coming, but we’re slowly getting the means to indicate that our site content should not be used in training sets for LLMs in generative AI.

It is still unclear whether these web crawlers were used to train current LLMs or whether the opt-out will apply only to future LLMs. Note that there are many AI companies, but only one is providing an opt-out so far.

In most cases, preventing an AI company from pulling down your site’s content means having the knowledge and ability to update the robots.txt file.

You can learn more about that here:

I’ve listed all the known methods I’ve come across and will try to update this list as information changes or becomes available.


GPTBot

GPTBot is OpenAI’s web crawler. You can update your web site’s robots.txt file to exclude your site from being crawled.

User-agent: GPTBot
Disallow: /
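
If you want to confirm that a rule like this does what you expect before deploying it, here is a minimal sketch using Python’s standard urllib.robotparser module. The helper name is_allowed, the example.com URL and the "SomeOtherBot" agent are placeholders of mine, not anything from OpenAI. It only shows how a parser that follows robots.txt conventions reads the rule; whether a given crawler actually honors it is up to that crawler.

```python
from urllib.robotparser import RobotFileParser

# The rule from above, exactly as it would appear in robots.txt.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Hypothetical page on a placeholder domain.
    page = "https://example.com/blog/some-post"
    print("GPTBot allowed:", is_allowed(ROBOTS_TXT, "GPTBot", page))
    print("SomeOtherBot allowed:", is_allowed(ROBOTS_TXT, "SomeOtherBot", page))
```

With only the GPTBot rule in place, the check should report GPTBot as blocked and leave every other user agent unaffected.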

Alternatively, you can also block the IP ranges used by the web crawler:

// Current as of August 2023
20.15.240.64/28
20.15.240.80/28
20.15.240.96/28
20.15.240.176/28
20.15.241.0/28
20.15.242.128/28
20.15.242.144/28
20.15.242.192/28
40.83.2.64/28
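
How you enforce an IP block depends on your server or firewall, which I won’t cover here. As a rough sketch, though, this is how you could check in Python whether an address from your access logs falls inside one of the ranges above, using the standard ipaddress module. The function name is_gptbot_ip is my own, and the list is only as current as the August 2023 ranges shown above.

```python
import ipaddress

# Ranges copied from the list above (current as of August 2023; may be stale).
GPTBOT_RANGES = [
    "20.15.240.64/28",
    "20.15.240.80/28",
    "20.15.240.96/28",
    "20.15.240.176/28",
    "20.15.241.0/28",
    "20.15.242.128/28",
    "20.15.242.144/28",
    "20.15.242.192/28",
    "40.83.2.64/28",
]

# Parse each CIDR string once so lookups are cheap.
NETWORKS = [ipaddress.ip_network(cidr) for cidr in GPTBOT_RANGES]


def is_gptbot_ip(address: str) -> bool:
    """Return True if address falls inside any of the listed ranges."""
    ip = ipaddress.ip_address(address)
    return any(ip in network for network in NETWORKS)


if __name__ == "__main__":
    # Sample addresses: the first sits inside a listed range, the second does not.
    for candidate in ("20.15.240.65", "8.8.8.8"):
        print(candidate, is_gptbot_ip(candidate))
```

Each /28 range covers only 16 addresses, so the list is small enough to check linearly.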

Sources:

CCBot

CCBot is operated by a non-profit organization called Common Crawl, and you might not want to block it right away. Their goal is to provide an open repository of web crawl data that is universally accessible and analyzable.

This is a very noble effort, but the data is also available for AI companies to use.

User-agent: CCBot
Disallow: /

Source:

  • FAQ (Common Crawl)

GoogleBot

You cannot block Google from using your content for its AI efforts, such as Bard, without blocking your site from Google’s search – so you will likely not want to do this.

However, it can still be done:

User-agent: Googlebot
Disallow: /