Your Content and AI Training

It’s been a long time coming, but we’re slowly getting the means to indicate that our site content should not be used in training sets for generative AI models.

It’s still unclear whether existing web crawlers were used to train current LLMs or if these restrictions will only apply to future models. It’s also worth noting that while there are many AI companies, only one has provided a clear opt-out so far.

In most cases, stopping an AI company from crawling your site requires the knowledge and ability to update your robots.txt file.

I’ve listed all the known methods I’ve come across below, and I’ll try to keep this post updated as new information becomes available.


GPTBot

GPTBot is OpenAI’s web crawler. You can update your website’s robots.txt file to exclude your site from being crawled:

User-agent: GPTBot
Disallow: /
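If you want to sanity-check the rule before deploying it, Python’s standard library ships a robots.txt parser. This is a quick local check, not an official OpenAI tool; the URLs are placeholders:

```python
# Verify the robots.txt rule locally with Python's built-in parser.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: GPTBot",
    "Disallow: /",
])

# GPTBot is disallowed everywhere; crawlers without a matching
# entry fall back to the default, which is "allowed".
print(rp.can_fetch("GPTBot", "https://example.com/some-page"))
print(rp.can_fetch("SomeOtherBot", "https://example.com/some-page"))
```

Note that robots.txt is purely advisory: it only works if the crawler chooses to honor it.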

Alternatively, you can block the IP ranges used by the crawler:

// Current as of August 2023
20.15.240.64/28
20.15.240.80/28
20.15.240.96/28
20.15.240.176/28
20.15.241.0/28
20.15.242.128/28
20.15.242.144/28
20.15.242.192/28
40.83.2.64/28
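In practice you’d add these ranges to your firewall or server config, but the matching logic is simple enough to sketch with Python’s standard ipaddress module. The ranges below are the published GPTBot ranges from above; `is_gptbot_ip` is a hypothetical helper name:

```python
# Check whether an incoming request IP falls in a blocked CIDR range.
import ipaddress

# OpenAI's published GPTBot ranges, current as of August 2023.
GPTBOT_RANGES = [ipaddress.ip_network(cidr) for cidr in [
    "20.15.240.64/28", "20.15.240.80/28", "20.15.240.96/28",
    "20.15.240.176/28", "20.15.241.0/28", "20.15.242.128/28",
    "20.15.242.144/28", "20.15.242.192/28", "40.83.2.64/28",
]]

def is_gptbot_ip(addr: str) -> bool:
    """Return True if the address falls inside any published GPTBot range."""
    ip = ipaddress.ip_address(addr)
    return any(ip in network for network in GPTBOT_RANGES)

print(is_gptbot_ip("20.15.240.70"))  # True: inside 20.15.240.64/28
print(is_gptbot_ip("192.0.2.1"))     # False: not in any listed range
```

Keep in mind that IP blocking needs ongoing maintenance: if OpenAI adds or changes ranges, the list goes stale.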

CCBot

CCBot is operated by Common Crawl, a non-profit organization, and you might not want to block it right away. Their goal is to provide an open repository of web crawl data that is universally accessible and analyzable.

It’s a noble effort, but that data is also available for AI companies to use.

User-agent: CCBot
Disallow: /

Source:

  • FAQ (Common Crawl)

GoogleBot

You cannot block Google from using your content for its AI efforts, such as Bard, without also blocking your site from Google Search, so you likely won’t want to do this.

However, it can still be done:

User-agent: Googlebot
Disallow: /
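For reference, a single robots.txt that combines the opt-outs above, blocking GPTBot and CCBot while leaving Googlebot (and search indexing) untouched, would look like this:

```
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /
```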