Your Content and AI Training

It’s very late in coming, but we’re slowly getting the means to indicate that our site content should not be used in training sets for generative AI models.
It’s still unclear whether existing web crawlers were used to train current LLMs or if these restrictions will only apply to future models. It’s also worth noting that while there are many AI companies, only one has provided a clear opt-out so far.
In most cases, stopping an AI company from crawling your site requires the knowledge and ability to update your robots.txt file.
You can learn more about that here:
- What is robots.txt? (Cloudflare)
- Robots.txt (Wikipedia)
- Robots.txt (Mozilla)
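If you want to sanity-check your rules before (or after) deploying them, Python's standard urllib.robotparser module can parse a robots.txt body directly. A quick sketch (the rules shown are the GPTBot example covered below):

```python
from urllib.robotparser import RobotFileParser

# Parse a robots.txt body directly (no network fetch needed)
rules = [
    "User-agent: GPTBot",
    "Disallow: /",
]
parser = RobotFileParser()
parser.parse(rules)

# GPTBot is blocked everywhere; crawlers not named in the rules are unaffected
print(parser.can_fetch("GPTBot", "https://example.com/any-page"))  # False
print(parser.can_fetch("SomeOtherBot", "https://example.com/"))    # True
```

Note that robots.txt is purely advisory: it only works for crawlers that choose to honor it.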
I’ve listed all the known methods I’ve come across and will try to update this list as new information becomes available.

GPTBot
GPTBot is OpenAI’s web crawler. You can update your website’s robots.txt file to exclude your site from being crawled:
User-agent: GPTBot
Disallow: /
Alternatively, you can block the IP ranges used by the crawler:
// Current as of August 2023
20.15.240.64/28
20.15.240.80/28
20.15.240.96/28
20.15.240.176/28
20.15.241.0/28
20.15.242.128/28
20.15.242.144/28
20.15.242.192/28
40.83.2.64/28
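If you'd rather filter these at the application layer instead of (or in addition to) a firewall rule, here's a minimal sketch using only Python's standard ipaddress module to check whether a request's IP falls inside any of the ranges above (the helper name is mine; the ranges are OpenAI's published list as of August 2023):

```python
import ipaddress

# CIDR ranges published for GPTBot, current as of August 2023
GPTBOT_RANGES = [ipaddress.ip_network(cidr) for cidr in [
    "20.15.240.64/28", "20.15.240.80/28", "20.15.240.96/28",
    "20.15.240.176/28", "20.15.241.0/28", "20.15.242.128/28",
    "20.15.242.144/28", "20.15.242.192/28", "40.83.2.64/28",
]]

def is_gptbot_ip(address: str) -> bool:
    """Return True if the address falls inside any published GPTBot range."""
    ip = ipaddress.ip_address(address)
    return any(ip in net for net in GPTBOT_RANGES)

print(is_gptbot_ip("20.15.240.70"))  # True: inside 20.15.240.64/28
print(is_gptbot_ip("203.0.113.5"))   # False: documentation address, not a GPTBot range
```

Since the published ranges can change, you'd want to refresh this list periodically rather than hard-code it long-term.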
Sources:
- GPTBot (OpenAI)
- GPTBot IP Ranges (OpenAI)
CCBot
CCBot is operated by Common Crawl, a non-profit organization, so you might not want to block it right away. Their goal is to provide an open repository of web crawl data that is universally accessible and analyzable.
This is a noble effort, but the data is also available for AI companies to use.
User-agent: CCBot
Disallow: /
Source:
- FAQ (Common Crawl)
GoogleBot
You cannot block Google from using your content for its AI efforts, such as Bard, without blocking your site from Google’s search – so you will likely not want to do this.
However, it can still be done:
User-agent: Googlebot
Disallow: /