© Copyright Acquisition International 2024 - All Rights Reserved.

Posted 15th April 2024

Should You Block AI Bots from Crawling Your Website?

Did you know AIs like ChatGPT could be crawling your site for data? The rise of AI large language models (LLMs) like ChatGPT and Bard (now called Gemini) has raised a question for businesses: should you block or allow AI bots like ChatGPT’s GPTBot to crawl your site? As AI is a rapidly developing technology, it’s a question businesses might not have thought they would need to consider in 2024, but one that should be high on the agenda. With sites like Amazon choosing to block ChatGPT, it’s clear that we should all be considering whether this is the right move.


According to SEO and PPC agency MRS Digital, businesses should make a careful decision on whether they want to block AI or not. On the one hand, blocking AI could help prevent risks, such as your content being unintentionally misrepresented. On the other, could you be missing a world of opportunity presented by this seemingly unstoppable technology shift? 

Quick recap: What are crawlers?  

A crawler is essentially a tool that is typically operated by search engines like Google or Bing to review your website and index data from it, like the content you’ve written and information about your company, ensuring your website appears in search results. It’s how search engines like Google discover and understand your site, so the concept of a crawler is nothing new. As a website owner you can decide which parts of your website you want crawlers to be able to view and index in search results by making use of robots.txt files.  
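As a rough illustration of how a well-behaved crawler consults robots.txt before fetching a page, the sketch below uses Python’s standard urllib.robotparser module. The rules, the ExampleBot name and the example.com URLs are all hypothetical, and nothing technically forces a crawler to honour the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: every crawler may read the site except /admin/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
"""


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the rules in robots_txt let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # ExampleBot is a made-up crawler name used only for this demonstration.
    for url in ("https://example.com/blog/post", "https://example.com/admin/login"):
        print(url, may_fetch(ROBOTS_TXT, "ExampleBot", url))
```

A compliant crawler would run a check like this for every URL it plans to visit and skip the ones that come back as not allowed.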
 
AI crawlers use the same technology, but instead of simply indexing your website data, they review the information on your site and can use it to train their own technology (large language models).

What AI chatbots could crawl your site?  

While most people will have heard of ChatGPT and Bard (now called Gemini), there are other lesser-known AI crawlers out there.  

The main AI crawlers to be aware of are:

  • ChatGPT-User. This is used by ChatGPT when a user on GPT-4 directs the bot to your site in a prompt like “tell me how many times [SITE URL] mentions AI”.
  • GPTBot. This is OpenAI’s crawler, which collects data from your site to use as training data for its AI models.
  • Google-Extended. This is how Google gathers data for its AI products, including Gemini (previously called Bard), its AI chatbot.
  • Anthropic-AI. Anthropic has a range of AI tools, including Claude, its AI chatbot, and its crawler collects the data for this.
  • CCBot. This is the Common Crawl bot, and Common Crawl data was used to train GPT-3. It’s designed to make web data accessible to everyone, without any fees.
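To see whether any of these crawlers are already visiting, one option is to look for their names in the User-Agent header recorded in your server logs. The sketch below is a minimal example under stated assumptions: the tokens are written as they are commonly published and should be checked against each operator’s documentation, the sample header strings are illustrative, and Google-Extended is left out because it is a robots.txt control token rather than a separate request header.

```python
from typing import Optional

# Assumed User-Agent substrings for the crawlers listed above (lower-cased).
# Verify them against each operator's documentation before relying on them.
AI_CRAWLER_TOKENS = {
    "chatgpt-user": "ChatGPT-User",
    "gptbot": "GPTBot",
    "anthropic-ai": "Anthropic-AI",
    "ccbot": "CCBot",
}


def identify_ai_crawler(user_agent: str) -> Optional[str]:
    """Return the crawler name if the header contains a known token, else None."""
    lowered = user_agent.lower()
    for token, name in AI_CRAWLER_TOKENS.items():
        if token in lowered:
            return name
    return None


if __name__ == "__main__":
    # Illustrative header values, not copied from real traffic.
    samples = [
        "Mozilla/5.0 (compatible; GPTBot/1.0)",
        "CCBot/2.0",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    ]
    for sample in samples:
        print(sample, "->", identify_ai_crawler(sample))
```

Bear in mind that a User-Agent header is self-reported, so this only identifies crawlers that announce themselves honestly.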

Why would you block AI crawlers? 

Blocking AI crawlers might be the right decision for you, especially if you’re concerned about your content being misrepresented, or your site is in development.  

Misrepresented content 

When humans write content, we write with nuance, and there may be cultural or business context that means what you write makes sense to a specific audience. When your content is taken out of that context and used to form part of an AI chatbot’s answer, it will most likely lose that nuance, and the point your content made may be lost or misrepresented entirely. For some companies, that isn’t a risk they want to run, and so they block AI crawlers to prevent it. For example, if you were a medical company that had specific advice pertaining to one of your products, you wouldn’t want an AI to take that out of context and apply it to an unrelated product or medical query.

Unwanted association 

As AI crawlers tend to take sections of information from varying websites without always understanding the context of that information, there is a risk that your information may be presented next to additional sources that your business doesn’t want to be associated with. If this is the case, then you may want to block any AI crawler. This will stop your company from being mixed in with competitors, or those in your industry who may not uphold best practice. For some companies where reputation management is an issue, or has been historically, this could be a very strong argument.

Data Scraping 

It’s best practice to block any crawlers from viewing parts of your site you don’t want them to see. For example, you might have a staff wellbeing portal on your intranet or customer logins on your website. You don’t want these crawled as they contain personally identifiable information, something your customers or employees definitely don’t want an AI company to have! OpenAI says that the GPTBot is “filtered to remove sources that require paywall access, are known to primarily aggregate personally identifiable information (PII), or have text that violates our policies.” Most websites will already have these areas blocked for search crawlers, so it’s worth speaking to your hosting provider or SEO team to see if they can add rules for AI crawlers too.

Spam Generation 

As technology evolves, so do cybercriminals. We’re now seeing more sophisticated phishing emails and malicious links being sent thanks to AI-generated content. By combining AI-powered chatbots like ChatGPT with data harvested from your site, malicious actors can create spam emails that are more realistic than ever and that more closely imitate your employees or the company itself. This could ultimately lead to more successful phishing attempts, which can cause financial and reputational loss for your business.

How to block the ChatGPT crawler: 

1. robots.txt 

A robots.txt file will most likely already be present on your site. It’s simply a matter of updating it to exclude the pages you want to block any AI crawler from viewing. Doing this tells compliant crawlers to stay away from sensitive data and content that should not be public knowledge, although robots.txt is a request rather than an enforced barrier. A misconfigured robots.txt file can cause your site to no longer be seen by Google and other search engines, so it’s best to proceed with caution here. If you have an SEO agency, check in with them before you do this as they may be able to help you.
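One way to reduce the risk of accidentally hiding your site from search engines is to test the updated rules before publishing them. The sketch below appends a block for each AI crawler to a hypothetical existing robots.txt and then checks that Googlebot and Bingbot can still reach the site. The token spellings, the existing rules and the example.com URLs are assumptions for illustration, not a definitive configuration.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt tokens for the crawlers this article names; check each
# operator's documentation before relying on them.
AI_CRAWLER_TOKENS = ["GPTBot", "ChatGPT-User", "Google-Extended", "anthropic-ai", "CCBot"]


def add_ai_blocks(existing_rules: str, tokens=AI_CRAWLER_TOKENS) -> str:
    """Append one 'Disallow: /' group per AI crawler to existing robots.txt text."""
    groups = [f"User-agent: {token}\nDisallow: /\n" for token in tokens]
    return existing_rules.rstrip("\n") + "\n\n" + "\n".join(groups)


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the rules in robots_txt let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical existing file: all crawlers are kept out of /admin/ only.
EXISTING = "User-agent: *\nDisallow: /admin/\n"
UPDATED = add_ai_blocks(EXISTING)

if __name__ == "__main__":
    print(UPDATED)
    # Search engine crawlers should still be allowed; AI crawlers should not.
    for agent in ["Googlebot", "Bingbot", *AI_CRAWLER_TOKENS]:
        print(agent, is_allowed(UPDATED, agent, "https://example.com/blog/"))
```

If a search engine crawler comes back as blocked in a check like this, the file needs another look before it goes live.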

You can also ask crawlers to crawl at a certain speed. If you want them to crawl some, but not all, of your site, for example everything except admin areas, this is possible too. Different businesses will have varying reasons to block crawlers, or not block them at all.
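A partial block with a crawl speed might look like the hypothetical rules in the sketch below, which again uses Python’s urllib.robotparser to read them back. The paths and the ten-second delay are made up for illustration. Crawl-delay is a non-standard directive: Googlebot ignores it, and whether a given AI crawler honours it is an assumption you would need to confirm with its operator.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: GPTBot may crawl the site slowly but must skip two areas.
ROBOTS_TXT = """\
User-agent: GPTBot
Crawl-delay: 10
Disallow: /admin/
Disallow: /account/
"""


def summarise(robots_txt: str, user_agent: str, paths):
    """Return (crawl delay in seconds or None, {path: allowed}) for user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    allowed = {path: parser.can_fetch(user_agent, path) for path in paths}
    return parser.crawl_delay(user_agent), allowed


if __name__ == "__main__":
    delay, allowed = summarise(ROBOTS_TXT, "GPTBot", ["/blog/", "/admin/users"])
    print("Requested delay between requests:", delay)
    for path, ok in allowed.items():
        print(path, "allowed" if ok else "blocked")
```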

2. Web Application Firewall (WAF) 

You can also use a WAF to block the crawler(s) as well as any unwanted traffic to your site. You’ll be able to keep your site up and running for your customers without hindering their experience on your website.
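WAF products each have their own rule syntax, so the sketch below is not a configuration for any particular firewall. It is an application-level stand-in, written as a small Python WSGI wrapper, that shows the underlying idea: inspect the User-Agent header and refuse matching requests, while ordinary visitors pass through untouched. The token list and the demo app are assumptions, and because headers can be spoofed, real WAF rules often check published IP ranges as well.

```python
# Assumed User-Agent substrings to refuse (lower-cased); verify before use.
BLOCKED_TOKENS = ("gptbot", "chatgpt-user", "anthropic-ai", "ccbot")


def block_ai_crawlers(app):
    """Wrap a WSGI app so requests from listed crawlers get a 403 response."""
    def middleware(environ, start_response):
        user_agent = environ.get("HTTP_USER_AGENT", "").lower()
        if any(token in user_agent for token in BLOCKED_TOKENS):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        # Everyone else reaches the site as normal.
        return app(environ, start_response)
    return middleware


def demo_app(environ, start_response):
    """Hypothetical stand-in for your real website."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Hello\n"]


def status_for(user_agent: str) -> str:
    """Send one fake request through the wrapped app and return its status line."""
    captured = []

    def start_response(status, headers):
        captured.append(status)

    body = block_ai_crawlers(demo_app)({"HTTP_USER_AGENT": user_agent}, start_response)
    list(body)  # consume the response body, as a real server would
    return captured[0]


if __name__ == "__main__":
    for agent in ("Mozilla/5.0 (compatible; GPTBot/1.0)", "Mozilla/5.0 (Windows NT 10.0)"):
        print(agent, "->", status_for(agent))
```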

So, is it really worth blocking AI crawlers?  

When considering whether to block ChatGPT and similar crawlers, there’s more to ponder over than the downsides alone. 

In November 2023, ChatGPT hit 100 million users per week. With this figure likely to grow, that’s a great deal of brand visibility you’re missing out on if you refuse to embrace this technology.  

LLMs are the future of search. Did nobody tell you yet? Bing has already embraced AI in the form of Microsoft’s Copilot, and Google is hot on its heels, recently moving the testing of its own AI-powered search, Google Search Generative Experience (SGE), into the main Google search results. This means that if you’ve ever relied on organic search or SEO for a portion of your new business, blocking AI could seriously hamper your efforts, if not now, then in the near future.

There’s even a branch of SEO forming known as Generative Engine Optimisation (GEO) that focuses on improving visibility on popular LLMs like ChatGPT. Again, this may be an emerging acquisition channel that you’re missing out on if you block AI crawlers. 

You should also consider how effective it really is to try to block LLMs from your site. First, you must look beyond the big-name AIs. Blocking ChatGPT alone won’t cut it. Large language models like this are trained on a range of different datasets, drawing on sources like Wikipedia and Reddit.

One of the datasets most commonly used by LLMs (including ChatGPT) is Common Crawl, which is created by a non-profit organisation that crawls much of the public web. So, if you’re genuinely determined to exclude your site from LLMs, then you need to block bots like Common Crawl’s as well as the more popular crawlers.

Granting access to your website content can assist in ensuring that your brand is accurately and favourably portrayed to ChatGPT users. Blocking it may actually have the opposite effect if you’re trying to avoid being misrepresented online. 

All said and done, let’s say you bend over backwards to block every known crawler belonging to and contributing to AI LLMs. You’re safe, right? Wrong. Your website has almost certainly already been crawled and incorporated into existing datasets like Common Crawl’s. And, at present, there’s no way of removing your website content from these datasets. It all feels rather futile.

A final word

The rapidly evolving world of AI is intimidating, and whether you decide to attempt to block LLMs from your site or not, we’d recommend genning up on the subject. Whatever decision you make should be active and informed by up-to-date knowledge.

A not insignificant 32.9% of the top 1,000 websites on the internet have elected to block the GPTBot. However, for many the growing opportunity presented by AI, combined with the futility of trying to resist the tide, has led to the decision that blocking AI is not the right move. At least for now. 

Categories: News, Strategy

