© Copyright Acquisition International 2024 - All Rights Reserved.

Posted 15th April 2024

Should You Block AI Bots from Crawling Your Website?

Did you know AIs like ChatGPT could be crawling your site for data? The rise of AI large language models (LLMs) like ChatGPT and Bard (now called Gemini) has raised a question for businesses: should you block or allow AI bots like ChatGPT’s GPTBot to crawl your site? As AI is a rapidly developing technology, it’s a question businesses might not have thought they would need to consider in 2024, but one that should be high on the agenda. With sites like Amazon choosing to block ChatGPT, it’s clear that we should all be considering whether this is the right move.


According to SEO and PPC agency MRS Digital, businesses should make a careful decision on whether they want to block AI or not. On the one hand, blocking AI could help prevent risks, such as your content being unintentionally misrepresented. On the other, could you be missing a world of opportunity presented by this seemingly unstoppable technology shift? 

Quick recap: What are crawlers?  

A crawler is essentially a tool that is typically operated by search engines like Google or Bing to review your website and index data from it, like the content you’ve written and information about your company, ensuring your website appears in search results. It’s how search engines like Google discover and understand your site, so the concept of a crawler is nothing new. As a website owner you can decide which parts of your website you want crawlers to be able to view and index in search results by making use of robots.txt files.  
 
AI crawlers use the same technology, but instead of simply indexing your website data, they review the information on your site and can use it to train their own technology (large language models).

What AI chatbots could crawl your site?  

While most people will have heard of ChatGPT and Bard (now called Gemini), there are other lesser-known AI crawlers out there.  

The main AI crawlers to know about are:

  • ChatGPT-User. This is used by ChatGPT when a user on GPT-4 directs the bot to your site in a prompt like “tell me how many times [SITE URL] mentions AI”.
  • GPTBot. This is OpenAI’s crawler that collects data from your site to use as training data for its AI models.
  • Google-Extended. This is the robots.txt token that controls whether Google can use your content for its AI products, including Gemini (previously called Bard), its AI chatbot.
  • Anthropic-AI. Anthropic has a range of AI tools, including Claude, its AI chatbot, and its crawler collects the data for this.
  • CCBot. This is the Common Crawl bot, and its dataset is part of what GPT-3 was trained on. It’s designed to make web data accessible to everyone, without any fees.
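To see whether any of these crawlers already visit your site, one option is to search your web server’s access log for their user-agent tokens. The snippet below is a minimal sketch of that idea in Python. The function name `count_ai_crawler_hits` and the sample log lines are made up for illustration, and real log formats and user-agent strings will vary. Google-Extended is left out because it is a robots.txt control token rather than a separate crawler that appears in your logs.

```python
from collections import Counter

# User-agent tokens for the crawlers listed above (assumed spellings;
# check each vendor's documentation for the current values).
AI_CRAWLER_TOKENS = ["GPTBot", "ChatGPT-User", "anthropic-ai", "CCBot"]


def count_ai_crawler_hits(log_lines, tokens=AI_CRAWLER_TOKENS):
    """Count log lines that mention each AI crawler token (case-insensitive)."""
    hits = Counter()
    for line in log_lines:
        lowered = line.lower()
        for token in tokens:
            if token.lower() in lowered:
                hits[token] += 1
                break  # count each request once
    return hits


# Made-up access log lines, for illustration only.
sample_log = [
    '203.0.113.5 - - [15/Apr/2024:10:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.0)"',
    '203.0.113.9 - - [15/Apr/2024:10:00:05 +0000] "GET /about HTTP/1.1" 200 734 "-" "CCBot/2.0"',
    '198.51.100.7 - - [15/Apr/2024:10:00:09 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (Windows NT 10.0)"',
]

hits = count_ai_crawler_hits(sample_log)
print(dict(hits))
```

A simple substring match like this can be fooled by spoofed user agents, so treat the counts as a rough indication rather than proof of who is crawling.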

Why would you block AI crawlers? 

Blocking AI crawlers might be the right decision for you, especially if you’re concerned about your content being misrepresented, or your site is in development.  

Misrepresented content 

When humans write content, we write with nuance, and there may be cultural or business context that means what you write makes sense to a specific audience. When your content is taken out of that context and used to form part of an AI chatbot’s answer, it will most likely lose the nuance, and the point your content made may be lost or misrepresented entirely. For some companies, that isn’t a risk they want to run, and so they block AI crawlers to prevent this. For example, if you were a medical company that had specific advice pertaining to one of your products, you wouldn’t want an AI to take that out of context and apply it to an unrelated product or medical query.

Unwanted association 

AI crawlers tend to take sections of information from varying websites without always understanding the context of that information, so there is a risk that your information may be presented next to sources that your business doesn’t want to be associated with. If this is the case, then you may want to block AI crawlers. This will stop your company from being mixed in with competitors, or those in your industry who may not uphold best practice. For some companies where reputation management is an issue, or has been historically, this could be a very strong argument.

Data Scraping 

It’s best practice to block any crawlers from viewing parts of your site you don’t want them to see. For example, you might have a staff wellbeing portal on your intranet or customer logins on your website. You don’t want these crawled as they contain personally identifiable information, something your customers or employees definitely don’t want an AI company to have! OpenAI says that web pages crawled by GPTBot are “filtered to remove sources that require paywall access, are known to primarily aggregate personally identifiable information (PII), or have text that violates our policies.” Most websites will already have these areas blocked for search crawlers, so it’s worth speaking to your hosting provider or SEO team to see if they can add AI crawlers to the block list.

Spam Generation 

As technology evolves, so do cybercriminals. Thanks to AI-generated content, we’re now seeing the most sophisticated phishing emails and malicious links yet. By combining AI-powered chatbots like ChatGPT with data harvested from your site, malicious actors can create spam emails that are more realistic than ever and that more closely imitate your employees or the company itself. This could ultimately lead to more successful phishing attempts, which can cause financial and reputational loss for your business.

How to block the ChatGPT crawler: 

1. robots.txt 

A robots.txt file will most likely already be present on your site. It’s simply a matter of updating it to exclude the pages you want to block AI crawlers from viewing. Doing this tells crawlers that honour the file to stay away from sensitive data and content that should not be public knowledge. Robots.txt files done wrong can cause your site to no longer be seen by Google and other search engines, so it’s best to proceed with caution here. If you have an SEO agency, check in with them before you do this as they may be able to help you.

You can also ask crawlers to crawl at a certain speed, although not every crawler honours this. If you want them to crawl some, but not all, of your site (keeping them out of admin areas, for example), this is possible too. Different businesses will have varying reasons to block crawlers, or not block them at all.
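As a rough illustration of the robots.txt approach, the Python sketch below builds a set of rules that block the AI crawlers named earlier from the whole site, while leaving other crawlers free to visit everything except an admin area. It then checks the rules with Python’s standard `urllib.robotparser` module. The helper `build_robots_txt`, the `/admin/` path and the `example.com` URLs are assumptions made for this example, and the user-agent tokens should be checked against each vendor’s current documentation.

```python
from urllib.robotparser import RobotFileParser

# Tokens for the AI crawlers discussed in this article (assumed spellings).
AI_CRAWLERS = ["GPTBot", "ChatGPT-User", "Google-Extended", "anthropic-ai", "CCBot"]


def build_robots_txt(blocked_agents, admin_path="/admin/"):
    """Return robots.txt text that blocks the given agents from the whole site."""
    lines = []
    for agent in blocked_agents:
        lines.append(f"User-agent: {agent}")
        lines.append("Disallow: /")
        lines.append("")  # a blank line separates rule groups
    # All other crawlers may visit everything except the (hypothetical) admin area.
    lines.extend(["User-agent: *", f"Disallow: {admin_path}", ""])
    return "\n".join(lines)


robots_txt = build_robots_txt(AI_CRAWLERS)
print(robots_txt)

# Check the rules before publishing them, so search engines are not blocked by mistake.
parser = RobotFileParser()
parser.parse(robots_txt.splitlines())
print("GPTBot, blog post:", parser.can_fetch("GPTBot", "https://example.com/blog/post"))
print("Googlebot, blog post:", parser.can_fetch("Googlebot", "https://example.com/blog/post"))
print("Googlebot, admin:", parser.can_fetch("Googlebot", "https://example.com/admin/login"))
```

Testing the file in this way is one safeguard against the risk mentioned above of accidentally hiding your site from Google and other search engines.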

2. Web Application Firewall (WAF) 

You can also use a WAF to block the crawler(s) as well as any unwanted traffic to your site. You’ll be able to keep your site up and running for your customers without hindering their experience on your website.
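WAF products each have their own rule syntax, so the sketch below is not a WAF configuration. It is a small application-level stand-in, written as Python WSGI middleware, for the kind of rule a WAF might apply: refuse any request whose User-Agent header contains a blocked crawler token. The names `block_ai_crawlers`, `demo_app` and `status_for` are hypothetical. Google-Extended is not in the list because it is a robots.txt control token rather than a user agent that appears in requests.

```python
# Lower-case user-agent tokens to refuse (assumed spellings).
BLOCKED_TOKENS = ("gptbot", "chatgpt-user", "anthropic-ai", "ccbot")


def block_ai_crawlers(app, blocked_tokens=BLOCKED_TOKENS):
    """Wrap a WSGI app so that requests from blocked crawlers receive a 403."""
    def middleware(environ, start_response):
        user_agent = environ.get("HTTP_USER_AGENT", "").lower()
        if any(token in user_agent for token in blocked_tokens):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden"]
        # Everyone else is passed through to the real site untouched.
        return app(environ, start_response)
    return middleware


def demo_app(environ, start_response):
    """Stand-in for your real website."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Hello"]


def status_for(user_agent):
    """Return the HTTP status the wrapped app sends for a given User-Agent."""
    statuses = []
    wrapped = block_ai_crawlers(demo_app)
    wrapped({"HTTP_USER_AGENT": user_agent},
            lambda status, headers: statuses.append(status))
    return statuses[0]


print(status_for("Mozilla/5.0 (compatible; GPTBot/1.0)"))
print(status_for("Mozilla/5.0 (Windows NT 10.0)"))
```

Unlike robots.txt, which relies on crawlers choosing to comply, a rule like this is enforced by your own infrastructure, although a crawler that disguises its user agent would still get through.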

So, is it really worth blocking AI crawlers?  

When considering whether to block ChatGPT and similar crawlers, there’s more to ponder over than the downsides alone. 

In November 2023, ChatGPT hit 100 million users per week. With this figure likely to grow, that’s a great deal of brand visibility you’re missing out on if you refuse to embrace this technology.  

LLMs are the future of search. Did nobody tell you yet? Bing has already embraced AI in the form of Microsoft’s Copilot, and Google is hot on its heels, recently moving the testing of its own AI-powered search, Google Search Generative Experience (SGE), into the main Google search results. This means that if you’ve ever relied on organic search or SEO for a portion of your business generation, blocking AI could seriously hamper your efforts, if not now, then in the near future.

There’s even a branch of SEO forming known as Generative Engine Optimisation (GEO) that focuses on improving visibility on popular LLMs like ChatGPT. Again, this may be an emerging acquisition channel that you’re missing out on if you block AI crawlers. 

You should also consider how effective it really is to try to block LLMs from your site. First, you must look beyond the big-name AIs. Blocking ChatGPT alone won’t cut it. Large language models like this are trained on a range of different datasets like Wikipedia and Reddit.

One of the datasets most commonly used by LLMs (including ChatGPT) is Common Crawl, which is created by a non-profit organisation that crawls much of the public internet. So, if you’re genuinely determined to exclude your site from LLMs, then you need to block bots like Common Crawl’s as well as the more popular crawlers.

Granting access to your website content can assist in ensuring that your brand is accurately and favourably portrayed to ChatGPT users. Blocking it may actually have the opposite effect if you’re trying to avoid being misrepresented online. 

All said and done, let’s say you bend over backwards to block every known crawler belonging to and contributing to AI LLMs. You’re safe right? Wrong. Your website has almost certainly already been crawled and incorporated into existing datasets like Common Crawl’s. And, at present, there’s no way of removing your website content from these datasets. It all feels rather futile. 

A final word

The rapidly evolving world of AI is intimidating, and whether you decide to attempt to block LLMs from your site or not, we’d recommend genning up on the subject. Whatever decision you make should be active and informed by up-to-date knowledge.

A not insignificant 32.9% of the top 1,000 websites on the internet have elected to block the GPTBot. However, for many the growing opportunity presented by AI, combined with the futility of trying to resist the tide, has led to the decision that blocking AI is not the right move. At least for now. 

Categories: News, Strategy

