OpenAI launches GPTBot AI web crawler

What is OpenAI's new GPTBot AI web crawler? Meet Sam Altman's latest brainchild, which could help create GPT-5.

Reviewed By: Kevin Pocock

Last Updated on September 5, 2023

ChatGPT has been OpenAI’s favourite child since it came to the fore earlier this year. While not the AI research firm’s only product (see the now-defunct AI Classifier), it seemed like OpenAI’s first and only priority, until now. This week sees the release of GPTBot, the AI web crawler gathering data to train the AI models of the future.

What is GPTBot, OpenAI’s web crawler?

Artificial intelligence firm OpenAI has released a new web crawling tool for use alongside its current GPT-4 model. The former does not replace the latter, but rather augments its capabilities. See “What is ChatGPT and how can you use it?” or “How to use ChatGPT on mobile” for further reading on ChatGPT.

This new type of bot (new for the company, at least) will ‘scrape’ the internet for useful data on which future ChatGPT models will be based. This technology is sometimes referred to as a web spider. It has been around for a while, and in fact search engines including Google and Bing use web crawlers to build their search results.

What is an AI web crawler?

AI systems can improve this technology by using machine learning to recognize which indexed content is most relevant and accurate to recommend. In other words, the entire internet is like a library. Individual websites are, in this analogy, the books. A web crawler, then, is the librarian. It helps you find what you need, much faster than you would be able to by yourself. Internet data must be organised in some systematic way, analogous to the Dewey Decimal System. This is the ‘indexing’ that bots such as GPTBot use to look at and understand the internet, and how all the websites relate to each other.
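To make the library analogy concrete, here is a minimal sketch of an inverted index, the kind of "card catalogue" a crawler might build. It is a toy illustration only, not how GPTBot or any search engine actually works. The function names and the example.com URLs are hypothetical.

```python
from collections import defaultdict


def build_index(pages: dict[str, str]) -> dict[str, set[str]]:
    """Map each word to the set of page URLs that contain it."""
    index: dict[str, set[str]] = defaultdict(set)
    for url, text in pages.items():
        for raw_word in text.lower().split():
            word = raw_word.strip(".,!?")  # drop simple trailing punctuation
            if word:
                index[word].add(url)
    return dict(index)


def lookup(index: dict[str, set[str]], word: str) -> set[str]:
    """Return the URLs that mention the word, or an empty set."""
    return index.get(word.lower(), set())


# Hypothetical pages standing in for crawled websites (the "books").
pages = {
    "https://example.com/a": "Web crawlers index pages.",
    "https://example.com/b": "Librarians index books.",
}
index = build_index(pages)
print(sorted(lookup(index, "index")))
```

Looking a word up in the index is a single dictionary access, which is why the "librarian" can answer far faster than re-reading every "book".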

By systematically aggregating the available data from the content of websites, OpenAI hopes to improve the capabilities of future iterations beyond what would otherwise be possible. As explained in a blog post, “Allowing GPTBot to access your site can help AI models become more accurate and improve their general capabilities and safety.”

LLMs (Large Language Models) are trained on massive data sets, and nothing short of this approach seems sufficient to produce future models more ‘well-read’ than the ones we know and love today.

How to opt out of GPTBot

However, feeding endless raw data into the maw of the machine with zero scrutiny would be nothing short of data poisoning. The resulting model would be reckless, incoherent, hateful, and bloated with false information.

To prevent this, the web pages GPTBot is allowed to access have been filtered to disallow “sources that require paywall access, are known to gather personally identifiable information (PII), or have text that violates our policies.”

In full, the filter will disallow:

  • Websites that require paywall access (This would be disastrous to the publishing business model)
  • Websites known to gather personally identifiable information (Disseminating this private information could expose OpenAI to legal action under privacy laws such as GDPR)
  • Websites with content that violates OpenAI policies (Due to risk of spreading disinformation, hateful and biased or otherwise inappropriate content)
  • Websites that opt-out

If you’re not a web developer, don’t worry about this bit. The blog post then lists the IP addresses to expect requests from, and explains how to opt out. Adding the following GPTBot user agent rule to the robots.txt file on your web server tells the crawler to avoid your entire website:

User-agent: GPTBot
Disallow: /

Source: openai.com
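If you want to check that such a rule behaves as intended before deploying it, one option is Python's standard-library robots.txt parser. The sketch below assumes the two-line rule above, and the example.com URL and the `is_allowed` helper are hypothetical. It shows how Python's parser reads the rule, which is not necessarily identical to how GPTBot itself interprets edge cases.

```python
from urllib.robotparser import RobotFileParser

# The site-wide opt-out rule, written with no blank line between the
# User-agent line and its Disallow line.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# GPTBot is blocked everywhere; crawlers not named in the file are unaffected.
print(is_allowed(ROBOTS_TXT, "GPTBot", "https://example.com/any/page"))
print(is_allowed(ROBOTS_TXT, "Googlebot", "https://example.com/any/page"))
```

Keeping the `User-agent` and `Disallow` lines adjacent matters for some parsers, which treat a blank line as the end of a rule group.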

By adding the following GPTBot rules, you can allow or disallow specific directories (URL paths) on your site.

User-agent: GPTBot
Allow: /directory-1/
Disallow: /directory-2/

Source: openai.com
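For sites with several directories to manage, the per-directory rules can be generated and sanity-checked in a few lines. This is a rough sketch, not an official tool: `build_gptbot_rules` and `gptbot_can_fetch` are hypothetical helper names, the example.com URLs are placeholders, and the check relies on Python's standard-library parser, which may resolve overlapping Allow and Disallow rules differently from other crawlers.

```python
from urllib.robotparser import RobotFileParser


def build_gptbot_rules(allow: list[str], disallow: list[str]) -> str:
    """Build one robots.txt group for GPTBot from lists of URL paths."""
    lines = ["User-agent: GPTBot"]
    lines += [f"Allow: {path}" for path in allow]
    lines += [f"Disallow: {path}" for path in disallow]
    return "\n".join(lines) + "\n"


def gptbot_can_fetch(robots_txt: str, url: str) -> bool:
    """Check a URL against the rules using Python's robots.txt parser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("GPTBot", url)


rules = build_gptbot_rules(allow=["/directory-1/"], disallow=["/directory-2/"])
print(rules)
for path in ("/directory-1/page.html", "/directory-2/page.html", "/elsewhere/"):
    print(path, gptbot_can_fetch(rules, "https://example.com" + path))
```

Paths that match neither rule are treated as allowed, so anything you want kept out of the crawl needs its own `Disallow` line.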

OpenAI co-founder Sam Altman launches GPTBot less than a month after filing a trademark application for “GPT-5” on July 18th. The application covers the use of the term “GPT-5” for downloadable software across many areas of NLP, such as speech-to-text, AI-based natural language processing, and audio analysis for speech recognition.

Is GPTBot safe?

The opt-out (as opposed to opt-in) nature of the implementation has sparked ethical debate among internet users concerned about the use of their personal information. Data used by these systems is required to be anonymized and aggregated. This means that an LLM data set may include the average height of the US population, and the fact that 2,006 people in the US are called America, but the model would not be able to tell you the height of any one of those 2,006 people.

Brayden Lindrea reports via Cointelegraph that, in late June, a class action was filed against OpenAI by 16 plaintiffs alleging that the AI firm accessed private information from ChatGPT user interactions. If these allegations are proven to be accurate, OpenAI (and Microsoft, which was named as a defendant) would be in breach of the Computer Fraud and Abuse Act, a law with a precedent for web-scraping cases.

Steve is an AI Content Writer for PC Guide, writing about all things artificial intelligence. He currently leads the AI reviews on the website.