ChatGPT has been OpenAI’s favourite child since it came to the fore earlier this year. While not the AI research firm’s only product (see the now-defunct AI Classifier), it seemed like OpenAI’s first and only priority – until now. This week sees the release of GPTBot, the web crawler that will gather training data for the AI models of the future.
What is GPTBot, OpenAI’s web crawler?
Artificial intelligence firm OpenAI has released a new web crawling tool to work alongside their current GPT-4 model. The crawler does not replace the model; rather, it gathers the data on which future models will be trained. Try “What is ChatGPT and how can you use it?” or “How to use ChatGPT on mobile” for further reading on ChatGPT.
This new type of bot (at least for the company) will ‘scrape’ the internet for useful data on which future ChatGPT models will be based. Crawlers of this kind, sometimes referred to as web spiders, have been around for a while; search engines including Google and Bing use them to inform their search results.
What is an AI web crawler?
AI systems can improve this technology by using machine learning to recognize, from what has been indexed, which content is most relevant to recommend. In other words, the entire internet is like a library. Individual websites are, in this analogy, the books. A web crawler, then, is the librarian: it helps you find what you need much faster than you could by yourself. Internet data must be organised in some systematic way, analogous to the Dewey Decimal System. This is the ‘indexing’ that bots such as GPTBot use to look at and understand the internet, and how all the websites relate to each other.
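To make the librarian analogy concrete, here is a minimal crawler-and-indexer sketch in Python. It is purely illustrative and not OpenAI’s actual code: the seed URL, the page limit, and the one-second politeness delay are all assumptions for demonstration. The loop fetches a page, records which words appear on it, and queues the links it finds for later visits.

```python
import re
import time
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

import requests


class LinkParser(HTMLParser):
    """Collects every href found in an HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_url, max_pages=10):
    """Breadth-first crawl from seed_url, building a word -> pages index."""
    index = {}                       # word -> set of URLs containing it
    visited = set()
    frontier = deque([seed_url])

    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)

        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue                 # skip unreachable pages

        # Crude word indexing over the raw HTML, enough for the analogy.
        for word in set(re.findall(r"[a-z]{3,}", html.lower())):
            index.setdefault(word, set()).add(url)

        # Queue outgoing links to visit later.
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:
            frontier.append(urljoin(url, link))

        time.sleep(1)                # politeness delay between requests

    return index


if __name__ == "__main__":
    # Hypothetical seed URL, used purely for illustration.
    pages_index = crawl("https://example.com", max_pages=5)
    print(f"Indexed {len(pages_index)} distinct words")
```

A production crawler layers much more on top of this: robots.txt checks, per-host rate limits, deduplication, and proper text extraction rather than matching raw HTML.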
By systematically aggregating the available data from the content of websites, OpenAI hopes to improve the capabilities of future iterations beyond what would otherwise be possible. As explained in a blog post, “Allowing GPTBot to access your site can help AI models become more accurate and improve their general capabilities and safety.”
LLMs (Large Language Models) are trained on massive data sets, and nothing short of this approach seems sufficient to produce future models more ‘well-read’ than the ones we know and love today.
How to opt-out of GPTBot
However, feeding endless raw data into the maw of the machine with zero scrutiny would amount to data poisoning. The resulting model would be reckless, incoherent, hateful, and bloated with false information.
To prevent this, the web pages GPTBot is allowed to access have been filtered to disallow “sources that require paywall access, are known to gather personally identifiable information (PII), or have text that violates our policies.”
In full, the filter will disallow the following (a rough code sketch of this kind of filtering follows the list):
- Websites that require paywall access (this would be disastrous for the publishing business model)
- Websites known to gather personally identifiable information (disseminating this private information could invite class action lawsuits and GDPR penalties)
- Websites with content that violates OpenAI policies (due to the risk of spreading disinformation, or hateful, biased, or otherwise inappropriate content)
- Websites that opt-out
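For illustration only, the four exclusion rules above can be expressed as a simple predicate over per-source metadata. The `Source` fields and flag names below are assumptions for this sketch, not OpenAI’s actual filtering pipeline.

```python
from dataclasses import dataclass


@dataclass
class Source:
    """Metadata a crawling pipeline might attach to a candidate website."""
    url: str
    behind_paywall: bool      # requires paywall access
    collects_pii: bool        # known to gather personally identifiable information
    violates_policy: bool     # content breaches usage policies
    opted_out: bool           # e.g. disallows GPTBot via robots.txt


def allowed_for_training(source: Source) -> bool:
    """A source is kept only if none of the four exclusion rules apply."""
    return not (
        source.behind_paywall
        or source.collects_pii
        or source.violates_policy
        or source.opted_out
    )


# Hypothetical candidates: the opted-out site is dropped, the other is kept.
candidates = [
    Source("https://example.com/a", False, False, False, False),
    Source("https://example.org/b", False, False, False, True),
]
print([s.url for s in candidates if allowed_for_training(s)])
# ['https://example.com/a']
```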
If you’re not a web developer, don’t worry about this bit; the blog post also shows the IP address to expect calls from, and how to opt out. By adding the following GPTBot user-agent entry to the robots.txt file on your web server, your entire website domain will be avoided by the web crawler:
User-agent: GPTBot
Disallow: /
Source: openai.com
By adding the following GPTBot token, you can manually select which directories (paths) on your site to allow or disallow:
User-agent: GPTBot
Allow: /directory-1/
Disallow: /directory-2/
Source: openai.com
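If you want to confirm what a crawler that honours robots.txt would conclude from rules like these, Python’s standard-library urllib.robotparser can evaluate them. The domain and paths below are placeholders for your own site.

```python
from urllib.robotparser import RobotFileParser

# Point the parser at your site's robots.txt (example.com is a placeholder).
parser = RobotFileParser()
parser.set_url("https://example.com/robots.txt")
parser.read()

# Ask whether the GPTBot user agent may fetch specific paths.
for path in ("/directory-1/page.html", "/directory-2/page.html"):
    url = f"https://example.com{path}"
    verdict = "allowed" if parser.can_fetch("GPTBot", url) else "blocked"
    print(f"GPTBot is {verdict} for {url}")
```

With the Allow/Disallow rules shown above in place, the first path would come back allowed and the second blocked.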
GPTBot launches less than a month after OpenAI, led by co-founder Sam Altman, filed a trademark application for “GPT-5” on July 18th. The application covers use of the term “GPT-5” for downloadable software spanning every aspect of NLP, such as speech-to-text, AI-based natural language processing, and audio analysis for speech recognition.
Is GPTBot safe?
The opt-out (as opposed to opt-in) nature of the implementation has sparked ethical debate among internet users concerned about the use of their personal information. Data used by these systems is required to be anonymized and aggregated. This means that an LLM data set may include the average height of the US population, or the fact that 2,006 people in the US are called America (a true fact), but it could not tell you the height of any one of those specific 2,006 people.
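As a rough illustration of aggregation (not OpenAI’s pipeline, and the records below are invented for the example), the idea is that only summary statistics leave the data set, never the individual rows they were computed from.

```python
from statistics import mean

# Invented records used purely to illustrate aggregation.
records = [
    {"name": "America", "height_cm": 170},
    {"name": "Alice", "height_cm": 165},
    {"name": "America", "height_cm": 182},
    {"name": "Bob", "height_cm": 178},
]

# Only the aggregates are published; no individual row is exposed.
aggregates = {
    "average_height_cm": round(mean(r["height_cm"] for r in records), 1),
    "people_named_america": sum(1 for r in records if r["name"] == "America"),
}

print(aggregates)  # {'average_height_cm': 173.8, 'people_named_america': 2}
```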
Brayden Lindrea reports via Cointelegraph that “In late June, a class action was filed against OpenAI by 16 plaintiffs alleging the AI firm to have accessed private information from ChatGPT user interactions. If these allegations are proven to be accurate, OpenAI — and Microsoft, which was named as a defendant — will be in breach of the Computer Fraud and Abuse Act, a law with a precedent for web-scraping cases.”