AI Crawler
What is an AI crawler?
An AI crawler is a bot deployed by AI companies to scan, index, and ingest web content for training data or real-time retrieval. GPTBot, ClaudeBot, Google-Extended: these are how AI models discover and access your content.
How they work
AI crawlers function like search engine crawlers but serve different purposes. They scan web pages to collect content for:
- Pre-training data: building the model's base knowledge
- Fine-tuning: improving model performance on specific domains
- Real-time retrieval: RAG systems that fetch current content
Each AI company runs its own crawler with its own behavior, rate limits, and robots.txt directives:
| Crawler | Company | Purpose |
|---|---|---|
| GPTBot | OpenAI | Training + browsing |
| ClaudeBot | Anthropic | Training data |
| Google-Extended | Google | AI features (Gemini, AI Overviews) |
| PerplexityBot | Perplexity | Real-time search retrieval |
Robots.txt and AI access
Your robots.txt controls which AI crawlers can reach your content. Many sites accidentally block AI crawlers through:
- Overly restrictive blanket rules (`User-agent: *` / `Disallow: /`)
- Not explicitly allowing newer AI bots
- CMS default settings that block unknown crawlers
- CDN or firewall rules that rate-limit or block bot traffic
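To avoid these pitfalls, a robots.txt can explicitly allow each major AI crawler before any blanket rules apply. A minimal sketch (the `/admin/` path is a placeholder for whatever your site actually restricts):

```
# Explicitly allow the major AI crawlers
User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: PerplexityBot
Allow: /

# Everything else falls through to the default rules
User-agent: *
Disallow: /admin/
```

Because named user-agent groups take precedence over `*`, the AI crawlers listed above retain access even if the default group is tightened later.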
A site audit should verify that major AI crawlers have access to your content pages. This is table stakes. If crawlers can't reach your content, nothing else in your GEO strategy matters.
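That audit step can be sketched with Python's standard library, which ships a robots.txt parser. This is a minimal check, assuming you already have the robots.txt content as a string; the example URL is a placeholder:

```python
# Sketch: check whether major AI crawlers may fetch a given page,
# using the standard library's robots.txt parser.
from urllib.robotparser import RobotFileParser

AI_CRAWLERS = ["GPTBot", "ClaudeBot", "Google-Extended", "PerplexityBot"]

def audit_ai_access(robots_txt: str, page_url: str) -> dict:
    """Return {crawler: allowed?} for each major AI user agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {bot: parser.can_fetch(bot, page_url) for bot in AI_CRAWLERS}

# Example: a blanket Disallow blocks every AI crawler.
rules = "User-agent: *\nDisallow: /\n"
print(audit_ai_access(rules, "https://example.com/blog/post"))
```

Running this against your live robots.txt (fetched separately) quickly surfaces the "accidental block" cases described above.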
Making your content crawl-friendly
Beyond allowing access, optimize for AI crawler efficiency:
- Clean HTML structure: semantic elements (`<article>`, `<section>`, `<h1>`-`<h6>`)
- Schema.org markup: JSON-LD structured data AI systems can parse directly
- Fast page loads: slow pages may not get fully crawled
- Clear content hierarchy: headings and sections that map to topic structure
- Minimal JavaScript rendering: content available in initial HTML, not behind JS execution
- XML sitemap: include all content pages you want AI to discover
These technical basics make your content easier for AI systems to parse, understand, and reference in their responses. Prompt Metrics tracks whether AI models are actually discovering and citing your content.
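For the Schema.org point above, a minimal JSON-LD block placed in the page's `<head>` might look like this (all values are placeholders, not recommendations for your actual markup):

```
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Article",
  "headline": "Example article headline",
  "author": { "@type": "Organization", "name": "Example Co" },
  "datePublished": "2025-01-15"
}
</script>
```

Because JSON-LD sits in a single script tag rather than being woven through the HTML, crawlers can extract it without rendering the page.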
Frequently Asked Questions
Should I block AI crawlers from my site?
Blocking them prevents your content from being used in AI training and retrieval, which means AI models won't reference or recommend your brand. For most businesses seeking AI visibility, you want them crawling your site. Block only if you have specific IP protection concerns.
Which AI crawlers should I allow?
GPTBot (OpenAI), ClaudeBot (Anthropic), Google-Extended (Google AI), and PerplexityBot (Perplexity) are the main ones. Allow all of them unless you have a specific reason not to. Check your robots.txt to make sure none are inadvertently blocked.
How do I check whether AI crawlers are blocked?
Check your robots.txt file for `User-agent: GPTBot`, `User-agent: ClaudeBot`, `User-agent: Google-Extended`, and `User-agent: PerplexityBot`. If any are set to `Disallow: /`, those crawlers can't access your content. Also check for overly broad Disallow rules that might block them unintentionally.
Do AI crawlers respect robots.txt?
The major AI companies (OpenAI, Anthropic, Google, and Perplexity) have committed to respecting robots.txt directives. However, enforcement varies, and smaller AI companies may not always comply. Structured data provides an additional layer of control over how your content is used.