
robots.txt for AI Crawlers: How to Let GPTBot, ClaudeBot, and PerplexityBot In

Your robots.txt file determines whether AI models know your business exists. GPTBot, ClaudeBot, Google-Extended, and PerplexityBot all respect it. Blocking them means zero citations when someone asks an AI about your industry. AgentHermes checks this in D1 Discoverability — and most businesses get it wrong.

AgentHermes Research
April 15, 2026 · 12 min read

The Invisible Business Problem

When someone asks ChatGPT “What is the best CRM for small businesses?” or asks Perplexity “Which HVAC company in Austin has the best reviews?”, the AI constructs its answer from content it has crawled and indexed. If your robots.txt blocks the AI crawler, your content is not in the training data and not available for real-time browsing. You are invisible.

This is not a theoretical risk. Our scans show that 38% of businesses with a robots.txt file have rules that block at least one major AI crawler, usually unintentionally. A blanket User-agent: * / Disallow: / blocks everything, including AI crawlers. More commonly, businesses block “bots” they do not recognize without realizing that GPTBot and ClaudeBot are among them.
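
To see the effect of the blanket rule concretely, here is a minimal sketch using Python's standard-library `urllib.robotparser`. The helper name `blocked_crawlers` and the sample file are illustrative assumptions, and real crawlers may parse edge cases differently from this library.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt containing only the blanket rule described above.
BLANKET_BLOCK = """\
User-agent: *
Disallow: /
"""

AI_CRAWLERS = ["GPTBot", "ChatGPT-User", "ClaudeBot", "PerplexityBot"]


def blocked_crawlers(robots_txt: str, agents: list[str], url: str = "/") -> list[str]:
    """Return the agents that may not fetch `url` under `robots_txt`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, url)]


# With the blanket rule, every listed AI crawler is reported as blocked.
print(blocked_crawlers(BLANKET_BLOCK, AI_CRAWLERS))
```

Passing an empty string instead of `BLANKET_BLOCK` returns an empty list, since a robots.txt with no rules blocks nobody.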

The result is a new form of digital invisibility. Your SEO might be perfect — you rank #1 on Google — but if AI crawlers are blocked, you get zero citations in AI-generated answers. This matters because AI answer engines are the fastest-growing referral channel in 2026, and they are cannibalizing traditional search clicks.

38%: block AI crawlers
0.12: D1 weight in score
6: major AI crawlers
2 min: to fix robots.txt

AI Crawlers You Should Allow

These crawlers power the AI models that millions of people use daily. Allowing them means your business gets cited in AI-generated answers. Every one of them respects robots.txt — if you allow them, they crawl your public pages; if you block them, they skip you entirely.

GPTBot

OpenAI
Allow

Crawls content that may be used to train OpenAI's GPT models

Being in ChatGPT training data means getting cited when users ask about your industry.

ChatGPT-User

OpenAI
Allow

Real-time browsing when ChatGPT users ask questions

Blocking this means ChatGPT cannot read your pages when users ask about you directly.

anthropic-ai / ClaudeBot

Anthropic
Allow

Trains Claude models, powers Claude web search

Claude is used by millions of professionals. Your business should be in its knowledge base.

Google-Extended

Google
Allow

Trains Gemini models (separate from Googlebot for Search)

Controls whether your content trains Gemini. Does NOT affect Google Search ranking.

PerplexityBot

Perplexity AI
Allow

Powers Perplexity search answers with citations

Perplexity cites sources. Getting crawled = getting cited = free qualified traffic.

Bytespider

ByteDance
Allow

Trains TikTok AI models

Optional. Large user base but less direct business value than search-oriented crawlers.

Scrapers You Should Block

Not all bots are created equal. These crawlers extract your data for resale or bulk datasets with no direct benefit to your business. Blocking them is good practice.

CCBot

Common Crawl
Block

Bulk dataset for anyone to train on

Your content ends up in open datasets used by competitors. No direct benefit to you.

Diffbot

Diffbot
Block

Structured data extraction for resale

Extracts and resells your product data, pricing, and content. Pure scraping.

Scrapy

Various
Block

Open-source scraping framework default user-agent

Generic scraping. No AI training benefit. Usually competitive intelligence harvesting.
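
The allow and block lists above can also be kept as data and rendered into a robots.txt file. The sketch below assumes a hypothetical helper, `build_robots_txt`, and a placeholder sitemap URL. It emits one group per user-agent followed by a permissive default group.

```python
def build_robots_txt(allow: list[str], block: list[str], sitemap_url: str) -> str:
    """Render one group per user-agent, then a permissive default group."""
    lines: list[str] = []
    for agent in allow:
        lines += [f"User-agent: {agent}", "Allow: /", ""]
    for agent in block:
        lines += [f"User-agent: {agent}", "Disallow: /", ""]
    lines += ["User-agent: *", "Allow: /", "", f"Sitemap: {sitemap_url}"]
    return "\n".join(lines) + "\n"


# Crawler names taken from the two lists in this article.
ALLOW = ["GPTBot", "ChatGPT-User", "anthropic-ai", "ClaudeBot",
         "Google-Extended", "PerplexityBot", "Bytespider"]
BLOCK = ["CCBot", "Diffbot", "Scrapy"]

# The sitemap URL is a placeholder; substitute your own domain.
print(build_robots_txt(ALLOW, BLOCK, "https://yourdomain.com/sitemap.xml"))
```

Keeping the lists as data makes a quarterly review a small change: add or move a name, then regenerate the file.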

Copy-Paste robots.txt Template

Drop this into your robots.txt file. It allows all major AI crawlers and search engines while blocking known scrapers. Customize the Disallow paths for any private sections of your site.

robots.txt
# ===========================================
# robots.txt — AI-Optimized Configuration
# Generated by AgentHermes (agenthermes.ai)
# ===========================================

# --- Search Engines (always allow) ---
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

# --- AI Crawlers (allow for GEO) ---
User-agent: GPTBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: anthropic-ai
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Bytespider
Allow: /

# --- Scrapers (block) ---
User-agent: CCBot
Disallow: /

User-agent: Diffbot
Disallow: /

User-agent: Scrapy
Disallow: /

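# --- Default: allow everything else ---
# Note: a crawler that matches a named group above follows only that group,
# so repeat these Disallow lines there if the paths must be off-limits to it too.
User-agent: *
Allow: /
Disallow: /api/
Disallow: /admin/
Disallow: /dashboard/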

# --- Sitemap ---
Sitemap: https://yourdomain.com/sitemap.xml
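
Before deploying a file like this, it can help to test it locally. The sketch below runs a shortened excerpt of the template through Python's standard-library `urllib.robotparser`; the sample paths are illustrative. It also shows a point that is easy to miss: a crawler that matches its own named group does not fall through to the `*` group, so GPTBot is still allowed on /admin/ here. `urllib.robotparser` applies rules in file order, so the `*` group is not queried in this example; crawlers that use longest-match precedence may evaluate that group differently.

```python
from urllib.robotparser import RobotFileParser

# Shortened excerpt of the template: two named groups plus the default group.
TEMPLATE_EXCERPT = """\
User-agent: GPTBot
Allow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
Disallow: /admin/

Sitemap: https://yourdomain.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(TEMPLATE_EXCERPT.splitlines())

# Each pair is (user-agent token, path to test); the paths are examples only.
for agent, path in [("GPTBot", "/pricing"), ("CCBot", "/pricing"), ("GPTBot", "/admin/")]:
    verdict = "allowed" if parser.can_fetch(agent, path) else "blocked"
    print(f"{agent:8} {path:10} {verdict}")
```

If /admin/ must stay closed to a named crawler, add the same Disallow line to that crawler's group.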

Pro tip: Combine with llms.txt for maximum AI visibility. robots.txt controls who can crawl your site. llms.txt tells AI models what your business actually does, in a format optimized for LLM consumption. Together, they cover both discoverability (can the AI find you?) and comprehension (does the AI understand you?). Both are checked in D1 Discoverability.

How AgentHermes Checks Your robots.txt

When you run an Agent Readiness Scan, AgentHermes fetches your robots.txt and checks three things:

AI Crawler Access

Are GPTBot, ClaudeBot, Google-Extended, and PerplexityBot allowed? Blocking any of the four major AI crawlers reduces your D1 score.

Sitemap Declaration

Does robots.txt include a Sitemap directive? AI crawlers use sitemaps to discover all your pages efficiently. Without one, crawlers may miss important content.

Overly Restrictive Rules

Is there a blanket "Disallow: /" for User-agent: * that would block unknown future AI crawlers? The ideal config explicitly allows known good bots and only blocks known bad ones.
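
The three checks above can be approximated locally. This is a rough sketch rather than AgentHermes' actual scanner. The function name `audit_robots_txt`, the result keys, and the probe agent name are all assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

# The four crawlers named in the AI Crawler Access check.
MAJOR_AI_CRAWLERS = ["GPTBot", "ClaudeBot", "Google-Extended", "PerplexityBot"]


def audit_robots_txt(robots_txt: str) -> dict:
    """Approximate the three checks described above for one robots.txt body."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    blocked = [a for a in MAJOR_AI_CRAWLERS if not parser.can_fetch(a, "/")]
    return {
        "blocked_ai_crawlers": blocked,
        # site_maps() returns None when no Sitemap directive is present.
        "has_sitemap": bool(parser.site_maps()),
        # A made-up agent name should only ever match the "*" group.
        "blanket_block": not parser.can_fetch("unlisted-agent-xyz", "/"),
    }


# Hypothetical file: blanket block with a single exception for GPTBot.
sample = "User-agent: *\nDisallow: /\n\nUser-agent: GPTBot\nAllow: /\n"
print(audit_robots_txt(sample))
```

For this sample, GPTBot passes but the other three crawlers are reported as blocked, no sitemap is found, and the blanket block is flagged.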

The D1 Discoverability dimension carries a weight of 0.12 in your overall Agent Readiness Score. robots.txt configuration is one of several signals within D1, alongside llms.txt presence, agent-card.json, Schema.org markup, and AGENTS.md. But robots.txt is the most foundational — if AI crawlers cannot access your site at all, nothing else in D1 matters.

4 Common robots.txt Mistakes That Block AI Crawlers

Blanket wildcard block

"User-agent: * / Disallow: /" blocks everything — including AI crawlers. This is the nuclear option. Use explicit blocks instead.

User-agent: *
Disallow: /

Blocking "bot" in the name

Some WAFs and plugins block any user-agent containing "bot". This catches GPTBot, ClaudeBot, and PerplexityBot — the exact crawlers you want.

# WAF rule: block *bot*

Forgetting ChatGPT-User

GPTBot trains the model. ChatGPT-User browses in real time. Allowing GPTBot but blocking ChatGPT-User means ChatGPT cannot read your pages live when users ask about you.

User-agent: GPTBot
Allow: /
# Missing ChatGPT-User

No Sitemap directive

Even if crawlers are allowed, they need a sitemap to find all your pages efficiently. Without it, important pages like pricing and product catalogs may never be crawled.

# Missing:
# Sitemap: https://...
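
The "bot" substring mistake is easy to reproduce outside a WAF. The sketch below is a simplified model, not any vendor's rule syntax. The pattern, the function names, and the shortened User-Agent strings are assumptions for illustration. It compares a naive substring rule with one that exempts known crawler tokens first.

```python
import re

# Hypothetical WAF-style pattern: reject any User-Agent containing "bot".
NAIVE_BOT_PATTERN = re.compile(r"bot", re.IGNORECASE)

# Crawler tokens named in this article that should stay reachable.
ALLOWED_TOKENS = ("GPTBot", "ChatGPT-User", "ClaudeBot",
                  "PerplexityBot", "Googlebot", "Bingbot")


def naive_waf_blocks(user_agent: str) -> bool:
    """Return True if the naive substring rule would reject this User-Agent."""
    return bool(NAIVE_BOT_PATTERN.search(user_agent))


def allowlist_waf_blocks(user_agent: str) -> bool:
    """Same rule, but exempt the known-good crawler tokens first."""
    lowered = user_agent.lower()
    if any(token.lower() in lowered for token in ALLOWED_TOKENS):
        return False
    return naive_waf_blocks(user_agent)


# Simplified, illustrative User-Agent strings (real ones are longer).
for ua in ["Mozilla/5.0 (compatible; GPTBot/1.0)", "SomeScraperBot/2.0"]:
    print(ua, "| naive:", naive_waf_blocks(ua), "| allowlist:", allowlist_waf_blocks(ua))
```

The naive rule rejects both agents, while the allowlist version lets GPTBot through and still rejects the unknown scraper.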

Frequently Asked Questions

Does blocking GPTBot affect my Google Search ranking?

No. GPTBot is separate from Googlebot. Blocking GPTBot has zero effect on Google Search results. Similarly, blocking Google-Extended only affects Gemini AI training, not your search ranking. Googlebot (for Search) should always be allowed.

Will allowing AI crawlers steal my content?

AI crawlers use your content to build knowledge, not to display it verbatim. When ChatGPT or Perplexity answers a question about your industry, they synthesize information from thousands of sources — including yours, if you allow crawling. The risk of not being crawled (invisibility) far outweighs the risk of being crawled (your content contributing to AI knowledge). If you have genuinely proprietary data (research papers, paid content), you can block specific paths while allowing your public pages.
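
Blocking specific paths while keeping public pages open might look like the sketch below. The paths /research/ and /members/ are hypothetical stand-ins for proprietary sections. The Disallow lines come first because Python's `urllib.robotparser` applies rules in file order; crawlers that use longest-match precedence should reach the same result regardless of order.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: one shared group for three AI crawlers.
SELECTIVE_RULES = """\
User-agent: GPTBot
User-agent: ClaudeBot
User-agent: PerplexityBot
Disallow: /research/
Disallow: /members/
Allow: /
"""

parser = RobotFileParser()
parser.parse(SELECTIVE_RULES.splitlines())

# Public pages stay crawlable; the two private sections do not.
for path in ["/pricing", "/research/paper.pdf", "/members/area"]:
    print(path, parser.can_fetch("GPTBot", path))
```

Listing several User-agent lines above one set of rules keeps the private paths in a single place instead of repeating them per crawler.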

How does robots.txt affect my Agent Readiness Score?

AgentHermes checks robots.txt as part of D1 Discoverability (weight: 0.12). Specifically, we check whether AI crawler user-agents are blocked. A blanket "Disallow: /" for all bots scores 0 on crawler accessibility. Selectively allowing AI crawlers while blocking scrapers is the optimal configuration and earns full D1 crawler points.

What if I use a CMS like WordPress or Shopify — can I edit robots.txt?

Yes. WordPress serves a virtual robots.txt by default; you can edit it via plugins like Yoast SEO or override it by uploading your own robots.txt file to the site root. Shopify generates robots.txt automatically but allows customization through the robots.txt.liquid template in your theme. Squarespace, Wix, and other builders have robots.txt settings in their SEO panels. Every major CMS supports this.

How often should I update my robots.txt for AI crawlers?

Review quarterly. New AI crawlers and user-agent tokens appear regularly. AgentHermes tracks new AI crawler user-agents and flags when your robots.txt is missing rules for newly relevant bots. The template in this article covers all current major AI crawlers as of April 2026.


Is your robots.txt blocking AI crawlers?

Run a free Agent Readiness Scan to check your robots.txt, llms.txt, agent-card.json, and 50+ other signals across all 9 dimensions.

