robots.txt for AI Crawlers: How to Let GPTBot, ClaudeBot, and PerplexityBot In
Your robots.txt file determines whether AI models know your business exists. GPTBot, ClaudeBot, Google-Extended, and PerplexityBot all respect it. Blocking them means zero citations when someone asks an AI about your industry. AgentHermes checks this in D1 Discoverability — and most businesses get it wrong.
The Invisible Business Problem
When someone asks ChatGPT “What is the best CRM for small businesses?” or asks Perplexity “Which HVAC company in Austin has the best reviews?”, the AI constructs its answer from content it has crawled and indexed. If your robots.txt blocks the AI crawler, your content is not in the training data and not available for real-time browsing. You are invisible.
This is not a theoretical risk. Our scans show that 38% of businesses with a robots.txt file have rules that block at least one major AI crawler, usually unintentionally. A blanket "User-agent: * / Disallow: /" blocks everything, including AI crawlers. More commonly, businesses block “bots” they do not recognize without realizing that GPTBot and ClaudeBot are among them.
The result is a new form of digital invisibility. Your SEO might be perfect — you rank #1 on Google — but if AI crawlers are blocked, you get zero citations in AI-generated answers. This matters because AI answer engines are the fastest-growing referral channel in 2026, and they are cannibalizing traditional search clicks.
AI Crawlers You Should Allow
These crawlers power the AI models that millions of people use daily. Allowing them means your business gets cited in AI-generated answers. Every one of them respects robots.txt — if you allow them, they crawl your public pages; if you block them, they skip you entirely.
GPTBot
OpenAI: Trains GPT models, powers ChatGPT Browse and Search
Being in ChatGPT training data means getting cited when users ask about your industry.
ChatGPT-User
OpenAI: Real-time browsing when ChatGPT users ask questions
Blocking this means ChatGPT cannot read your pages when users ask about you directly.
anthropic-ai / ClaudeBot
Anthropic: Trains Claude models, powers Claude web search
Claude is used by millions of professionals. Your business should be in its knowledge base.
Google-Extended
Google: Trains Gemini models (separate from Googlebot for Search)
Controls whether your content trains Gemini. Does NOT affect Google Search ranking.
PerplexityBot
Perplexity AI: Powers Perplexity search answers with citations
Perplexity cites sources. Getting crawled = getting cited = free qualified traffic.
Bytespider
ByteDance: Trains TikTok AI models
Optional. Large user base but less direct business value than search-oriented crawlers.
Scrapers You Should Block
Not all bots are created equal. These crawlers extract your data for resale or bulk datasets with no direct benefit to your business. Blocking them is good practice.
CCBot
Common Crawl: Bulk dataset for anyone to train on
Your content ends up in open datasets used by competitors. No direct benefit to you.
Diffbot
Diffbot: Structured data extraction for resale
Extracts and resells your product data, pricing, and content. Pure scraping.
Scrapy
Various: Open-source scraping framework default user-agent
Generic scraping. No AI training benefit. Usually competitive intelligence harvesting.
Copy-Paste robots.txt Template
Drop this into your robots.txt file. It allows all major AI crawlers and search engines while blocking known scrapers. Customize the Disallow paths for any private sections of your site.
# ===========================================
# robots.txt - AI-Optimized Configuration
# Generated by AgentHermes (agenthermes.ai)
# ===========================================

# --- Search Engines (always allow) ---
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

# --- AI Crawlers (allow for GEO) ---
User-agent: GPTBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: anthropic-ai
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Bytespider
Allow: /

# --- Scrapers (block) ---
User-agent: CCBot
Disallow: /

User-agent: Diffbot
Disallow: /

User-agent: Scrapy
Disallow: /

# --- Default: allow everything else ---
User-agent: *
Allow: /
Disallow: /api/
Disallow: /admin/
Disallow: /dashboard/

# --- Sitemap ---
Sitemap: https://yourdomain.com/sitemap.xml
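Before deploying a template like this, it can help to parse it locally and confirm which crawlers get through. The following is a minimal sketch using Python's standard-library urllib.robotparser against a shortened excerpt of the template. The helper name `allowed`, the agent name `SomeOtherBot`, and the sample paths are assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

# Shortened copy of the template: one allowed AI crawler, one blocked
# scraper, and the default group. In the "*" group the Disallow lines are
# listed before "Allow: /" because Python's parser returns the first rule
# that matches, whereas RFC 9309 crawlers pick the longest matching path.
TEMPLATE_EXCERPT = """\
User-agent: GPTBot
Allow: /

User-agent: CCBot
Disallow: /

User-agent: *
Disallow: /api/
Disallow: /admin/
Disallow: /dashboard/
Allow: /
"""


def allowed(robots_txt: str, agent: str, path: str) -> bool:
    """Return True if `agent` may fetch `path` under `robots_txt`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


if __name__ == "__main__":
    for agent, path in [
        ("GPTBot", "/pricing"),           # named and allowed
        ("CCBot", "/pricing"),            # named and blocked
        ("SomeOtherBot", "/admin/users"), # falls through to the "*" group
        ("GPTBot", "/admin/users"),       # named group, no Disallow lines
    ]:
        print(f"{agent:13} {path:13} allowed={allowed(TEMPLATE_EXCERPT, agent, path)}")
```

One thing this check surfaces: a crawler follows only the most specific group that matches it, so the private paths listed under User-agent: * do not apply to GPTBot in this excerpt. If /api/, /admin/, and /dashboard/ should stay off-limits to the allowed crawlers too, repeat those Disallow lines in each named group.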
Pro tip: Combine with llms.txt for maximum AI visibility. robots.txt controls who can crawl your site. llms.txt tells AI models what your business actually does, in a format optimized for LLM consumption. Together, they cover both discoverability (can the AI find you?) and comprehension (does the AI understand you?). Both are checked in D1 Discoverability.
How AgentHermes Checks Your robots.txt
When you run an Agent Readiness Scan, AgentHermes fetches your robots.txt and checks three things:
AI Crawler Access
Are GPTBot, ClaudeBot, Google-Extended, and PerplexityBot allowed? Blocking any of the four major AI crawlers reduces your D1 score.
Sitemap Declaration
Does robots.txt include a Sitemap directive? AI crawlers use sitemaps to discover all your pages efficiently. Without one, crawlers may miss important content.
Overly Restrictive Rules
Is there a blanket "Disallow: /" for User-agent: * that would block unknown future AI crawlers? The ideal config explicitly allows known good bots and only blocks known bad ones.
The D1 Discoverability dimension carries a weight of 0.12 in your overall Agent Readiness Score. robots.txt configuration is one of several signals within D1, alongside llms.txt presence, agent-card.json, Schema.org markup, and AGENTS.md. But robots.txt is the most foundational — if AI crawlers cannot access your site at all, nothing else in D1 matters.
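The three checks described above can be approximated locally. This is a rough sketch, not AgentHermes's actual scanner: the function name `audit_robots`, the result keys, and the probe agent name are all hypothetical, and it relies on Python's urllib.robotparser (`site_maps()` requires Python 3.8 or later).

```python
from urllib.robotparser import RobotFileParser

# The four crawlers named in the "AI Crawler Access" check.
MAJOR_AI_CRAWLERS = ["GPTBot", "ClaudeBot", "Google-Extended", "PerplexityBot"]


def audit_robots(robots_txt: str) -> dict:
    """Hypothetical audit returning the three signals described above."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())

    # 1. AI crawler access: which of the four cannot fetch the home page?
    blocked = [a for a in MAJOR_AI_CRAWLERS if not parser.can_fetch(a, "/")]

    # 2. Sitemap declaration: site_maps() returns None when no Sitemap
    #    line is present.
    has_sitemap = bool(parser.site_maps())

    # 3. Overly restrictive rules: an agent name that no group mentions
    #    falls through to the "*" group, so it stands in for a future crawler.
    blanket_block = not parser.can_fetch("unknown-future-agent", "/")

    return {
        "blocked_ai_crawlers": blocked,
        "has_sitemap": has_sitemap,
        "blanket_block": blanket_block,
    }


if __name__ == "__main__":
    print(audit_robots("User-agent: *\nDisallow: /\n"))
```

The sketch only tests the root path, so a file that allows "/" but disallows every useful section would still pass. A fuller audit would probe a sample of real page URLs.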
4 Common robots.txt Mistakes That Block AI Crawlers
Blanket wildcard block
"User-agent: * / Disallow: /" blocks everything — including AI crawlers. This is the nuclear option. Use explicit blocks instead.
User-agent: *
Disallow: /
Blocking "bot" in the name
Some WAFs and plugins block any user-agent containing "bot". This catches GPTBot, ClaudeBot, and PerplexityBot — the exact crawlers you want.
# WAF rule: block *bot*
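To see why a substring rule is too broad, the sketch below compares a "contains bot" filter with a narrower rule that names only the scrapers listed earlier. Both patterns and the `is_blocked` helper are hypothetical stand-ins for whatever your WAF or plugin actually uses, and the user-agent strings are simplified samples rather than the full headers real crawlers send.

```python
import re

# Hypothetical WAF-style rule: reject any user-agent containing "bot".
NAIVE_RULE = re.compile(r"bot", re.IGNORECASE)

# Hypothetical narrower rule: reject only the scrapers named in this article.
SCRAPER_RULE = re.compile(r"\b(CCBot|Diffbot|Scrapy)\b", re.IGNORECASE)


def is_blocked(user_agent: str, rule: re.Pattern) -> bool:
    """Return True if the rule matches anywhere in the user-agent string."""
    return rule.search(user_agent) is not None


if __name__ == "__main__":
    samples = ["GPTBot/1.0", "ClaudeBot/1.0", "PerplexityBot/1.0", "CCBot/2.0"]
    for ua in samples:
        print(
            f"{ua:18} naive={is_blocked(ua, NAIVE_RULE)} "
            f"narrow={is_blocked(ua, SCRAPER_RULE)}"
        )
```

The naive rule rejects all four samples, while the narrower one rejects only CCBot. Because this filtering happens at the firewall, a permissive robots.txt does not help: the crawler is turned away before it can read the file.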
Forgetting ChatGPT-User
GPTBot trains the model. ChatGPT-User browses in real-time. Allowing GPTBot but blocking ChatGPT-User means ChatGPT cannot read your pages when users ask about you live.
User-agent: GPTBot
Allow: /
# Missing ChatGPT-User
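A short check makes the gap visible. The sketch assumes the file also contains a default group that blocks everything else, which is what turns the missing ChatGPT-User group into an actual block; the /pricing path is only an example.

```python
from urllib.robotparser import RobotFileParser

# Assumed setup: only GPTBot is named, and the default group blocks the rest.
ROBOTS_TXT = """\
User-agent: GPTBot
Allow: /

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# GPTBot matches its own group; ChatGPT-User matches no named group and
# therefore falls through to the "*" group.
gptbot_ok = parser.can_fetch("GPTBot", "/pricing")
chatgpt_user_ok = parser.can_fetch("ChatGPT-User", "/pricing")
print(gptbot_ok, chatgpt_user_ok)  # True False
```

The two user-agent tokens are matched independently, so each needs its own group.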
No Sitemap directive
Even if crawlers are allowed, they need a sitemap to find all your pages efficiently. Without it, important pages like pricing and product catalogs may never be crawled.
# Missing:
# Sitemap: https://...
Frequently Asked Questions
Does blocking GPTBot affect my Google Search ranking?
No. GPTBot is separate from Googlebot. Blocking GPTBot has zero effect on Google Search results. Similarly, blocking Google-Extended only affects Gemini AI training, not your search ranking. Googlebot (for Search) should always be allowed.
Will allowing AI crawlers steal my content?
AI crawlers use your content to build knowledge, not to display it verbatim. When ChatGPT or Perplexity answers a question about your industry, they synthesize information from thousands of sources — including yours, if you allow crawling. The risk of not being crawled (invisibility) far outweighs the risk of being crawled (your content contributing to AI knowledge). If you have genuinely proprietary data (research papers, paid content), you can block specific paths while allowing your public pages.
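Path-level blocking of this kind can be checked the same way as a whole-site rule. In the sketch below, the /research/ and /members/ directories are hypothetical placeholders for wherever your proprietary content lives.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical layout: /research/ and /members/ hold the proprietary content.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /research/
Disallow: /members/
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

public_ok = parser.can_fetch("GPTBot", "/pricing")
private_ok = parser.can_fetch("GPTBot", "/research/2026-report.pdf")
print(public_ok, private_ok)  # True False
```

Keep in mind that robots.txt is a request to crawlers, not access control. Paid or confidential content still needs authentication.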
How does robots.txt affect my Agent Readiness Score?
AgentHermes checks robots.txt as part of D1 Discoverability (weight: 0.12). Specifically, we check whether AI crawler user-agents are blocked. A blanket "Disallow: /" for all bots scores 0 on crawler accessibility. Selectively allowing AI crawlers while blocking scrapers is the optimal configuration and earns full D1 crawler points.
What if I use a CMS like WordPress or Shopify — can I edit robots.txt?
Yes, on most platforms. WordPress lets you edit robots.txt via plugins like Yoast SEO or by placing a robots.txt file in your site root. Shopify generates robots.txt automatically but allows customization through the robots.txt.liquid template in your theme. Hosted builders such as Squarespace and Wix vary in how much control they expose, so check the SEO or crawler settings available on your plan.
How often should I update my robots.txt for AI crawlers?
Review quarterly, because new AI crawler user-agents appear regularly. AgentHermes tracks new AI crawler user-agents and flags when your robots.txt is missing rules for newly relevant bots. The template in this article covers all current major AI crawlers as of April 2026.
Is your robots.txt blocking AI crawlers?
Run a free Agent Readiness Scan to check your robots.txt, llms.txt, agent-card.json, and 50+ other signals across all 9 dimensions.