AI models learn by ingesting massive amounts of web content. That’s great for innovation—but if you own a website, you may want to control how, when, and by whom your content is copied, indexed, or used to train large language models (LLMs). In this Watsspace Digital Marketing Blog guide, we’ll show you exactly how to block AI web crawlers with practical configurations, robust edge rules, and policy signals—all while preserving your SEO and site performance.
Why block AI web crawlers in the first place?
Before diving into configuration, it helps to clarify your goals. Website owners choose to block or limit AI crawlers for several reasons:
- Intellectual property protection: Your content may be copyrighted, licensed, or behind membership walls.
- Revenue protection: Uncontrolled scraping can undermine subscriptions, ads, or lead generation funnels.
- Brand and accuracy risk: Hallucinated or outdated AI outputs can misrepresent your expertise.
- Compliance: In regulated industries, copying data may create privacy, consent, or retention issues.
- Resource costs: High-volume crawlers stress servers and CDNs, inflating bandwidth bills and slowing real users.
Multiple independent reports show that bots are no fringe issue. According to the Imperva 2024 Bad Bot Report, 49.6% of internet traffic in 2023 was non-human, and bad bots accounted for a record 32%. That is a meaningful operational and security surface. On the AI side, Common Crawl describes indexing billions of pages per monthly crawl for research and model training. Public sentiment has shifted too: a 2024 Pew Research Center study found that a majority of Americans are more concerned than excited about AI’s impact, underlining a growing desire for control.
How AI crawlers work (and how they identify themselves)
AI crawlers are automated agents—often with declared user-agent strings—that fetch page content, metadata, and media. Responsible actors publish user-agent names, offer opt-out mechanisms via robots.txt and HTTP headers, and obey crawl-delay signals. Others may not.
Key characteristics you can use to identify and control them:
- User-agent tokens: Strings like GPTBot, CCBot, ClaudeBot, PerplexityBot, Google-Extended, Applebot-Extended, Meta-External, and Bytespider.
- Declared IP ranges: Some providers publish IP ranges you can allow/deny at the edge.
- Robots directives: Most reputable AI crawlers honor robots.txt blocks.
- Headers and meta signals: De facto tags such as noai and noimageai are increasingly supported, alongside X-Robots-Tag headers.
Important caveat: user-agents can be spoofed. Blocking via robots.txt is polite and low-friction, but it’s not a security boundary. For robust defense, combine robots.txt policies with WAF rules, server-level filtering, and monitoring.
Quick-start: the fastest way to block AI crawlers
If you want a fast, low-effort solution while you design a comprehensive policy, start with robots.txt. It’s simple, immediate to deploy, and broadly respected by legitimate AI crawlers.
# /robots.txt quick-start block list (edit to suit)
User-agent: GPTBot
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: anthropic-ai
Disallow: /
User-agent: PerplexityBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: Applebot-Extended
Disallow: /
User-agent: Meta-External
Disallow: /
User-agent: FacebookBot
Disallow: /
User-agent: Bytespider
Disallow: /
This cuts off many major LLM-related fetchers from the start. Next, layer in X-Robots-Tag and optional noai/noimageai signals, then strengthen enforcement with edge and server controls.
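The same token list reappears in robots.txt, server rules, and WAF rules, so it can help to keep it in one place and generate the snippets from it. The following is a minimal Python sketch, not something the providers publish: the names AI_TOKENS, build_robots_txt, and build_user_agent_regex are hypothetical, and the token list is simply the quick-start list above.

```python
# Hypothetical single source of truth for the crawler tokens in the quick-start list.
AI_TOKENS = [
    "GPTBot", "CCBot", "ClaudeBot", "anthropic-ai", "PerplexityBot",
    "Google-Extended", "Applebot-Extended", "Meta-External",
    "FacebookBot", "Bytespider",
]


def build_robots_txt(tokens):
    """Return robots.txt text with one 'Disallow: /' group per token."""
    groups = [f"User-agent: {token}\nDisallow: /" for token in tokens]
    # A blank line between groups is conventional and keeps the file readable.
    return "\n\n".join(groups) + "\n"


def build_user_agent_regex(tokens):
    """Return an alternation for a case-insensitive server-side match.

    Assumes tokens contain only letters, digits, and hyphens, so no escaping is needed.
    """
    return "(" + "|".join(tokens) + ")"
```

Calling build_robots_txt(AI_TOKENS) yields the quick-start file, and build_user_agent_regex(AI_TOKENS) yields a pattern of the kind used in server-level user-agent rules, so adding a new crawler means editing one list.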
Robots.txt recipes to block major AI web crawlers
Use targeted directives so you can allow good bots (like search engines) while restricting model training crawlers.
Block OpenAI crawlers (GPTBot and OAI-SearchBot)
User-agent: GPTBot
Disallow: /
# OpenAI Search crawler (if applicable to your policy)
User-agent: OAI-SearchBot
Disallow: /
# ChatGPT browser fetcher
User-agent: ChatGPT-User
Disallow: /
OpenAI states GPTBot respects robots.txt and publishes IP ranges for allow/deny lists.
Block Common Crawl (CCBot)
User-agent: CCBot
Disallow: /
Common Crawl honors robots.txt and is a core upstream data source for many AI datasets.
Block Anthropic crawlers (ClaudeBot and anthropic-ai)
User-agent: ClaudeBot
Disallow: /
User-agent: anthropic-ai
Disallow: /
Block Perplexity (PerplexityBot)
User-agent: PerplexityBot
Disallow: /
Control Google’s generative AI usage (Google-Extended)
# Note: This does not block Google Search indexing by Googlebot.
User-agent: Google-Extended
Disallow: /
Google introduced the Google-Extended control to limit the use of your content for certain generative AI features (e.g., Gemini and Vertex AI generative services). It is separate from Googlebot.
Control Apple’s model training (Applebot-Extended)
User-agent: Applebot-Extended
Disallow: /
Apple provides Applebot-Extended to allow or opt out of certain training uses, separate from Applebot for search/Siri.
Control Meta’s AI crawlers (Meta-External and FacebookBot)
User-agent: Meta-External
Disallow: /
User-agent: FacebookBot
Disallow: /
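Meta has communicated that site owners can use these product tokens to limit certain AI-related uses. Meta's crawler documentation lists the full token as meta-externalagent, so confirm the exact spelling there before relying on the shortened Meta-External form in robots.txt, where a crawler matches rules against its own product token.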
Block ByteDance crawler (Bytespider)
User-agent: Bytespider
Disallow: /
Allowlist search engines you want
# Example: Allow search crawlers for SEO while blocking AI training bots
User-agent: Googlebot
Disallow:
User-agent: Bingbot
Disallow:
User-agent: DuckDuckBot
Disallow:
Always double-check that you’re not unintentionally blocking key search engine bots that drive organic traffic.
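One way to double-check is to parse the file and ask whether each bot may fetch a URL. The sketch below uses Python's standard urllib.robotparser. The function name audit_robots, the SAMPLE text, and the example.com URL are assumptions for illustration. Python's parser applies the first matching rule and matches agent names by substring, so treat the result as an approximation of how any given crawler interprets the file.

```python
from urllib.robotparser import RobotFileParser


def audit_robots(robots_txt, must_allow, must_block, url="https://example.com/"):
    """Return a list of problems; an empty list means the policy holds for `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    problems = []
    for agent in must_allow:
        if not parser.can_fetch(agent, url):
            problems.append(f"{agent} is blocked but should be allowed")
    for agent in must_block:
        if parser.can_fetch(agent, url):
            problems.append(f"{agent} is allowed but should be blocked")
    return problems


# Illustrative policy: two AI crawlers blocked, Googlebot explicitly allowed.
SAMPLE = """\
User-agent: GPTBot
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: Googlebot
Disallow:
"""
```

Running audit_robots(SAMPLE, ["Googlebot", "Bingbot"], ["GPTBot", "CCBot"]) returns an empty list. A file that contains a blanket "User-agent: *" block would instead report the search bots as blocked.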
Edge enforcement: block AI crawlers with WAF and CDN rules
Robots.txt is advisory. To ensure policy compliance and protect resources, enforce at the edge with your CDN or WAF (Cloudflare, Akamai, Fastly, AWS WAF, etc.).
- Block by user-agent: Deny requests with AI crawler tokens.
- Rate limit: Throttle suspicious scrapers that obey robots.txt but fetch aggressively.
- IP allow/deny: Block known AI crawler IP ranges, or allow only known search engine IPs.
- Bot score/challenges: Use behavioral scoring or JS challenges for stealthy crawlers.
Example pseudo-rules you can adapt to your WAF:
# Block known AI user-agents
IF http.user_agent contains_any ["GPTBot","OAI-SearchBot","ChatGPT-User","CCBot","ClaudeBot","anthropic-ai","PerplexityBot","Google-Extended","Applebot-Extended","Meta-External","FacebookBot","Bytespider"]
THEN block
# Throttle unknown high-rate agents
IF (requests_from_ip_in_1min > 120) AND (http.user_agent is_unknown)
THEN rate_limit(429)
# Optional: allowlist major search engines by verified ASN/IP
IF reverse_dns_validated_googlebot OR reverse_dns_validated_bingbot
THEN allow
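To make the throttle rule concrete, here is a minimal Python sketch of a per-IP sliding-window counter using the same 120-requests-per-minute threshold. It is an illustration of the logic only, not how any particular WAF implements rate limiting. The class name SlidingWindowLimiter is hypothetical, and timestamps are passed in explicitly so the behavior is easy to test.

```python
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` requests per client within `window_seconds`."""

    def __init__(self, limit=120, window_seconds=60.0):
        self.limit = limit
        self.window = window_seconds
        self._hits = defaultdict(deque)  # client IP -> timestamps of allowed requests

    def allow(self, client_ip, now):
        """Return True if the request may proceed, False if it should get a 429."""
        hits = self._hits[client_ip]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True
```

Rejected requests are not recorded here, so a client regains capacity as its earlier allowed requests age out. A stricter design could count rejected requests too, which penalizes clients that keep hammering after a 429.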
Server-level blocking: Apache and Nginx examples
If you control the origin, add defense in depth with web server rules. These can return 403 (Forbidden) or 410 (Gone) to blocked crawlers.
Apache (.htaccess) example
# Block AI user-agents
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (GPTBot|OAI-SearchBot|ChatGPT-User|CCBot|ClaudeBot|anthropic-ai|PerplexityBot|Google-Extended|Applebot-Extended|Meta-External|FacebookBot|Bytespider) [NC]
RewriteRule ^ - [F]
# Optional: Deny by IP range (example CIDR placeholder; Apache 2.4 syntax)
# <RequireAll>
# Require all granted
# Require not ip 192.0.2.0/24
# </RequireAll>
Nginx example
# Declare the map in the http context, outside any server block
map $http_user_agent $block_ai {
    default 0;
    "~*GPTBot" 1;
    "~*OAI-SearchBot" 1;
    "~*ChatGPT-User" 1;
    "~*CCBot" 1;
    "~*ClaudeBot" 1;
    "~*anthropic-ai" 1;
    "~*PerplexityBot" 1;
    "~*Google-Extended" 1;
    "~*Applebot-Extended" 1;
    "~*Meta-External" 1;
    "~*FacebookBot" 1;
    "~*Bytespider" 1;
}
server {
    # Return 403 to any request whose user-agent matched the map
    if ($block_ai) {
        return 403;
    }
    # ...
}
Note: User-agent checks can generate false positives if patterns match legitimate tools. Test in staging before deploying to production.
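A quick way to test for false positives before touching a server is to run the pattern against sample user-agent strings. The Python sketch below mirrors the alternation used in the Apache rule. The helper names and the sample strings are illustrative assumptions; replace the samples with user-agents taken from your own logs.

```python
import re

# Same alternation as the Apache rule, matched case-insensitively like [NC].
AI_PATTERN = re.compile(
    r"(GPTBot|OAI-SearchBot|ChatGPT-User|CCBot|ClaudeBot|anthropic-ai|"
    r"PerplexityBot|Google-Extended|Applebot-Extended|Meta-External|"
    r"FacebookBot|Bytespider)",
    re.IGNORECASE,
)

# Illustrative samples only; real strings vary by version and platform.
LEGITIMATE = [
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/120.0 Safari/537.36",
]
CRAWLERS = [
    "Mozilla/5.0 AppleWebKit/537.36 (compatible; GPTBot/1.0; +https://openai.com/gptbot)",
    "CCBot/2.0 (https://commoncrawl.org/faq/)",
]


def false_positives(pattern, legitimate_user_agents):
    """Legitimate user-agents the pattern would wrongly block."""
    return [ua for ua in legitimate_user_agents if pattern.search(ua)]


def missed_blocks(pattern, crawler_user_agents):
    """Crawler user-agents the pattern would fail to block."""
    return [ua for ua in crawler_user_agents if not pattern.search(ua)]
```

Both helpers should return empty lists for a well-scoped pattern. Swapping in a generic pattern such as "bot" makes false_positives return the search engine samples, which is exactly the mistake to catch in staging.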
Send opt-out signals: X-Robots-Tag, noai, and noimageai
Several AI providers honor special headers and meta tags that communicate training restrictions. These signals are not standardized across all vendors but are increasingly recognized.
- X-Robots-Tag (HTTP header): Add to responses to control how non-HTML assets are handled.
- meta robots tags for HTML: Include noai and noimageai values if your policy is to opt out.
Set X-Robots-Tag headers
# Apache
Header set X-Robots-Tag "noai, noimageai"
# Nginx
add_header X-Robots-Tag "noai, noimageai" always;
HTML meta example
<meta name="robots" content="noai, noimageai">
Interpretation varies by provider. Treat these as policy declarations that responsible AI systems will honor. They complement, but do not replace, enforcement mechanisms.
Noindex, canonical, and NoAI: what’s the difference?
It’s easy to mix up directives. Here’s how they differ:
- noindex: Search engines should not index the page in results. Not a training directive.
- canonical: Indicates preferred URL for duplicate content. Not about AI training.
- noai / noimageai: De facto policies asking AI tools not to train on or generate from the content. Not a search index control.
Use noindex for SEO index control, and noai for AI training policy. Combine with robots.txt and server/WAF enforcement for best results.
Verify your blocks are working
Don’t “set and forget.” Verify with active tests and ongoing monitoring:
- Fetch as bot: Use curl with user-agent overrides.
- Check server logs: Confirm 403/404/410 for blocked agents.
- Inspect robots.txt: Ensure valid syntax and no conflicts.
- Edge analytics: Watch for drops in unwanted bot traffic after rules go live.
# Quick curl tests
curl -I -A "GPTBot" https://yourdomain.com/
curl -I -A "CCBot" https://yourdomain.com/
curl -I -A "ClaudeBot" https://yourdomain.com/
Look for 403 responses for blocked agents. For assets, confirm the X-Robots-Tag header is present.
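The curl checks can also be scripted so they run on a schedule. The following Python sketch uses only the standard library. The function names probe and unexpected are hypothetical, and it assumes that 403 or 410 is how your rules signal a block, matching the server examples in this guide.

```python
import urllib.error
import urllib.request

BLOCKED_STATUSES = {403, 410}


def probe(url, user_agent, timeout=10.0):
    """Send a HEAD request as `user_agent`; return (status, X-Robots-Tag value)."""
    request = urllib.request.Request(
        url, method="HEAD", headers={"User-Agent": user_agent}
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status, response.headers.get("X-Robots-Tag", "")
    except urllib.error.HTTPError as error:
        # 4xx/5xx responses arrive as exceptions but still carry status and headers.
        return error.code, error.headers.get("X-Robots-Tag", "")


def unexpected(results, blocked_agents):
    """Given {agent: status}, list agents whose outcome contradicts the policy."""
    failures = []
    for agent, status in results.items():
        should_block = agent in blocked_agents
        is_blocked = status in BLOCKED_STATUSES
        if should_block != is_blocked:
            failures.append(agent)
    return failures
```

For example, collecting probe("https://yourdomain.com/", agent)[0] for each agent into a dictionary and passing it to unexpected gives a list that should be empty when the rules work as intended.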
Monitor logs and identify stealth scrapers
Some crawlers mask their identity. Use log patterns and traffic analysis to spot them:
- Volume anomalies: Sudden spikes from a small set of IPs or ASNs.
- Deep crawl patterns: Systematic traversal of pagination, archives, and feeds.
- Missing critical resources: Headless fetches skipping CSS/JS/images.
- Low cookie rate: Bots often reject cookies and don’t execute JS.
Respond with adaptive controls: rate limiting, challenge pages, session-based throttling, or temporarily blacklisting abusive IPs and ASNs.
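As one example of these signals, the "missing critical resources" pattern can be approximated from access logs. The Python sketch below assumes logs in the common/combined log format and flags IPs that request many pages but never fetch a static asset. The function name suspicious_ips, the suffix list, and the min_pages threshold are illustrative assumptions to tune against your own traffic.

```python
import re
from collections import Counter

# Matches the start of a common/combined log format line.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) '
)
ASSET_SUFFIXES = (".css", ".js", ".png", ".jpg", ".jpeg", ".gif", ".svg", ".webp", ".woff2")


def suspicious_ips(lines, min_pages=50):
    """Return sorted IPs with at least `min_pages` page hits and zero asset hits."""
    pages, assets = Counter(), Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines in an unexpected format
        path = match.group("path").split("?", 1)[0].lower()
        if path.endswith(ASSET_SUFFIXES):
            assets[match.group("ip")] += 1
        else:
            pages[match.group("ip")] += 1
    return sorted(
        ip for ip, count in pages.items() if count >= min_pages and assets[ip] == 0
    )
```

This is a heuristic, not proof of scraping: API clients, feed readers, and visitors with cached assets can look similar, so treat the output as a list to investigate before blocking.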
Images, video, APIs, and feeds: don’t forget non-HTML surfaces
AI training doesn’t stop at HTML. Images, transcripts, PDFs, public APIs, and RSS feeds are high-value targets. Apply policies consistently:
- CDN rules for image/video paths (e.g., /media/, /uploads/, /cdn/).
- X-Robots-Tag headers on non-HTML responses: “noai, noimageai”.
- Rate limits for feed/API endpoints.
- Token/auth required for high-value API endpoints; never rely on robots.txt alone.
Build exceptions and allowlists
Most organizations need nuance—not a blanket block. Consider:
- Allowlist strategic partners by IP or signed requests.
- Permit search engines that drive revenue.
- Allow specific directories (e.g., press kit) while blocking the rest.
# robots.txt example: block by default, allow a directory
# (Allow is listed first so parsers that apply the first matching rule also honor it)
User-agent: GPTBot
Allow: /press/
Disallow: /
User-agent: CCBot
Disallow: /
Legal, ethical, and policy context
This guide is not legal advice, but some considerations can inform your strategy:
- Terms of Service: Clearly state permitted uses and AI training restrictions for your content.
- Jurisdiction: Copyright, database rights, and text/data mining exceptions vary by country.
- Attribution and licensing: If you publish under a license, consider AI-specific clauses.
- Transparency: Publicly documenting your AI policy increases compliance and goodwill.
Blocking is one lever; clear policy is another. Many AI providers aim to respect site owners’ preferences when articulated via robots.txt, headers, and ToS.
SEO and performance side effects (and how to avoid them)
Blocking AI crawlers should not damage your organic search performance if you:
- Do not block Googlebot, Bingbot, and other legit search bots.
- Scope rules precisely to specific AI agents, not generic patterns like “bot”.
- Monitor crawl stats in search webmaster tools to confirm steady indexing.
- Protect site speed by offloading block logic to your CDN/WAF where possible.
If you accidentally block a search engine, fix robots.txt and edge rules immediately and request recrawls via search tooling.
Reference table: major AI crawlers and how to block them
Use this table as a practical reference while you implement policy. Confirm each provider’s latest documentation, as user-agent tokens and practices can evolve.
| Crawler / Provider | Primary User-Agent token | Purpose (summary) | Respects robots.txt | Robots.txt block snippet (one directive per line) | Notes |
| --- | --- | --- | --- | --- | --- |
| OpenAI | GPTBot | Model training and retrieval | Yes (per OpenAI) | User-agent: GPTBot; Disallow: / | OpenAI publishes IP ranges; also use headers noai/noimageai if desired. |
| OpenAI Search | OAI-SearchBot | Search and retrieval | Yes (per OpenAI) | User-agent: OAI-SearchBot; Disallow: / | Block if you do not want inclusion in AI search experiences. |
| ChatGPT browser fetcher | ChatGPT-User | On-demand page fetching | Yes | User-agent: ChatGPT-User; Disallow: / | Blocks fetches initiated during chat browsing. |
| Common Crawl | CCBot | Open web crawl for datasets | Yes | User-agent: CCBot; Disallow: / | Upstream for many AI datasets; well-known, respects robots.txt. |
| Anthropic | ClaudeBot; anthropic-ai | Model training and retrieval | Yes (per Anthropic) | User-agent: ClaudeBot; User-agent: anthropic-ai; Disallow: / | Two tokens observed/documented; use both. |
| Perplexity | PerplexityBot | AI answer engine | Claims yes | User-agent: PerplexityBot; Disallow: / | Consider WAF enforcement for reliability. |
| Google (Generative) | Google-Extended | Control for Gemini/Vertex AI usage | Yes (per Google) | User-agent: Google-Extended; Disallow: / | Does not affect Googlebot search crawling. |
| Apple (Generative) | Applebot-Extended | Control for Apple AI training | Yes (per Apple) | User-agent: Applebot-Extended; Disallow: / | Separate from Applebot for search/Siri. |
| Meta (Facebook) | Meta-External; FacebookBot | Model training / previews | Yes (per Meta) | User-agent: Meta-External; User-agent: FacebookBot; Disallow: / | Block one or both depending on policy. |
| ByteDance | Bytespider | Content collection | Varies | User-agent: Bytespider; Disallow: / | Use WAF enforcement if heavy crawling persists. |
| Diffbot (AI extraction) | Diffbot | Structured extraction/knowledge graph | Yes | User-agent: Diffbot; Disallow: / | Some teams allow Diffbot; decide per use case. |
Statistics note: As of the Imperva 2024 Bad Bot Report, bots (good and bad) comprised nearly half of web traffic, underscoring the importance of a layered approach.
Enterprise workflow: governance for large websites
At scale, blocking AI crawlers is a cross-functional initiative. A practical governance model includes:
- Policy owners: Product/legal define the default posture (opt-out vs. selective allow).
- Technical stewards: DevOps/SRE own edge controls and server configs.
- Content stakeholders: Marketing and editorial identify exceptions or syndication partners.
- Security: Monitors for evasion and escalates enforcement.
- Change control: Version robots.txt, track rules in Git, and use CI for syntax checks.
Cadence example:
- Quarterly policy review (legal + product).
- Monthly crawler inventory (analytics + security).
- Rolling updates to blocklists and IPs at the edge.
- Post-deployment verification and reporting to stakeholders.
Troubleshooting: common mistakes and how to fix them
- Blocking good bots: The symptom is a sudden drop in SEO traffic. Fix it by removing search engine tokens from your block rules, then confirm the affected bots are genuine with reverse DNS validation.
- Relying only on robots.txt: Some crawlers may ignore it. Add WAF rules and server blocks.
- Incorrect robots syntax: Ensure “User-agent” and “Disallow” lines are properly formatted, with no stray characters.
- Forgetting non-HTML assets: Add X-Robots-Tag and edge rules for images, PDFs, and APIs.
- Not testing: Always curl with user-agent overrides and confirm status codes.
Step-by-step implementation guide
Here’s a straightforward rollout plan from the Watsspace team:
- Set your policy: Decide default stance (block most AI crawlers; allow only selected ones).
- Update robots.txt: Add blocks for GPTBot, CCBot, ClaudeBot, anthropic-ai, PerplexityBot, Google-Extended, Applebot-Extended, Meta-External, FacebookBot, Bytespider.
- Add headers: X-Robots-Tag: “noai, noimageai” for HTML and non-HTML responses.
- Deploy edge rules: WAF blocklist by user-agent; add rate limits for unknown UAs; consider IP-based rules.
- Server fallback: Implement Apache/Nginx rules to return 403 for targeted user-agents.
- Test: curl with spoofed UAs; verify 403s and headers.
- Monitor: Track bot traffic share and bandwidth; update as user-agents evolve.
- Document: Publish your AI policy page and keep internal runbooks updated.
Advanced tactics for persistent AI scrapers
If you’re dealing with aggressive or non-compliant crawlers:
- ASN blocking: Deny traffic from ASNs associated with scraping infrastructure.
- Session-based throttling: Require cookies and throttle sessions that don’t maintain state.
- Dynamic robots.txt: Serve stricter policies to suspicious IPs while maintaining a standard public file.
- Honeypot URLs: Invisible to humans; if hit, automatically flag and block the source.
- Reverse DNS verification: For allowlisted bots (Googlebot/Bingbot), verify via rDNS to avoid spoofing.
# Example: verify Googlebot (conceptual)
# 1) Resolve request IP to hostname. Should end with googlebot.com or google.com
# 2) Forward-lookup the hostname to confirm it maps back to the same IP
# If both checks pass, treat as genuine Googlebot
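The two-step check above can be sketched in Python with the standard socket module. The function name verify_crawler_ip and the injectable lookup parameters are assumptions added so the logic can be tested without live DNS. The default suffixes are the two named in the steps above, and socket.gethostbyname_ex resolves IPv4 only, so IPv6 crawlers would need a different forward lookup.

```python
import socket


def verify_crawler_ip(ip, allowed_suffixes=(".googlebot.com", ".google.com"),
                      reverse_lookup=None, forward_lookup=None):
    """Return True only if reverse and forward DNS agree on an allowed hostname."""
    # Default to real DNS lookups; tests can inject fakes instead.
    reverse_lookup = reverse_lookup or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward_lookup = forward_lookup or (lambda host: socket.gethostbyname_ex(host)[2])

    # Step 1: resolve the request IP to a hostname and check its suffix.
    try:
        hostname = reverse_lookup(ip)
    except OSError:  # covers socket.herror and socket.gaierror
        return False
    if not hostname.lower().rstrip(".").endswith(tuple(allowed_suffixes)):
        return False

    # Step 2: resolve the hostname forward and confirm it maps back to the same IP.
    try:
        addresses = forward_lookup(hostname)
    except OSError:
        return False
    return ip in addresses
```

DNS lookups are slow relative to request handling, so in practice the result would usually be cached per IP rather than computed on every request.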
Testing playbook: don’t ship blind
Before pushing global changes:
- Stage robots.txt and WAF rules in a non-production environment with mirrored traffic if possible.
- Automate curl tests in CI to validate expected 403/200 responses by user-agent.
- Guardrails: Add alerts for sudden drops in Googlebot/Bingbot hits.
- Rollback plan: Keep prior configs ready to restore within minutes.
Frequently asked questions
Will blocking AI crawlers hurt my SEO?
Not if you’re careful. Block AI-specific agents and product tokens, not general-purpose search bots. Use verified allowlists for Googlebot and Bingbot.
Is robots.txt enough to stop AI training?
No. It’s a strong preference signal most reputable providers follow. For actual enforcement, add WAF rules, server-level blocks, and rate limits.
What about images and PDFs?
Apply X-Robots-Tag: noai, noimageai to non-HTML responses and consider CDN path-based rules. Many AI systems crawl images and document files aggressively.
Should I block Common Crawl?
Many organizations do if they prefer not to appear in open training datasets. Others allow it for research value. It’s a policy choice—robots.txt makes it easy to toggle.
Can AI crawlers spoof their identity?
Yes. That’s why edge and server enforcement, plus log monitoring, are critical. Robots.txt alone is advisory.
How often should I update my blocklist?
Quarterly is a good baseline, sooner if you detect new actors. Subscribe to provider announcements and watch your logs for unknown agents.
Copy-and-paste policy kit
Use these templates as a starting point.
Robots.txt baseline
# AI training and answer-engine crawlers: blocked
User-agent: GPTBot
Disallow: /
User-agent: OAI-SearchBot
Disallow: /
User-agent: ChatGPT-User
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: anthropic-ai
Disallow: /
User-agent: PerplexityBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: Applebot-Extended
Disallow: /
User-agent: Meta-External
Disallow: /
User-agent: FacebookBot
Disallow: /
User-agent: Bytespider
Disallow: /
# Search bots: allowed
User-agent: Googlebot
Disallow:
User-agent: Bingbot
Disallow:
User-agent: DuckDuckBot
Disallow:
Apache headers and blocks
# Add AI policy headers for all content
Header set X-Robots-Tag "noai, noimageai"
# Deny AI user-agents
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (GPTBot|OAI-SearchBot|ChatGPT-User|CCBot|ClaudeBot|anthropic-ai|PerplexityBot|Google-Extended|Applebot-Extended|Meta-External|FacebookBot|Bytespider) [NC]
RewriteRule ^ - [F]
Nginx headers and blocks
add_header X-Robots-Tag "noai, noimageai" always;
map $http_user_agent $block_ai {
    default 0;
    "~*GPTBot" 1;
    "~*OAI-SearchBot" 1;
    "~*ChatGPT-User" 1;
    "~*CCBot" 1;
    "~*ClaudeBot" 1;
    "~*anthropic-ai" 1;
    "~*PerplexityBot" 1;
    "~*Google-Extended" 1;
    "~*Applebot-Extended" 1;
    "~*Meta-External" 1;
    "~*FacebookBot" 1;
    "~*Bytespider" 1;
}
server {
    if ($block_ai) { return 403; }
    # remaining config...
}
Cloudflare (conceptual) WAF rules
(http.user_agent contains "GPTBot") or
(http.user_agent contains "OAI-SearchBot") or
(http.user_agent contains "ChatGPT-User") or
(http.user_agent contains "CCBot") or
(http.user_agent contains "ClaudeBot") or
(http.user_agent contains "anthropic-ai") or
(http.user_agent contains "PerplexityBot") or
(http.user_agent contains "Google-Extended") or
(http.user_agent contains "Applebot-Extended") or
(http.user_agent contains "Meta-External") or
(http.user_agent contains "FacebookBot") or
(http.user_agent contains "Bytespider")
-> Block
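# Rate limit unknown high-frequency agents
# (conceptual: in Cloudflare the request threshold and period are set on the rate limiting rule itself, not in the match expression)
not (lower(http.user_agent) contains "googlebot" or lower(http.user_agent) contains "bingbot" or lower(http.user_agent) contains "duckduckbot")
-> Rate Limit (429) when one IP exceeds 120 requests in 60 seconds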
Measurement: prove your policy works
To demonstrate impact to stakeholders, build a simple measurement framework:
- Baseline: Capture 30 days of pre-policy bot traffic by user-agent and ASN.
- KPIs: Percent of traffic from AI bots; bandwidth consumed; origin CPU time; 403 counts.
- SEO guardrails: Track impressions and clicks from search consoles to ensure no negative impact.
- Post-change deltas: Compare weekly after rollout; annotate dashboards.
Look for reductions in unwanted bot bandwidth and origin load. Share wins with legal and leadership to validate the program.
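The "percent of traffic from AI bots" KPI is simple arithmetic once requests are counted per user-agent. Here is a minimal Python sketch. The function name ai_bot_share is hypothetical, and the counts in the usage note are made-up numbers for illustration, not measurements.

```python
def ai_bot_share(requests_by_agent, ai_tokens):
    """Return the percentage of requests whose user-agent contains an AI token."""
    total = sum(requests_by_agent.values())
    if total == 0:
        return 0.0
    lowered = [token.lower() for token in ai_tokens]
    ai_hits = sum(
        count
        for agent, count in requests_by_agent.items()
        if any(token in agent.lower() for token in lowered)
    )
    return 100.0 * ai_hits / total
```

With a hypothetical baseline of {"GPTBot/1.0": 300, "Mozilla/5.0 Chrome": 700} the share is 30.0 percent. If the post-rollout counts were {"GPTBot/1.0": 50, "Mozilla/5.0 Chrome": 950}, the share would be 5.0 percent, a 25-point reduction to report alongside the bandwidth and SEO guardrail metrics.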
Strategic perspective: block, allow, or license?
Blocking AI crawlers is one option on a continuum:
- Block by default: Maximum control, lowest risk of unintended use.
- Selective allow: Permit certain crawlers for discovery or research value.
- License: Offer structured access (feeds, APIs) under contract for revenue and attribution.
Many publishers start with blocking while evaluating partnership models. The key is to choose a deliberate posture rather than defaulting into AI training by inaction.
Authoritative sources and market signals
While we cannot link here, the following sources inform best practices and highlight the scale of the problem:
- Imperva 2024 Bad Bot Report: Documents that 49.6% of traffic was non-human in 2023 and bad bots hit a record 32%.
- Common Crawl: Describes monthly crawls indexing billions of pages that power research and training datasets.
- Pew Research Center, 2024: Reports rising public concern about AI and its impacts.
- Provider documentation from OpenAI (GPTBot, OAI-SearchBot), Google (Google-Extended), Apple (Applebot-Extended), Anthropic (ClaudeBot/anthropic-ai), Meta (Meta-External/FacebookBot), Perplexity (PerplexityBot).
Combine these insights with your own analytics to tailor a policy that fits your risk, brand, and monetization strategy.
Implementation checklist
- Policy: Decide default stance and exceptions.
- robots.txt: Add targeted blocks for AI user-agents and product tokens.
- Headers: Send X-Robots-Tag: “noai, noimageai” for HTML and assets.
- WAF/CDN: Block by user-agent; add rate limits; consider IP/ASN controls.
- Server: Configure Apache/Nginx 403 rules as backup.
- Assets: Extend controls to images, PDFs, APIs, and feeds.
- Allowlists: Keep Googlebot/Bingbot/DuckDuckBot unblocked; verify via rDNS.
- Testing: Curl with AI user-agents; validate headers and status codes.
- Monitoring: Track bot traffic share, bandwidth, and 403s; watch SEO metrics.
- Governance: Version control, change logs, periodic reviews, and owner accountability.
Bottom line from Watsspace: The most effective way to block AI web crawlers is a layered approach—declare your policy (robots.txt, headers), enforce it (WAF and server rules), verify continuously (testing and logs), and govern it (policy owners and change control). This gives you control without sacrificing SEO or performance.