Unwanted AI crawlers can quietly devour server resources, slow down your pages, and waste bandwidth that should be serving real users. If you’ve noticed unexplained traffic spikes, elevated CDN bills, or crawl patterns that don’t correlate with search performance, there’s a good chance automation—not humans—is the culprit. In this guide, the Watsspace Digital Marketing team breaks down exactly how to identify and block AI crawlers that slow down your website, limit exposure of your content to model training, and preserve your budget without risking your SEO.

What Are AI Crawlers and Why Do They Impact Performance?

AI crawlers are automated agents that fetch web pages to train large language models (LLMs), power AI assistants, or support AI-enhanced search experiences. Unlike search engine crawlers that index your site to drive user traffic, many AI crawlers are focused on data extraction rather than discoverability. While some are respectful and transparent, others crawl aggressively or ignore best practices, leading to:

Server strain: Concurrent requests create CPU and I/O pressure, spiking response times.
Bandwidth waste: Repeatedly fetching heavy assets (images, JS bundles) increases CDN and origin egress costs.
Opportunity cost: Bots can saturate connection pools and push up response times, slowing legitimate users. When a server slows down, search engines may also reduce how fast they crawl it, which eats into your crawl budget.
Content control risks: Scraped content may be used to train models without attribution or benefit to your brand.

Independent industry research indicates that a substantial share of web traffic is automated, and a significant portion of that traffic is “bad” or unwanted.

Imperva, Bad Bot Report 2024

How to Tell If AI Crawlers Are Hitting Your Site

Before you block anything, confirm whether AI bots are materially affecting your site. Use a structured approach:

1) Inspect logs for telltale user agents

Look for distinct UA tokens such as GPTBot, ClaudeBot, CCBot, PerplexityBot, Google-Extended, Applebot-Extended, Amazonbot, Bytespider, omgili/omgilibot, and others.
Sort by requests per UA and by unique IPs to gauge crawl intensity and distribution.
Flag UAs with low request entropy (same paths repeatedly), which often indicates scraping.
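
The checks above can be sketched in a few lines of Python. This is a minimal illustration, assuming your access logs use the Combined Log Format; the function names and the token list (taken from the tokens named in this guide) are illustrative and should be adapted to your own logs.

```python
import math
import re
from collections import Counter, defaultdict

# Assumed format (Combined Log Format):
# ip - user [time] "METHOD path PROTO" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) (?P<bytes>\S+) "[^"]*" "(?P<ua>[^"]*)"'
)

# Illustrative subset of the UA tokens discussed in this guide.
AI_TOKENS = ["GPTBot", "ClaudeBot", "CCBot", "PerplexityBot", "Amazonbot", "Bytespider", "omgili"]


def ua_token(user_agent):
    """Return the first known AI token found in the UA string, else None."""
    lowered = user_agent.lower()
    for token in AI_TOKENS:
        if token.lower() in lowered:
            return token
    return None


def path_entropy(paths):
    """Shannon entropy (bits) of the requested paths; low values mean the same paths repeat."""
    counts = Counter(paths)
    total = sum(counts.values())
    return sum((n / total) * math.log2(total / n) for n in counts.values())


def summarize(lines):
    """Aggregate request count, unique IPs, and path entropy per AI token."""
    paths = defaultdict(list)
    ips = defaultdict(set)
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines that do not fit the assumed format
        token = ua_token(match.group("ua"))
        if token is None:
            continue
        paths[token].append(match.group("path"))
        ips[token].add(match.group("ip"))
    return {
        token: {
            "requests": len(seen),
            "unique_ips": len(ips[token]),
            "path_entropy": round(path_entropy(seen), 3),
        }
        for token, seen in paths.items()
    }
```

A token with many requests, few IPs, and entropy near zero is fetching the same handful of URLs over and over, which is worth a closer look.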

2) Measure bandwidth and origin impact

Check egress by UA on your CDN and origin logs.
Correlate spikes in TTFB, queue times, and 5xx/429 with crawler surges.
Identify heavy endpoints (e.g., image galleries, feeds, search endpoints) being hammered.
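
As a rough sketch of the egress check, the snippet below sums response bytes per user agent. It assumes Combined Log Format lines where the response size is the field after the status code; the function name is illustrative, and CDN log exports will usually need a different parser.

```python
import re
from collections import Counter

# Matches the tail of a Combined Log Format line: status, bytes, referer, user agent.
TAIL = re.compile(r'" (?P<status>\d{3}) (?P<bytes>\d+|-) "[^"]*" "(?P<ua>[^"]*)"$')


def egress_by_ua(lines):
    """Return (user_agent, total_bytes) pairs, largest first. A '-' size counts as 0."""
    totals = Counter()
    for line in lines:
        match = TAIL.search(line.rstrip())
        if match:
            size = 0 if match.group("bytes") == "-" else int(match.group("bytes"))
            totals[match.group("ua")] += size
    return totals.most_common()
```

Running the same aggregation grouped by path instead of user agent is a quick way to surface the heavy endpoints.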

3) Track request patterns

Note crawl cadence (e.g., bursts every few seconds vs. steady trickle).
Look for non-compliance with robots.txt or direct hits to disallowed paths.
Check for headless fetches that skip HTML and pull JSON APIs directly.

4) Verify the crawler is genuine

Some actors spoof user agents. Validate via reverse DNS to the operator’s domain and forward-confirm the IP. Many reputable crawlers publish IP ranges or hostnames for verification.
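
The reverse-then-forward check can be sketched with Python's standard library. This is a minimal example under stated assumptions: the function name is illustrative, the allowed domain suffixes must come from the operator's own documentation, and `socket.gethostbyname_ex` only returns IPv4 addresses, so IPv6 crawlers need a different forward lookup.

```python
import socket


def verify_crawler_ip(ip, allowed_suffixes,
                      reverse=socket.gethostbyaddr,
                      forward=socket.gethostbyname_ex):
    """Forward-confirmed reverse DNS: PTR must end in an allowed domain and resolve back to ip."""
    try:
        hostname = reverse(ip)[0].rstrip(".").lower()
    except OSError:
        return False  # no PTR record or lookup failure
    # Require an exact domain or a true subdomain, so "evilgooglebot.com" does not pass.
    if not any(hostname == s or hostname.endswith("." + s) for s in allowed_suffixes):
        return False
    try:
        addresses = forward(hostname)[2]
    except OSError:
        return False
    return ip in addresses
```

For example, `verify_crawler_ip(client_ip, ["googlebot.com", "google.com"])` performs live DNS lookups; the `reverse` and `forward` parameters exist so the logic can be tested without network access. Cache results, since doing two DNS lookups per request is slow.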

Key AI Crawlers You May Want to Control

These are frequently observed AI-related crawlers and tokens as of 2024–2025. Not all are harmful, and many honor robots.txt. Your policy should reflect your goals.

Crawler / UA Token	Operator	Primary Purpose	Respects robots.txt	Verification Tip	Basic robots.txt Block
GPTBot	OpenAI	LLM training and retrieval	Yes (per OpenAI documentation)	Reverse DNS to openai.com and forward-confirm	User-agent: GPTBot Disallow: /
OAI-SearchBot / ChatGPT-User	OpenAI	ChatGPT browsing/search features	Yes (per OpenAI documentation)	RDNS to openai.com	User-agent: OAI-SearchBot Disallow: /
ClaudeBot / anthropic-ai	Anthropic	LLM training and web access	Yes (per Anthropic documentation)	RDNS to anthropic.com	User-agent: ClaudeBot Disallow: /
PerplexityBot / PPLX-*	Perplexity AI	Answer engine content retrieval	Yes (per Perplexity statements)	RDNS to perplexity.ai	User-agent: PerplexityBot Disallow: /
CCBot	Common Crawl	Open corpus for research/LLM training	Yes (per Common Crawl)	RDNS to commoncrawl.org	User-agent: CCBot Disallow: /
Google-Extended	Google	Controls use of content for generative AI (Gemini/Vertex)	Yes (per Google docs)	Verify Google IPs (googlebot.com)	User-agent: Google-Extended Disallow: /
Applebot-Extended	Apple	Controls Apple AI training usage	Yes (per Apple docs)	Verify applebot.apple.com	User-agent: Applebot-Extended Disallow: /
Amazonbot	Amazon	Content for Amazon services/AI	Yes (per Amazon docs)	Verify amazonbot on amazon.com hostnames	User-agent: Amazonbot Disallow: /
Bytespider	ByteDance	Content fetching (may include AI/use in apps)	Mixed reports	RDNS to bytedance-owned domains	User-agent: Bytespider Disallow: /
omgili / omgilibot	Webz.io (Omgili)	Aggregation/data mining	Mixed reports	RDNS to webz.io	User-agent: omgili Disallow: /
ia_archiver	Internet Archive	Wayback Machine snapshots	Yes	RDNS to archive.org	User-agent: ia_archiver Disallow: /

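Note: The above is not exhaustive, and the verification tips are starting points rather than guarantees: not every operator supports reverse DNS, and some publish IP ranges instead. In the last column, User-agent and Disallow go on separate lines in the actual robots.txt file. Always verify current documentation from the operator.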

Set a Clear Policy: What You Allow, What You Block

Your decision should align with business goals and risk tolerance:

Allow and monitor: If a crawler demonstrably drives traffic or visibility, let it run with constraints (rate limits, partial allow rules).
Block with robots.txt: Use for crawlers whose operators publicly commit to honoring it (e.g., GPTBot, CCBot, ClaudeBot, Google-Extended, Applebot-Extended).
Enforce at the edge: For non-compliant or aggressive crawlers, use WAF/CDN filters and server rules to return 403/429.
Segment by path: Allow public article pages but block AI access to downloads, APIs, or member content.
Evaluate legal and contractual obligations: If your content includes licensed assets, stricter controls may be necessary.

Blocking with robots.txt (Your First Line of Defense)

robots.txt is easy to implement and is honored by many reputable AI crawlers. Place the file at the root of each host you serve (for example, https://yourdomain.com/robots.txt; subdomains need their own copy) and ensure it is publicly reachable. Use bot-specific directives:

# Example robots.txt to block a range of AI crawlers
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: PerplexityBot
Disallow: /

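# robots.txt has no partial wildcards in User-agent values (only a lone * is special),
# so a pattern like "PPLX-*" would not match anything. List each exact token from
# the operator's documentation instead.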

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: Amazonbot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: omgili
Disallow: /

User-agent: omgilibot
Disallow: /

User-agent: ia_archiver
Disallow: /

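# Optional: Set a crawl-delay for bots that honor it (Googlebot ignores it).
# This group applies to every crawler without its own group above, so it also
# slows compliant search bots such as Bingbot.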
User-agent: *
Crawl-delay: 10

Tips for success:

Be explicit: Declare each user agent you care about, even if you also have a wildcard policy.
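Specificity beats order: Parsers that follow RFC 9309 pick the group with the most specific matching user agent, wherever it appears in the file. Keeping specific agents above general rules still helps readability and guards against older parsers that stop at the first match.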
Test: Use server logs and robots testing tools to ensure directives are visible and syntactically valid.
Remember the limitations: robots.txt is voluntary. Non-compliant scrapers will ignore it.
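
One way to test directives before deploying them is Python's built-in `urllib.robotparser`. The sketch below is illustrative: the `check_rules` helper and the sample rules are assumptions, and each operator runs its own parser, so a pass here is a sanity check rather than proof of how a given crawler will behave.

```python
from urllib.robotparser import RobotFileParser


def check_rules(robots_txt, checks):
    """Return {(agent, url): allowed} for each (agent, url) pair, per the stdlib parser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {(agent, url): parser.can_fetch(agent, url) for agent, url in checks}


# Sample policy: block GPTBot everywhere, keep search bots out of /search only.
ROBOTS = """\
User-agent: *
Disallow: /search

User-agent: GPTBot
Disallow: /
"""

results = check_rules(ROBOTS, [
    ("GPTBot", "https://example.com/article"),
    ("Googlebot", "https://example.com/article"),
    ("Googlebot", "https://example.com/search"),
])

for (agent, url), allowed in results.items():
    print(f"{agent:10} {url:35} {'allowed' if allowed else 'blocked'}")
```

Keeping a list of must-allow pairs (Googlebot and Bingbot on your key pages) and running it on every robots.txt change catches accidental overblocking early.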

Strengthen With Meta and Header Controls (Where Supported)

Some organizations discuss or honor emerging controls at the page or header level. Adoption varies, so treat these as supplementary:

<!-- In the <head> of specific pages -->
<meta name="robots" content="noai, noimageai">

# Or send as an HTTP header for non-HTML assets:
X-Robots-Tag: noai, noimageai

Support for noai/noimageai is not universal. Validate operator documentation before relying on these alone.

Enforce at the Edge: WAF and CDN Rules

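To block non-compliant or bandwidth-heavy crawlers, use your edge provider’s firewall. Keep in mind that Google-Extended and Applebot-Extended are robots.txt control tokens rather than user-agent strings sent with requests (per Google's and Apple's crawler documentation), so matching them in edge or server rules has no effect; they only work in robots.txt. They appear in the examples that follow for parity with the robots.txt list and will not match real traffic. Example expression: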

Cloudflare WAF Expression

(http.user_agent contains "GPTBot" or
 http.user_agent contains "OAI-SearchBot" or
 http.user_agent contains "ChatGPT-User" or
 http.user_agent contains "ClaudeBot" or
 http.user_agent contains "anthropic-ai" or
 http.user_agent contains "PerplexityBot" or
 http.user_agent contains "PPLX" or
 http.user_agent contains "CCBot" or
 http.user_agent contains "Google-Extended" or
 http.user_agent contains "Applebot-Extended" or
 http.user_agent contains "Amazonbot" or
 http.user_agent contains "Bytespider" or
 http.user_agent contains "omgili" or
 http.user_agent contains "omgilibot" or
 http.user_agent contains "ia_archiver")

Action options:

Block for outright denial.
Challenge (JavaScript or managed challenge) for suspected bots that might be spoofed.
Rate limit when some access is acceptable but you want to cap bandwidth.

Server-Level Blocking Rules (Apache, Nginx, IIS, Varnish)

When you control the origin, you can block user agents at the web server. Always test in staging before deploying to production.

Apache (.htaccess or vhost)

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (GPTBot|OAI-SearchBot|ChatGPT-User|ClaudeBot|anthropic-ai|PerplexityBot|PPLX|CCBot|Google-Extended|Applebot-Extended|Amazonbot|Bytespider|omgili|omgilibot|ia_archiver) [NC]
RewriteRule ^ - [F]

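Nginx (http and server blocks)

# The map block must sit in the http {} context (for example in nginx.conf
# or a conf.d include), not inside server {}.
map $http_user_agent $block_ai {
  default 0;
  # ~* makes the regex match case-insensitively
  ~*(gptbot|oai-searchbot|chatgpt-user|claudebot|anthropic-ai|perplexitybot|pplx|ccbot|google-extended|applebot-extended|amazonbot|bytespider|omgili|omgilibot|ia_archiver) 1;
}

server {
  ...
  # Reject flagged crawlers before any further processing
  if ($block_ai) { return 403; }
  ...
}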

IIS (URL Rewrite rule snippet)

<rule name="Block AICrawlers" stopProcessing="true">
  <match url="(.*)" />
  <conditions logicalGrouping="MatchAny">
    <add input="{HTTP_USER_AGENT}" pattern="GPTBot|OAI-SearchBot|ChatGPT-User|ClaudeBot|anthropic-ai|PerplexityBot|PPLX|CCBot|Google-Extended|Applebot-Extended|Amazonbot|Bytespider|omgili|omgilibot|ia_archiver" />
  </conditions>
  <action type="CustomResponse" statusCode="403" statusReason="Forbidden" statusDescription="Forbidden"/>
</rule>

Varnish (VCL)

# Place inside sub vcl_recv so the check runs before the cache lookup.
if (req.http.User-Agent ~ "(?i)GPTBot|OAI-SearchBot|ChatGPT-User|ClaudeBot|anthropic-ai|PerplexityBot|PPLX|CCBot|Google-Extended|Applebot-Extended|Amazonbot|Bytespider|omgili|omgilibot|ia_archiver") {
  return (synth(403, "Forbidden"));
}

Caution: UA-based blocks are easy to implement but can be bypassed by spoofing. For high-value properties, augment with IP/rDNS verification and behavioral controls.
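
Before deploying any of the patterns above, it helps to run them against user agents that must never be blocked. The sketch below rebuilds the same alternation in Python and checks it against a few sample UA strings; the sample strings and helper names are illustrative, so replace them with real values from your own logs.

```python
import re

# Same tokens as the server rules above.
BLOCKED_TOKENS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User", "ClaudeBot", "anthropic-ai",
    "PerplexityBot", "PPLX", "CCBot", "Google-Extended", "Applebot-Extended",
    "Amazonbot", "Bytespider", "omgili", "omgilibot", "ia_archiver",
]
BLOCK_RE = re.compile("|".join(re.escape(t) for t in BLOCKED_TOKENS), re.IGNORECASE)

# Sample user agents that should always get through (assumed examples).
MUST_PASS = [
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0 Safari/537.36",
]


def is_blocked(user_agent):
    """True if the UA string contains any blocked token (case-insensitive substring match)."""
    return BLOCK_RE.search(user_agent) is not None


def overblocked(samples=MUST_PASS):
    """Return the must-pass user agents that the pattern would wrongly block."""
    return [ua for ua in samples if is_blocked(ua)]
```

An empty list from `overblocked()` means none of the sample search engine or browser UAs is caught by the pattern. Short tokens are the usual cause of false positives, so re-run the check whenever the list changes.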

Verify Genuine Crawlers (and Avoid Collateral Damage)

Trusted bots often publish verification methods. Use these to avoid accidentally blocking legitimate services that your SEO depends on:

Reverse/forward DNS: Confirm the crawler IP resolves to the operator’s domain and forward-resolves back to the same IP.
Published IP ranges: Some operators share netblocks. Avoid hardcoding unless you can maintain updates.
Header markers: Certain bots include secondary identifiers (but don’t rely on these alone).
Rate and pattern analysis: Real search bots crawl predictably and respect server signals; scrapers often surge or ignore errors.

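Example: Google’s crawlers reverse-resolve to googlebot.com or google.com. Other operators differ: some publish IP ranges rather than supporting reverse DNS (OpenAI, for instance, publishes IP lists for its crawlers), so check each operator's current documentation for the method it supports.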

Rate Limiting and Crawl Budget Controls

Blocking isn’t always all-or-nothing. If a bot provides some value, combine rate limiting with path-level allows:

CDN rate limits: Cap requests per minute per IP and exempt known search engine IPs.
Dynamic tarpit: Slow response times after a threshold to discourage aggressive crawlers.
Path-based rules: Allow access only to lightweight, cacheable pages; block heavy endpoints like search, feeds, and APIs.
Asset controls: Serve lower-resolution or cached variants to bots to reduce bandwidth.
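
Most CDNs provide rate limiting as a managed feature, but the underlying idea is easy to illustrate. The sketch below is a token bucket keyed by client IP; the class name, rate, and burst values are assumptions for illustration, and a production setup would keep this state at the edge or in a shared store rather than in process memory.

```python
import time


class TokenBucket:
    """Per-key token bucket: each key may burst up to `burst` requests, refilled at `rate_per_sec`."""

    def __init__(self, rate_per_sec, burst, clock=time.monotonic):
        self.rate = rate_per_sec
        self.burst = burst
        self.clock = clock          # injectable so the logic can be tested without sleeping
        self.buckets = {}           # key -> (tokens_remaining, last_seen_timestamp)

    def allow(self, key):
        """Return True if the request is within the limit, False if it should get a 429."""
        now = self.clock()
        tokens, last = self.buckets.get(key, (self.burst, now))
        # Refill in proportion to elapsed time, capped at the burst size.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens >= 1:
            self.buckets[key] = (tokens - 1, now)
            return True
        self.buckets[key] = (tokens, now)
        return False
```

With `TokenBucket(rate_per_sec=1, burst=10)`, a client can send ten requests at once and then roughly one per second. Requests from verified search engine IPs can simply skip the `allow` call.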

Sample Mixed Policy robots.txt

User-agent: *
Disallow: /search
Disallow: /wp-admin/
Disallow: /checkout
Disallow: /api/

# Block LLM/AI training agents
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: Amazonbot
Disallow: /

Monitor Results and Prove the ROI

Blocking is only successful if it improves performance or saves costs without harming visibility. Track:

Bandwidth reduction: Compare CDN egress pre/post implementation by user agent and path.
Performance gains: Monitor TTFB and response times during peak hours.
Error rates: Watch for 4xx/5xx changes that might indicate overblocking.
SEO health: Confirm that Googlebot/Bingbot crawl rates and indexation remain stable.

Consider creating a week-over-week dashboard that pairs “blocked requests” with “cost per 1,000 requests” to quantify savings. Even modest reductions in bot traffic can produce significant cost benefits at scale.
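
The savings figure for such a dashboard is simple arithmetic. The sketch below is a hypothetical helper; the prices and volumes in the usage note are made-up inputs, not benchmarks, so substitute the rates from your own CDN or hosting bill.

```python
def estimated_savings(blocked_requests, cost_per_1000_requests,
                      avg_response_bytes=0, cost_per_gb_egress=0.0):
    """Estimate cost avoided by blocked requests: per-request charges plus egress not served."""
    request_cost = blocked_requests / 1000 * cost_per_1000_requests
    egress_gb = blocked_requests * avg_response_bytes / 1e9
    return round(request_cost + egress_gb * cost_per_gb_egress, 2)
```

For instance, with an assumed 2,000,000 blocked requests at 0.50 per 1,000 requests, the request charge avoided is 2,000 × 0.50 = 1,000. Passing an average response size and a per-GB egress price adds the bandwidth component on top.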

Benchmarks and Research to Inform Your Strategy

Bot traffic share: Industry analyses have repeatedly found that automated traffic comprises a substantial portion of web requests, with a meaningful share categorized as “bad” or unwanted. Source: Imperva, Bad Bot Report 2024.
Common Crawl scale: Common Crawl maintains a multi-petabyte corpus and routinely fetches billions of pages per crawl. Source: Common Crawl Foundation.
Robots compliance: OpenAI has publicly stated GPTBot honors robots.txt; Anthropic points to similar controls for ClaudeBot. Sources: OpenAI documentation; Anthropic documentation.

These macro figures underscore why a default-allow posture can be costly—and why clear, enforceable rules are essential.

Protect Dynamic and Private Areas First

Prioritize areas where AI crawlers create the most harm:

Search endpoints: Block /search, /?s=, or custom search APIs.
Account and checkout: Always disallow /account, /checkout, /cart, and authenticated dashboards.
High-churn content: Block endpoints that change very frequently and provide little public value if copied (e.g., inventory APIs).
Media-heavy directories: Limit bots to cached or thumbnail variants to save bandwidth.

Minimize Risk to SEO and Discoverability

You want to block AI crawlers without undermining legitimate search traffic. Follow these safeguards:

Do not block Googlebot/Bingbot unless you fully understand the consequences.
Test in a staging environment: Simulate crawler UAs to verify responses.
Granular rules: Keep AI bot blocks scoped to their UAs/tokens. Avoid “User-agent: * Disallow: /” unless you truly intend global blocking.
Whitelist verified good bots: Use rDNS to ensure you never challenge search engine IPs.

Dealing With Spoofing and Gray-Area Crawlers

Some scrapers masquerade as legitimate agents or random browsers. Mitigation tactics:

Behavioral detection: Identify abnormal fetch rates, headless signatures, or unique header patterns.
Progressive friction: Start with rate limits and challenges; escalate to 403 blocks on repeat offenders.
AS/geo-based controls: Block or throttle specific autonomous systems known for scraping (careful with collateral impact).
Honeypot URLs: Place a disallowed, invisible link and block any UA/IP that requests it.

Be mindful that UA spoofing is common. Combine techniques for robust protection.
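
The honeypot and progressive-friction ideas can be combined in a small decision function. This is a sketch under stated assumptions: the trap path, the thresholds, and the record shape are hypothetical, and real deployments would feed the resulting actions into WAF rules.

```python
from collections import Counter

# Hypothetical trap URL: disallowed in robots.txt and linked invisibly, so no
# compliant crawler or human visitor should ever request it.
TRAP_PATH = "/trap-do-not-follow/"


def decide_actions(records, challenge_after=1, block_after=3):
    """records: iterable of (ip, path, over_rate_limit). Returns {ip: 'challenge' | 'block'}."""
    strikes = Counter()
    trapped = set()
    for ip, path, over_limit in records:
        if path.startswith(TRAP_PATH):
            trapped.add(ip)          # honeypot hit: escalate straight to block
        elif over_limit:
            strikes[ip] += 1         # each rate-limit violation adds one strike
    actions = {}
    for ip in trapped | set(strikes):
        if ip in trapped or strikes[ip] >= block_after:
            actions[ip] = "block"
        elif strikes[ip] >= challenge_after:
            actions[ip] = "challenge"
    return actions
```

IPs that never trip the rate limit or the trap do not appear in the result, so the default for everyone else stays "allow".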

Content-Level Strategies to Reduce Crawl Value

Even with blocks, some scraping attempts will reach your content. Reduce their ROI:

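Lightweight responses for bots: Serve simplified HTML for suspicious UAs to minimize payload size. Exempt verified search engine crawlers, since serving them different content than users see can be treated as cloaking.
Selective hydration: Avoid embedding large JSON payloads in server-rendered HTML for anonymous requests on sensitive pages.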
API tokenization: Require auth or signed URLs for expensive operations and datasets.
Watermarking and attribution: Where relevant, watermark media and reinforce branding to preserve attribution if copied.
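
Signed, time-limited URLs are one common way to implement the API tokenization idea. The sketch below uses an HMAC over the path and expiry time; the secret, the query parameter names, and the function names are all hypothetical, and the same idea applies to short-lived tokens for content APIs.

```python
import hashlib
import hmac
import time

SECRET = b"replace-with-a-real-secret"  # hypothetical key; load from a secret store in practice


def _signature(path, expires_at, secret):
    """HMAC-SHA256 over the path and expiry, hex-encoded."""
    message = f"{path}:{expires_at}".encode()
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def sign_url(path, expires_at, secret=SECRET):
    """Return the path with expiry and signature appended as query parameters."""
    return f"{path}?expires={expires_at}&sig={_signature(path, expires_at, secret)}"


def verify(path, expires_at, sig, now=None, secret=SECRET):
    """Accept only unexpired requests whose signature matches the path and expiry."""
    now = time.time() if now is None else now
    if now > expires_at:
        return False
    # compare_digest avoids leaking information through comparison timing.
    return hmac.compare_digest(_signature(path, expires_at, secret), sig)
```

A page rendered for a real visitor would embed `sign_url("/api/inventory", int(time.time()) + 300)`, so a scraper that requests the API directly, or replays an old link, is rejected.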

Governance: Document and Automate Your Policy

Treat bot management as an ongoing program, not a one-off change:

Maintain a crawler registry: Track UA tokens, last verification date, known IP ranges, and your current policy.
Automate updates: Use infrastructure-as-code (e.g., Terraform for CDN rules) to roll out changes consistently.
Quarterly reviews: Audit logs, update blocklists, and re-evaluate trade-offs as AI ecosystems evolve.
Incident playbooks: Define thresholds for emergency rate limiting and communication steps for spikes.

Legal and Ethical Considerations

This is not legal advice, but consider:

Terms of Service: Explicitly disallow scraping and AI training in your site policies if that aligns with your stance.
Robots.txt signals: Many reputable AI providers commit to honoring these signals; use them to communicate policy.
Privacy: Ensure that bot-blocking measures do not expose user data or logs inadvertently.
Research and archival: If you value inclusion in public archives (e.g., Wayback Machine), tailor your blocks accordingly.

Step-by-Step: A Fast, Practical Playbook

Baseline: Pull 14–30 days of access logs. Quantify requests, bandwidth, and errors by UA and path.
Prioritize: Identify top offenders by egress and performance impact.
Plan: Decide robots-first vs. edge/server enforcement for each crawler.
Implement robots.txt: Add disallows for GPTBot, ClaudeBot, CCBot, PerplexityBot, Google-Extended, Applebot-Extended, Amazonbot, and any others on your list.
Add WAF rules: Block or challenge non-compliant UAs; rate limit bursty patterns.
Harden servers: Deploy Apache/Nginx/IIS rules; protect heavy endpoints and APIs.
Verify: Confirm in logs that blocked crawlers fetch robots.txt and stop requesting disallowed paths; validate no impact on Googlebot/Bingbot.
Measure: Track bandwidth, TTFB, and error changes for 1–2 weeks.
Iterate: Refine path-level rules and rate limits; update your registry and documentation.

Advanced Controls: IP Verification and Dynamic Responses

For high-security environments or premium content libraries:

IP allow lists: Only permit search engine IPs and select partners to crawl; serve 403 to others. Risk: maintenance overhead.
Dynamic token checks: Require time-bound tokens for content APIs to prevent blind harvesting.
Geofencing: Limit crawl access to geographies where your legal framework is favorable.
Edge workers: Use serverless functions to score requests (UA + behavior) and alter responses in real time.
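
The request-scoring idea behind an edge worker can be prototyped offline before it is ported to your edge platform. The sketch below is illustrative only: the fields, weights, and thresholds are assumptions that would need tuning against real traffic.

```python
from dataclasses import dataclass

# Illustrative subset of AI crawler tokens, lowercased for matching.
AI_TOKENS = ("gptbot", "claudebot", "ccbot", "perplexitybot", "bytespider")


@dataclass
class Request:
    user_agent: str
    requests_last_minute: int
    verified_search_engine: bool = False   # e.g. passed reverse/forward DNS verification
    hit_disallowed_path: bool = False


def score(req):
    """Higher scores mean more bot-like. All weights are assumed values, not recommendations."""
    if req.verified_search_engine:
        return 0
    points = 0
    if any(token in req.user_agent.lower() for token in AI_TOKENS):
        points += 50
    if req.requests_last_minute > 60:
        points += 30
    if req.hit_disallowed_path:
        points += 30
    if not req.user_agent.strip():
        points += 20
    return points


def action(req):
    """Map a score to a response: allow, rate_limit, challenge, or block."""
    total = score(req)
    if total >= 80:
        return "block"
    if total >= 50:
        return "challenge"
    if total >= 30:
        return "rate_limit"
    return "allow"
```

Checking verification first means a confirmed search engine crawler is never challenged, whatever its request rate.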

Communicating Your Policy to AI Providers

When possible, use official channels to reinforce your stance:

Robots.txt: The canonical, machine-readable declaration.
AI-focused directives: For providers offering “Extended” tokens (e.g., Google-Extended, Applebot-Extended), ensure explicit disallows.
Contact forms: Some organizations accept opt-out requests; document your submissions.
Consistent messaging: Align your site Terms with your robots and headers.

Common Mistakes to Avoid

Overblocking: Accidentally blocking legitimate search engine crawlers due to broad regexes.
Set-and-forget robots: Failing to revisit your policy as the AI crawler landscape evolves.
Ignoring spoofing: Relying only on UA strings without additional checks or rate controls.
Blocking at origin only: Neglecting to implement controls at the CDN edge where they are most cost-effective.

Realistic Outcomes: What Success Looks Like

Within a few weeks of a well-implemented strategy, Watsspace clients typically see:

Bandwidth savings: A measurable reduction in egress associated with blocked UAs and heavy endpoints.
Faster pages: Lower TTFB during peak windows when crawlers previously contended with users.
Stable SEO: Unchanged or improved crawl health from search engines due to freed capacity.
Fewer anomalies: Reduced spike incidents and fewer false alerts in monitoring systems.

FAQ: Blocking AI Crawlers Without Hurting SEO

Will blocking GPTBot or ClaudeBot hurt my Google rankings?

No. Those crawlers are unrelated to Google Search. Keep your rules precise and avoid touching Googlebot or Bingbot.

Is robots.txt enough?

It’s a great starting point for reputable providers, but non-compliant scrapers will ignore it. Combine robots with WAF and server rules.

Should I block Common Crawl?

If you don’t want your content in open datasets often used for AI training, yes—CCBot honors robots.txt. Some publishers allow it for research value; decide based on your content strategy.

What if a bot spoofs a user agent?

Use reverse DNS/IP verification, rate limits, and behavioral detection. Consider challenging or blocking suspicious traffic that fails verification.

What response code should I return?

403 Forbidden is common for policy-denied bots. 429 Too Many Requests fits rate limiting scenarios. Avoid 404 for policy blocks to keep monitoring clear.

Putting It All Together: A Balanced, Durable Strategy

Unwanted AI crawling will continue evolving, but your defenses can stay one step ahead with a layered approach:

Declare your policy in robots.txt and headers where supported.
Enforce it at the edge with WAF rules and at origin with server configs.
Verify genuine bots and whitelist search engines to protect SEO.
Measure savings and performance gains to justify the strategy to stakeholders.
Iterate quarterly as the crawler ecosystem changes.

The goal isn’t to wage war on all automation; it’s to ensure only value-adding bots get your bandwidth. By implementing the practical steps in this guide—backed by authoritative signals from providers like OpenAI, Anthropic, Google, Apple, Amazon, and Common Crawl—you can keep your site fast, your costs predictable, and your content under your control.
