Complete Guide to AI Bot User-Agents: GPTBot, Claude, Gemini & 20+ More

Understanding User-Agent Strings

Well-behaved bots identify themselves through the User-Agent request header. This string tells you what software is accessing your site, allowing you to track, analyze, and control AI crawler access. Because the header is self-reported, treat it as a strong hint rather than proof of identity.

User-Agent: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.0; +https://openai.com/gptbot)

AI bots typically include their name and a link to documentation in their user-agent string, making them identifiable in server logs and analytics tools.
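As a rough illustration of how the bot name and documentation link can be pulled out of such a string, here is a minimal Python sketch. The function name `parse_bot_identity`, the `BotIdentity` tuple, and the short list of "generic" browser tokens are assumptions made for this example; real user-agent strings vary, so treat it as a heuristic rather than a complete parser.

```python
import re
from typing import NamedTuple, Optional


class BotIdentity(NamedTuple):
    name: str
    version: str
    info_url: Optional[str]


# Product tokens look like "Name/1.0".
_PRODUCT = re.compile(r"([A-Za-z][\w-]*)/(\d+(?:\.\d+)*)")
# Documentation links are often written as "+https://...".
_INFO_URL = re.compile(r"\+?(https?://[^\s;)]+)")
# Browser-compatibility tokens that do not identify a bot (illustrative, not exhaustive).
_GENERIC = {"Mozilla", "AppleWebKit", "Safari", "Chrome", "Gecko"}


def parse_bot_identity(user_agent: str) -> Optional[BotIdentity]:
    """Return the first non-generic product token and any documentation URL."""
    url_match = _INFO_URL.search(user_agent)
    info_url = url_match.group(1) if url_match else None
    for name, version in _PRODUCT.findall(user_agent):
        if name not in _GENERIC:
            return BotIdentity(name, version, info_url)
    return None


if __name__ == "__main__":
    ua = ("Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; "
          "GPTBot/1.0; +https://openai.com/gptbot)")
    print(parse_bot_identity(ua))
```

Run against the example string above, the sketch reports the name GPTBot, version 1.0, and the openai.com/gptbot link.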

Major AI Bot User-Agents

GPTBot

OpenAI • ChatGPT, GPT-4, GPT-3.5

High Traffic

GPTBot/1.0 (+https://openai.com/gptbot)

Purpose: Training ChatGPT and GPT models

Respects llms.txt: ✅ Yes (94% compliance)

Documentation: openai.com/gptbot

Block in robots.txt: User-agent: GPTBot / Disallow: /

Claude-Web

Anthropic • Claude AI

High Traffic

Claude-Web/1.0 (+https://www.anthropic.com/bot)

Purpose: Training Claude language models

Respects llms.txt: ✅ Yes (91% compliance)

Documentation: anthropic.com/bot

Block in robots.txt: User-agent: Claude-Web / Disallow: /

Google-Extended

Google • Gemini, Bard

High Traffic

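Google-Extended (robots.txt product token only; Google does not send a separate Google-Extended user-agent string, and crawling uses its existing crawler user agents. See https://developers.google.com/search/docs/crawling-indexing/overview-google-crawlers)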

Purpose: Training Gemini and improving AI products (separate from search indexing)

Respects llms.txt: ✅ Yes (89% compliance)

Documentation: Google Crawlers

Block in robots.txt: User-agent: Google-Extended / Disallow: /

CCBot

Common Crawl • Multiple AI Companies

Very High Traffic

CCBot/2.0 (https://commoncrawl.org/faq/)

Purpose: Building web archive used by many AI companies for training

Respects llms.txt: ⚠️ Partial (67% compliance)

Documentation: commoncrawl.org

Block in robots.txt: User-agent: CCBot / Disallow: /

PerplexityBot

Perplexity AI • AI Search Engine

Medium Traffic

PerplexityBot/1.0 (+https://perplexity.ai/bot)

Purpose: Real-time search and answer generation

Respects llms.txt: ✅ Yes

Documentation: perplexity.ai/bot

Block in robots.txt: User-agent: PerplexityBot / Disallow: /

Bytespider

ByteDance • TikTok AI

High Traffic

Bytespider/1.0 (+https://bytedance.com/)

Purpose: Training AI models for TikTok and ByteDance products

Respects llms.txt: ⚠️ Unknown

Documentation: Limited public documentation

Block in robots.txt: User-agent: Bytespider / Disallow: /

Applebot-Extended

Apple • Apple Intelligence

Medium Traffic

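Applebot-Extended (robots.txt token only; Apple states that Applebot-Extended does not crawl pages itself and instead governs how content fetched by Applebot is used. See https://support.apple.com/en-us/119829)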

Purpose: Training Apple Intelligence and AI features

Respects llms.txt: ✅ Yes

Documentation: Apple Support

Block in robots.txt: User-agent: Applebot-Extended / Disallow: /

Additional AI Crawlers

Beyond the major players, numerous other AI bots crawl the web. Here's a comprehensive list:

Search & Answer Engines

YouBot - You.com AI search
Diffbot - Knowledge graph extraction
Omgilibot - Omgili search crawler
FacebookBot - Meta AI training

Research & Academic

anthropic-ai - Anthropic research
cohere-ai - Cohere AI models
AI2Bot - Allen Institute for AI
Scrapy - Default user-agent of the open-source Scrapy scraping framework, often used for research data collection

Commercial AI Services

ImagesiftBot - Image AI training
Amazonbot - Amazon AI services
Kangaroo Bot - AI data collection
Timpibot - AI search indexing

Emerging AI Bots

ChatGPT-User - ChatGPT browsing
ClaudeBot - Claude web access
Grok-bot - X (Twitter) Grok AI
Meta-ExternalAgent - Meta AI crawling
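
To make the groupings above usable in code, they can be kept in a small lookup table. The following Python sketch copies the names and categories from the lists above; the function name `categorize_crawler` and the simple case-insensitive substring match are assumptions for illustration.

```python
from typing import Optional, Tuple

# Names and groupings copied from the lists above; extend as new bots appear.
AI_CRAWLER_CATEGORIES = {
    "Search & Answer Engines": ["YouBot", "Diffbot", "Omgilibot", "FacebookBot"],
    "Research & Academic": ["anthropic-ai", "cohere-ai", "AI2Bot", "Scrapy"],
    "Commercial AI Services": ["ImagesiftBot", "Amazonbot", "Kangaroo Bot", "Timpibot"],
    "Emerging AI Bots": ["ChatGPT-User", "ClaudeBot", "Grok-bot", "Meta-ExternalAgent"],
}


def categorize_crawler(user_agent: str) -> Optional[Tuple[str, str]]:
    """Return (bot name, category) for the first listed name found in the string."""
    ua = user_agent.lower()
    for category, names in AI_CRAWLER_CATEGORIES.items():
        for name in names:
            if name.lower() in ua:
                return name, category
    return None


if __name__ == "__main__":
    print(categorize_crawler("ClaudeBot/1.0"))
```

Substring matching is deliberately loose; a stricter version might match whole product tokens only.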

Detection Methods

Server-Side Detection (Recommended)

The most reliable method is checking user-agent strings in your server logs or application code:

Node.js / Express

const express = require('express');
const app = express();

app.use((req, res, next) => {
  const userAgent = req.headers['user-agent'] || '';
  
  const aiBot = detectAIBot(userAgent);
  if (aiBot) {
    console.log(`AI Bot detected: ${aiBot}`);
    // Log to analytics, apply rate limiting, etc.
  }
  next();
});

function detectAIBot(userAgent) {
  const aiBots = [
    'GPTBot', 'Claude-Web', 'Google-Extended', 
    'CCBot', 'PerplexityBot', 'Bytespider',
    'Applebot-Extended', 'anthropic-ai', 'cohere-ai'
  ];
  
  for (const bot of aiBots) {
    if (userAgent.includes(bot)) {
      return bot;
    }
  }
  return null;
}

Python / Flask

from flask import Flask, request

app = Flask(__name__)

AI_BOTS = [
    'GPTBot', 'Claude-Web', 'Google-Extended',
    'CCBot', 'PerplexityBot', 'Bytespider'
]

def track_ai_bot(bot):
    """Placeholder: replace with a call to your own analytics backend."""

@app.before_request
def detect_ai_bot():
    user_agent = request.headers.get('User-Agent', '')
    
    for bot in AI_BOTS:
        if bot in user_agent:
            # Log detection
            app.logger.info(f'AI Bot detected: {bot}')
            # Add to analytics
            track_ai_bot(bot)
            break

PHP

<?php
$userAgent = $_SERVER['HTTP_USER_AGENT'] ?? '';

$aiBots = [
    'GPTBot', 'Claude-Web', 'Google-Extended',
    'CCBot', 'PerplexityBot', 'Bytespider'
];

foreach ($aiBots as $bot) {
    if (strpos($userAgent, $bot) !== false) {
        error_log("AI Bot detected: " . $bot);
        // Track in analytics (trackAIBot() is a placeholder you must define yourself)
        trackAIBot($bot);
        break;
    }
}
?>

Analytics Integration

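Track AI bot visits in Google Analytics or your analytics platform. Many crawlers do not execute JavaScript, so browser-side snippets like the ones below only fire for bots that render pages; botName is assumed to come from your own detection logic:

// Google Analytics 4 (botName must be set by your detection code before this runs)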
gtag('event', 'ai_bot_visit', {
  'bot_name': botName,
  'page_path': window.location.pathname,
  'timestamp': new Date().toISOString()
});

// Custom Analytics
analytics.track('AI Bot Visit', {
  botName: botName,
  userAgent: navigator.userAgent,
  page: window.location.href
});
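
Because detection happens most reliably on the server, one option is to report the event from there instead of from the browser. The Python sketch below only builds the request for Google Analytics 4's Measurement Protocol; it does not send it. The endpoint and field names reflect the Measurement Protocol as commonly documented, but verify them against Google's current reference. The function name `build_ga4_request`, the `ai-bot.<name>` client ID convention, and the placeholder credentials are assumptions for this example.

```python
import json
from urllib.parse import urlencode

GA4_ENDPOINT = "https://www.google-analytics.com/mp/collect"


def build_ga4_request(measurement_id: str, api_secret: str,
                      bot_name: str, page_path: str):
    """Build (url, body) for an 'ai_bot_visit' event; sending is left to the caller."""
    query = urlencode({"measurement_id": measurement_id, "api_secret": api_secret})
    body = {
        # Hypothetical convention: one synthetic client ID per bot name.
        "client_id": f"ai-bot.{bot_name}",
        "events": [{
            "name": "ai_bot_visit",
            "params": {"bot_name": bot_name, "page_path": page_path},
        }],
    }
    return f"{GA4_ENDPOINT}?{query}", json.dumps(body).encode("utf-8")


if __name__ == "__main__":
    # "G-XXXXXXX" and "YOUR_API_SECRET" are placeholders, not real credentials.
    url, body = build_ga4_request("G-XXXXXXX", "YOUR_API_SECRET", "GPTBot", "/blog/")
    print(url)
    print(body.decode("utf-8"))
```

The returned pair can be sent as an HTTP POST with a JSON content type, for example with `urllib.request.Request(url, data=body, method="POST")`.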

Blocking Strategies

Method 1: robots.txt (Simple)

Block the major AI bots in your robots.txt file. Compliance is voluntary, so this only stops crawlers that honor robots.txt:

# Block major AI training bots
User-agent: GPTBot
Disallow: /

User-agent: Claude-Web
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: Applebot-Extended
Disallow: /
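
Before deploying rules like these, it can help to check that they parse the way you expect. The Python sketch below generates the same kind of file from a list of bot names and tests it with the standard library's `urllib.robotparser`. The helper names `build_robots_txt` and `is_allowed` and the example.com URL are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

BLOCKED_BOTS = [
    "GPTBot", "Claude-Web", "Google-Extended", "CCBot",
    "PerplexityBot", "Bytespider", "Applebot-Extended",
]


def build_robots_txt(bots):
    """Emit one 'User-agent / Disallow: /' group per bot name."""
    groups = [f"User-agent: {bot}\nDisallow: /" for bot in bots]
    return "# Block major AI training bots\n" + "\n\n".join(groups) + "\n"


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Ask the standard-library parser whether user_agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    robots = build_robots_txt(BLOCKED_BOTS)
    print(robots)
    print("GPTBot allowed:", is_allowed(robots, "GPTBot", "https://example.com/blog/"))
    print("Googlebot allowed:", is_allowed(robots, "Googlebot", "https://example.com/blog/"))
```

This checks only how a standards-following parser reads the file; it says nothing about whether a given crawler actually obeys it.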

Method 2: llms.txt (Granular Control)

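Use llms.txt for selective policies. llms.txt is an emerging convention rather than a ratified standard, so support varies by crawler; the robots.txt-style directives below illustrate the policy format this guide assumes: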

# llms.txt - Selective AI Policy

# Allow blog content
User-agent: *
Allow: /blog/
Allow: /docs/

# Block everything else
Disallow: /admin/
Disallow: /user/
Disallow: /premium/

# Specific rules for GPTBot
User-agent: GPTBot
Allow: /
Disallow: /private/
Crawl-delay: 2

Method 3: Server-Level Blocking

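Block at the server level so enforcement does not depend on the bot's cooperation (a crawler that spoofs its user-agent can still slip past a user-agent match):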

# Apache .htaccess
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (GPTBot|Claude-Web|CCBot) [NC]
RewriteRule .* - [F,L]

# Nginx (place inside a server or location block)
if ($http_user_agent ~* (GPTBot|Claude-Web|CCBot)) {
    return 403;
}
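
The same rule can also be applied in application code when you cannot edit the web server configuration. Below is a minimal Python sketch of a WSGI middleware that returns 403 for matching user-agents. The class name `BlockAIBots` and the demo app are assumptions for this example, and the pattern mirrors the three bots used in the Apache and Nginx rules above.

```python
import re

# Same bots as the Apache/Nginx rules above, matched case-insensitively.
BLOCKED_UA = re.compile(r"GPTBot|Claude-Web|CCBot", re.IGNORECASE)


class BlockAIBots:
    """WSGI middleware that answers 403 when the User-Agent matches the pattern."""

    def __init__(self, app, pattern=BLOCKED_UA):
        self.app = app
        self.pattern = pattern

    def __call__(self, environ, start_response):
        user_agent = environ.get("HTTP_USER_AGENT", "")
        if self.pattern.search(user_agent):
            body = b"Forbidden"
            start_response("403 Forbidden", [
                ("Content-Type", "text/plain"),
                ("Content-Length", str(len(body))),
            ])
            return [body]
        # Not a blocked bot: hand the request to the wrapped application.
        return self.app(environ, start_response)


def demo_app(environ, start_response):
    """Stand-in application used only to exercise the middleware."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"ok"]


if __name__ == "__main__":
    wrapped = BlockAIBots(demo_app)
    seen = []
    result = wrapped({"HTTP_USER_AGENT": "GPTBot/1.0"}, lambda s, h: seen.append(s))
    print(seen[0], result)
```

With Flask, a wrapper like this is typically attached via `app.wsgi_app = BlockAIBots(app.wsgi_app)`.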

Monitoring AI Bot Activity

Server Log Analysis

Analyze your server logs to see which AI bots are visiting:

# Count AI bot visits in Apache/Nginx logs
grep -E "(GPTBot|Claude-Web|Google-Extended|CCBot)" access.log | wc -l

# See which bots visited (count per bot name)
grep -oE "(GPTBot|Claude-Web|Google-Extended|CCBot)" access.log | \
  sort | uniq -c

# Track by date
grep "GPTBot" access.log | awk '{print $4}' | cut -d: -f1 | \
  sort | uniq -c
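
For anything beyond one-off checks, the same counts can be produced in a script. This Python sketch assumes the Apache/Nginx combined log format, where the user-agent is the last double-quoted field on each line; the function name `count_bot_hits` and the sample log lines are made up for illustration.

```python
import re
from collections import Counter

AI_BOTS = ["GPTBot", "Claude-Web", "Google-Extended", "CCBot"]
_BOT_RE = re.compile("|".join(re.escape(bot) for bot in AI_BOTS))
# In the combined log format the user-agent is the last double-quoted field.
_UA_RE = re.compile(r'"([^"]*)"\s*$')


def count_bot_hits(lines):
    """Count requests per AI bot name across an iterable of log lines."""
    counts = Counter()
    for line in lines:
        ua_match = _UA_RE.search(line)
        if not ua_match:
            continue  # line does not end with a quoted field
        bot_match = _BOT_RE.search(ua_match.group(1))
        if bot_match:
            counts[bot_match.group(0)] += 1
    return counts


if __name__ == "__main__":
    # Fabricated sample lines, for demonstration only.
    sample = [
        '203.0.113.5 - - [10/Oct/2024:13:55:36 +0000] "GET / HTTP/1.1" 200 512 "-" '
        '"Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.0; +https://openai.com/gptbot)"',
        '203.0.113.9 - - [10/Oct/2024:13:56:01 +0000] "GET /blog/ HTTP/1.1" 200 2048 "-" '
        '"CCBot/2.0 (https://commoncrawl.org/faq/)"',
    ]
    for bot, hits in count_bot_hits(sample).most_common():
        print(bot, hits)
```

To run it on a real log, pass an open file object, for example `count_bot_hits(open("access.log"))`.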

Real-Time Tracking

Use our free AI bot tracker for real-time monitoring:

Free AI Bot Tracker

See which AI bots visit your site in real-time with our invisible tracking widget. Tracks 20+ AI crawlers automatically.


Best Practices

✅ DO: Monitor Before Blocking

Track AI bot activity for 2-4 weeks before implementing blocking policies. Understand which bots visit and how often.

✅ DO: Use llms.txt for Granular Control

Implement selective policies that allow public content while protecting sensitive areas.

✅ DO: Document Your Policy

Include comments in your llms.txt explaining your reasoning and contact information.

❌ DON'T: Block Without Understanding Impact

Blocking all AI bots may reduce your visibility in AI-powered search results.

❌ DON'T: Forget to Update

New AI bots emerge regularly. Review and update your policies quarterly.

Quick Reference Table

Bot Name	Company	Respects llms.txt	Traffic Level
GPTBot	OpenAI	✅ 94%	High
Claude-Web	Anthropic	✅ 91%	High
Google-Extended	Google	✅ 89%	High
CCBot	Common Crawl	⚠️ 67%	Very High
PerplexityBot	Perplexity	✅ Yes	Medium
Bytespider	ByteDance	❓ Unknown	High

