LLMS Central - The Robots.txt for AI
October 7, 2025 • 9 min read • Reference Guide

Complete Guide to AI Bot User-Agents

The definitive reference for identifying, tracking, and managing 20+ AI crawlers visiting your website.

Quick Reference

This guide covers all major AI bot user-agents as of October 2025. Bookmark this page as your go-to reference for AI crawler identification.

20+ Bots • Detection Code • Blocking Methods

Understanding User-Agent Strings

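Most bots that visit your website identify themselves through a user-agent string, an HTTP request header that names the software accessing your site. Reading it lets you track, analyze, and control AI crawler access. The header is self-reported, so a client can send any value it likes; treat it as a strong hint, not proof of identity.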

User-Agent: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.0; +https://openai.com/gptbot)

AI bots typically include their name and a link to documentation in their user-agent string, making them identifiable in server logs and analytics tools.
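As a rough illustration of reading such a string, the sketch below pulls the bot name, version, and documentation link out of a user-agent header. The names `BotToken`, `parse_bot_token`, and the browser-token skip list are assumptions made for this example, and the matching is a heuristic, not a complete user-agent parser.

```python
import re
from typing import NamedTuple, Optional


class BotToken(NamedTuple):
    name: str
    version: str
    info_url: Optional[str]


# "Name/1.0" style product tokens, and an optional "+https://..." documentation link.
_PRODUCT_RE = re.compile(r"([A-Za-z][\w.-]*)/(\d[\w.]*)")
_URL_RE = re.compile(r"\+?(https?://[^\s;)]+)")

# Assumption: tokens that ordinary browsers send and that we want to skip.
_BROWSER_TOKENS = {
    "Mozilla", "AppleWebKit", "Chrome", "Safari", "Gecko",
    "Version", "Mobile", "Firefox", "Edg", "OPR",
}


def parse_bot_token(user_agent: str) -> Optional[BotToken]:
    """Return the first non-browser product token, or None if there is none."""
    for match in _PRODUCT_RE.finditer(user_agent):
        name, version = match.group(1), match.group(2)
        if name in _BROWSER_TOKENS:
            continue
        # Look for a documentation link after the token.
        url = _URL_RE.search(user_agent, match.end())
        return BotToken(name, version, url.group(1) if url else None)
    return None


example = (
    "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; "
    "compatible; GPTBot/1.0; +https://openai.com/gptbot)"
)
print(parse_bot_token(example))
```

A regular browser string yields `None` here because all of its product tokens are on the skip list; any client that sends an unlisted token would be reported, so the result is a starting point for log review rather than a verdict.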

Major AI Bot User-Agents

GPTBot

OpenAI • ChatGPT, GPT-4, GPT-3.5

High Traffic
GPTBot/1.0 (+https://openai.com/gptbot)

Purpose: Training ChatGPT and GPT models

Respects llms.txt: ✅ Yes (94% compliance)

Documentation: openai.com/gptbot

Block in robots.txt: User-agent: GPTBot / Disallow: /

Claude-Web

Anthropic • Claude AI

High Traffic
Claude-Web/1.0 (+https://www.anthropic.com/bot)

Purpose: Training Claude language models

Respects llms.txt: ✅ Yes (91% compliance)

Documentation: anthropic.com/bot

Block in robots.txt: User-agent: Claude-Web / Disallow: /

Google-Extended

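Google • Gemini (formerly Bard)

High Traffic
Google-Extended (robots.txt product token; +https://developers.google.com/search/docs/crawling-indexing/overview-google-crawlers)

Purpose: Controlling whether content Google crawls is used to train Gemini and improve AI products (separate from search indexing). Google documents Google-Extended as a robots.txt token with no separate HTTP user-agent string, so it does not appear in server logs; the crawling itself uses Google's existing user agents.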

Respects llms.txt: ✅ Yes (89% compliance)

Documentation: Google Crawlers

Block in robots.txt: User-agent: Google-Extended / Disallow: /

CCBot

Common Crawl • Multiple AI Companies

Very High Traffic
CCBot/2.0 (https://commoncrawl.org/faq/)

Purpose: Building web archive used by many AI companies for training

Respects llms.txt: ⚠️ Partial (67% compliance)

Documentation: commoncrawl.org

Block in robots.txt: User-agent: CCBot / Disallow: /

PerplexityBot

Perplexity AI • AI Search Engine

Medium Traffic
PerplexityBot/1.0 (+https://perplexity.ai/bot)

Purpose: Real-time search and answer generation

Respects llms.txt: ✅ Yes

Documentation: perplexity.ai/bot

Block in robots.txt: User-agent: PerplexityBot / Disallow: /

Bytespider

ByteDance • TikTok AI

High Traffic
Bytespider/1.0 (+https://bytedance.com/)

Purpose: Training AI models for TikTok and ByteDance products

Respects llms.txt: ⚠️ Unknown

Documentation: Limited public documentation

Block in robots.txt: User-agent: Bytespider / Disallow: /

Applebot-Extended

Apple • Apple Intelligence

Medium Traffic
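Applebot-Extended (robots.txt token; +https://support.apple.com/en-us/119829)

Purpose: Controlling whether content crawled by Applebot is used to train Apple Intelligence and AI features. Apple describes Applebot-Extended as a usage control rather than a separate crawler, so requests arrive under the Applebot user-agent.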

Respects llms.txt: ✅ Yes

Documentation: Apple Support

Block in robots.txt: User-agent: Applebot-Extended / Disallow: /

Additional AI Crawlers

Beyond the major players, numerous other AI bots crawl the web. Here's a comprehensive list:

Search & Answer Engines

  • YouBot - You.com AI search
  • Diffbot - Knowledge graph extraction
  • Omgilibot - Omgili search crawler
  • FacebookBot - Meta AI training

Research & Academic

  • anthropic-ai - Anthropic research
  • cohere-ai - Cohere AI models
  • AI2Bot - Allen Institute for AI
  • Scrapy - Default user-agent of the open-source Scrapy framework, used for research data collection among other scraping

Commercial AI Services

  • ImagesiftBot - Image AI training
  • Amazonbot - Amazon AI services
  • Kangaroo Bot - AI data collection
  • Timpibot - AI search indexing

Emerging AI Bots

  • ChatGPT-User - ChatGPT browsing
  • ClaudeBot - Claude web access
  • Grok-bot - X (Twitter) Grok AI
  • Meta-ExternalAgent - Meta AI crawling

Detection Methods

Server-Side Detection (Recommended)

The most reliable method is checking user-agent strings in your server logs or application code:

Node.js / Express

const express = require('express');
const app = express();

// Substrings that identify AI crawlers in the User-Agent header.
const AI_BOTS = [
  'GPTBot', 'Claude-Web', 'Google-Extended',
  'CCBot', 'PerplexityBot', 'Bytespider',
  'Applebot-Extended', 'anthropic-ai', 'cohere-ai'
];

// Return the first matching bot name (case-insensitive), or null.
function detectAIBot(userAgent) {
  const ua = userAgent.toLowerCase();
  return AI_BOTS.find((bot) => ua.includes(bot.toLowerCase())) || null;
}

app.use((req, res, next) => {
  const userAgent = req.headers['user-agent'] || '';

  const aiBot = detectAIBot(userAgent);
  if (aiBot) {
    console.log(`AI Bot detected: ${aiBot}`);
    // Log to analytics, apply rate limiting, etc.
  }
  next();
});

Python / Flask

from flask import Flask, request

app = Flask(__name__)

AI_BOTS = [
    'GPTBot', 'Claude-Web', 'Google-Extended',
    'CCBot', 'PerplexityBot', 'Bytespider'
]


def track_ai_bot(bot):
    """Placeholder: forward the detection to your analytics backend."""


@app.before_request
def detect_ai_bot():
    user_agent = request.headers.get('User-Agent', '')

    for bot in AI_BOTS:
        if bot in user_agent:
            # Log detection
            app.logger.info(f'AI Bot detected: {bot}')
            # Add to analytics
            track_ai_bot(bot)
            break

PHP

<?php
// Placeholder: forward the detection to your analytics backend.
function trackAIBot(string $bot): void {}

$userAgent = $_SERVER['HTTP_USER_AGENT'] ?? '';

$aiBots = [
    'GPTBot', 'Claude-Web', 'Google-Extended',
    'CCBot', 'PerplexityBot', 'Bytespider'
];

foreach ($aiBots as $bot) {
    // stripos() matches regardless of letter case
    if (stripos($userAgent, $bot) !== false) {
        error_log("AI Bot detected: " . $bot);
        // Track in analytics
        trackAIBot($bot);
        break;
    }
}
?>

Analytics Integration

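Track AI bot visits in Google Analytics or your analytics platform. These client-side calls only fire for clients that execute JavaScript, which many crawlers do not, so treat them as a supplement to server-side logging:

// botName must be supplied by your server-side detection,
// for example by rendering the detected name into the page.

// Google Analytics 4
gtag('event', 'ai_bot_visit', {
  'bot_name': botName,
  'page_path': window.location.pathname,
  'timestamp': new Date().toISOString()
});

// Custom Analytics
analytics.track('AI Bot Visit', {
  botName: botName,
  userAgent: navigator.userAgent,
  page: window.location.href
});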

Blocking Strategies

Method 1: robots.txt (Simple)

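Block the major AI bots by listing each one in your robots.txt file. Compliance is voluntary: robots.txt tells well-behaved crawlers what to skip but does not enforce it.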

# Block major AI training bots
User-agent: GPTBot
Disallow: /

User-agent: Claude-Web
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: Applebot-Extended
Disallow: /
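Before deploying rules like these, it can help to check what they actually block. The following is a minimal sketch using Python's standard `urllib.robotparser`; the helper name `blocked_bots` and the shortened rule set are assumptions for the example, and the parser reflects Python's reading of the file, which may differ in edge cases from how a given crawler interprets it.

```python
from urllib.robotparser import RobotFileParser

# A shortened copy of the rules above; paste your real robots.txt here.
ROBOTS_TXT = """\
# Block major AI training bots
User-agent: GPTBot
Disallow: /

User-agent: Claude-Web
Disallow: /

User-agent: CCBot
Disallow: /
"""


def blocked_bots(robots_txt, bots, path="/"):
    """Map each bot name to True if the rules forbid it from fetching `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {bot: not parser.can_fetch(bot, path) for bot in bots}


result = blocked_bots(ROBOTS_TXT, ["GPTBot", "Claude-Web", "CCBot", "PerplexityBot"])
for bot, blocked in result.items():
    print(f"{bot}: {'blocked' if blocked else 'allowed'}")
```

In this shortened rule set PerplexityBot has no matching group and there is no wildcard group, so it comes back as allowed, which is a quick way to spot a bot you meant to list but left out.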

Method 2: llms.txt (Granular Control)

Use llms.txt for selective policies:

# llms.txt - Selective AI Policy

# Default policy: allow blog and docs, block sensitive areas
User-agent: *
Allow: /blog/
Allow: /docs/
Disallow: /admin/
Disallow: /user/
Disallow: /premium/

# Specific rules for GPTBot
User-agent: GPTBot
Disallow: /private/
Allow: /
Crawl-delay: 2
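To see how a policy like this resolves for individual paths, here is a small sketch that evaluates it with Python's standard `urllib.robotparser`. It assumes the llms.txt directives follow robots.txt semantics, which is an assumption of this example and not something every crawler is known to implement. The policy text is a copy of the example with each group kept in one unbroken block and the GPTBot `Disallow` line placed before `Allow: /`, because this parser ends a group at a blank line and applies the first rule that matches.

```python
from urllib.robotparser import RobotFileParser

POLICY = """\
User-agent: *
Allow: /blog/
Allow: /docs/
Disallow: /admin/
Disallow: /user/
Disallow: /premium/

User-agent: GPTBot
Disallow: /private/
Allow: /
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(POLICY.splitlines())


def allowed(bot, path):
    """True if the policy lets `bot` fetch `path` under robots.txt semantics."""
    return parser.can_fetch(bot, path)


checks = [
    ("CCBot", "/blog/post"),
    ("CCBot", "/admin/users"),
    ("GPTBot", "/private/report"),
    ("GPTBot", "/admin/users"),
]
for bot, path in checks:
    print(bot, path, "allowed" if allowed(bot, path) else "blocked")
print("GPTBot crawl delay:", parser.crawl_delay("GPTBot"))
```

One design point worth noticing: a bot with its own group uses only that group, so GPTBot is not bound by the `/admin/` rule in the wildcard group. If GPTBot should also stay out of those paths, repeat the `Disallow` lines in its group.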

Method 3: Server-Level Blocking

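Block at the server level when you need enforcement that does not rely on the bot's cooperation. Matching is still based on the user-agent header, so a client that spoofs it will get through: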

# Apache .htaccess
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (GPTBot|Claude-Web|CCBot) [NC]
RewriteRule .* - [F,L]

# Nginx (place inside a server or location block)
if ($http_user_agent ~* (GPTBot|Claude-Web|CCBot)) {
    return 403;
}
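The same rule can be applied inside the application when you cannot edit the web server configuration. The sketch below is a plain WSGI middleware using only the standard library; `block_ai_bots`, `hello`, and `status_for` are hypothetical names for this example, and the pattern mirrors the three bots in the Apache and Nginx rules above.

```python
import re
from wsgiref.util import setup_testing_defaults

# Same bots as the server rules above, matched case-insensitively.
BLOCKED = re.compile(r"GPTBot|Claude-Web|CCBot", re.IGNORECASE)


def block_ai_bots(app):
    """Wrap a WSGI app and answer 403 when the User-Agent matches BLOCKED."""
    def middleware(environ, start_response):
        if BLOCKED.search(environ.get("HTTP_USER_AGENT", "")):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        return app(environ, start_response)
    return middleware


def hello(environ, start_response):
    """Stand-in application used only to exercise the middleware."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Hello\n"]


def status_for(user_agent):
    """Send one fake request through the wrapped app and return its status line."""
    environ = {}
    setup_testing_defaults(environ)
    environ["HTTP_USER_AGENT"] = user_agent
    captured = []
    block_ai_bots(hello)(environ, lambda status, headers: captured.append(status))
    return captured[0]


print(status_for("GPTBot/1.0 (+https://openai.com/gptbot)"))
print(status_for("Mozilla/5.0 (X11; Linux x86_64)"))
```

Blocking at the web server is usually cheaper because the request never reaches the application; the middleware form is mainly useful on hosting where only application code is under your control.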

Monitoring AI Bot Activity

Server Log Analysis

Analyze your server logs to see which AI bots are visiting:

# Count AI bot visits in Apache/Nginx logs
grep -E "(GPTBot|Claude-Web|Google-Extended|CCBot)" access.log | wc -l

# See which bots visited, with a hit count per bot
grep -oE "(GPTBot|Claude-Web|Google-Extended|CCBot)" access.log | \
  sort | uniq -c | sort -rn

# Track GPTBot visits by date (field 4 is "[day/month/year:time" in the combined log format)
grep "GPTBot" access.log | awk '{print $4}' | cut -d: -f1 | tr -d '[' | \
  sort | uniq -c
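For a per-day, per-bot breakdown in one pass, the same analysis can be sketched in Python. This assumes the Apache/Nginx combined log format, where the user-agent is the last double-quoted field; `count_bot_hits` and the sample lines (which use documentation-only IP addresses) are illustrative, not real traffic.

```python
import re
from collections import Counter

BOT_RE = re.compile(r"GPTBot|Claude-Web|Google-Extended|CCBot")
# Combined log format: "[day/month/year:time zone]" ... last quoted field is the user-agent.
LINE_RE = re.compile(r'\[(?P<day>[^:]+):[^\]]+\].*"(?P<ua>[^"]*)"\s*$')


def count_bot_hits(lines):
    """Count hits per (day, bot) pair across an iterable of log lines."""
    hits = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if not match:
            continue  # not in combined log format
        bot = BOT_RE.search(match["ua"])
        if bot:
            hits[(match["day"], bot.group(0))] += 1
    return hits


sample = [
    '203.0.113.7 - - [07/Oct/2025:10:15:32 +0000] "GET /blog/ HTTP/1.1" 200 5120 "-" '
    '"Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.0; +https://openai.com/gptbot)"',
    '203.0.113.7 - - [07/Oct/2025:10:15:40 +0000] "GET /docs/ HTTP/1.1" 200 2048 "-" '
    '"Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.0; +https://openai.com/gptbot)"',
    '198.51.100.3 - - [08/Oct/2025:02:01:00 +0000] "GET / HTTP/1.1" 200 1024 "-" '
    '"CCBot/2.0 (https://commoncrawl.org/faq/)"',
    '192.0.2.9 - - [08/Oct/2025:02:05:00 +0000] "GET / HTTP/1.1" 200 1024 "-" '
    '"Mozilla/5.0 (X11; Linux x86_64)"',
]
hits = count_bot_hits(sample)
for (day, bot), count in sorted(hits.items()):
    print(day, bot, count)
```

To run it against a real file, pass the open file object, for example `count_bot_hits(open("access.log"))`.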

Real-Time Tracking

Use our free AI bot tracker for real-time monitoring:

Free AI Bot Tracker

See which AI bots visit your site in real-time with our invisible tracking widget. Tracks 20+ AI crawlers automatically.


Best Practices

✅ DO: Monitor Before Blocking

Track AI bot activity for 2-4 weeks before implementing blocking policies. Understand which bots visit and how often.

✅ DO: Use llms.txt for Granular Control

Implement selective policies that allow public content while protecting sensitive areas.

✅ DO: Document Your Policy

Include comments in your llms.txt that explain your reasoning and give contact information.

❌ DON'T: Block Without Understanding Impact

Blocking all AI bots may reduce your visibility in AI-powered search results.

❌ DON'T: Forget to Update

New AI bots emerge regularly. Review and update your policies quarterly.

Quick Reference Table

Bot Name | Company | Respects llms.txt | Traffic Level
GPTBot | OpenAI | ✅ 94% | High
Claude-Web | Anthropic | ✅ 91% | High
Google-Extended | Google | ✅ 89% | High
CCBot | Common Crawl | ⚠️ 67% | Very High
PerplexityBot | Perplexity | ✅ Yes | Medium
Bytespider | ByteDance | ❓ Unknown | High


Published on October 7, 2025 by LLMS Central Team • Updated regularly