October 7, 2025 • 8 min read • Analysis

I Analyzed 2,000+ llms.txt Files. Here's What Most Websites Get Wrong.

Surprising insights from analyzing thousands of llms.txt implementations—and how to avoid these common mistakes.

Key Finding

67% of llms.txt files contain at least one critical error that undermines their effectiveness. Here's what we found.

The Research

As the maintainers of LLMS Central, we've validated and analyzed over 2,000 llms.txt files from websites across 15 industries. What we discovered was eye-opening: most websites are making preventable mistakes that reduce the effectiveness of their AI training policies.

This article breaks down the 10 most common mistakes, ranked by frequency, with real examples and fixes.

Mistake #1: Wrong File Location

Found in: 23% of implementations

The single most common mistake—and the most damaging.

The Problem

AI crawlers look for llms.txt at your domain root: https://example.com/llms.txt

We found files in wrong locations like:

  • /content/llms.txt
  • /public/llms.txt
  • /assets/llms.txt
  • /wp-content/llms.txt

The Fix

✅ Correct Location

Place your llms.txt file at: https://yourdomain.com/llms.txt

Test by visiting the URL directly in your browser. If you get a 404, it's in the wrong place.
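If you want to script that check, here's a minimal sketch in Python using only the standard library (example.com is a placeholder for your own domain):

import urllib.error
import urllib.request

def check_llms_txt(domain):
    # An llms.txt at the domain root should answer with HTTP 200
    url = f"https://{domain}/llms.txt"
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            print(f"{url}: HTTP {response.status} - found")
            return True
    except urllib.error.HTTPError as error:
        print(f"{url}: HTTP {error.code} - missing or in the wrong place")
        return False
    except urllib.error.URLError as error:
        print(f"{url}: unreachable ({error.reason})")
        return False

check_llms_txt("example.com")  # replace with your domain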

Mistake #2: Conflicting Directives

Found in: 19% of implementations

Contradictory rules confuse AI crawlers and undermine your policy.

The Problem

# ❌ BAD: Conflicting rules
User-agent: *
Allow: /blog/
Disallow: /blog/private/
Allow: /blog/private/public/  # This contradicts the above!

# Another common conflict
User-agent: GPTBot
Allow: /
Disallow: /  # Which one applies?

The Fix

# ✅ GOOD: Clear hierarchy
User-agent: *
Allow: /blog/
Disallow: /blog/private/
# Don't add exceptions to exceptions

# Clear single directive
User-agent: GPTBot
Allow: /public/
Disallow: /private/
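Conflicts like the Allow: / plus Disallow: / example are easy to catch mechanically. Here's a rough sketch of a conflict check; it uses simplified parsing rules and only flags the most blatant case, a path that is both allowed and disallowed for the same user-agent:

from collections import defaultdict

def find_conflicts(policy_text):
    # Collect Allow/Disallow paths per user-agent, then intersect them
    rules = defaultdict(lambda: {"allow": set(), "disallow": set()})
    agent = None
    for raw_line in policy_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments and blanks
        if not line:
            continue
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            agent = value
        elif field in ("allow", "disallow") and agent:
            rules[agent][field].add(value)
    return {
        agent: sorted(paths["allow"] & paths["disallow"])
        for agent, paths in rules.items()
        if paths["allow"] & paths["disallow"]
    }

policy = """User-agent: GPTBot
Allow: /
Disallow: /
"""
print(find_conflicts(policy))  # {'GPTBot': ['/']}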

Mistake #3: Missing User-Agent Declarations

Found in: 16% of implementations

Directives without user-agent declarations are ignored.

The Problem

# ❌ BAD: No user-agent specified
Allow: /blog/
Disallow: /admin/
# AI crawlers don't know who this applies to!

The Fix

# ✅ GOOD: Always specify user-agent
User-agent: *
Allow: /blog/
Disallow: /admin/

Mistake #4: Overly Aggressive Crawl Delays

Found in: 14% of implementations

Excessive delays can cause AI crawlers to give up entirely.

The Problem

# ❌ BAD: Way too aggressive
User-agent: *
Crawl-delay: 60  # 60 seconds is excessive!

User-agent: GPTBot
Crawl-delay: 300  # 5 minutes? Really?

The Fix

# ✅ GOOD: Reasonable delays
User-agent: *
Crawl-delay: 2  # 2 seconds is respectful

User-agent: GPTBot
Crawl-delay: 1  # 1 second for trusted bots

Recommended delays: 1-5 seconds for most sites, 5-10 seconds only if you have severe bandwidth constraints.
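To see why aggressive delays backfire, a quick back-of-the-envelope calculation helps (an upper bound only; real crawlers also spend time downloading each page):

SECONDS_PER_DAY = 24 * 60 * 60

for delay in (1, 2, 5, 60, 300):
    # At most one request every `delay` seconds, ignoring response time
    max_pages = SECONDS_PER_DAY // delay
    print(f"Crawl-delay: {delay:>3}s -> at most {max_pages:,} pages/day")

At a 60-second delay a crawler tops out around 1,440 pages a day, and at 300 seconds just 288, which is why overly cautious settings often mean your content never gets crawled at all.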

Mistake #5: Blocking Everything (Unnecessarily)

Found in: 12% of implementations

Many sites block all AI training without considering the benefits.

The Problem

We found many sites with blanket blocks, even when they had public educational content that would benefit from AI visibility:

# ❌ BAD: Blocking everything
User-agent: *
Disallow: /
# Even public blog posts and documentation!

The Fix

Use selective policies that protect sensitive content while allowing public content:

# ✅ GOOD: Selective policy
User-agent: *
Allow: /blog/
Allow: /docs/
Allow: /about/
Disallow: /admin/
Disallow: /user/
Disallow: /premium/

Mistake #6: Incorrect Wildcard Usage

Found in: 11% of implementations

Wildcards don't work the way most people think.

The Problem

# ❌ BAD: Incorrect wildcard syntax
User-agent: *
Disallow: /user*/  # Doesn't work as expected
Disallow: /*.pdf   # Wrong syntax

The Fix

# ✅ GOOD: Correct wildcard usage
User-agent: *
Disallow: /user  # Blocks /user, /users, /user123, etc.
Disallow: /*.pdf$  # Block PDF files (if supported)

Note: Not all AI crawlers support advanced wildcards. Keep patterns simple.
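If you're unsure how a pattern will behave, test it locally. This sketch follows the common robots.txt convention (plain paths are prefix matches, * matches any characters, a trailing $ anchors the end); as noted above, individual AI crawlers may interpret patterns differently:

import re

def pattern_to_regex(pattern):
    # Prefix match by default; '*' = any characters, trailing '$' = end of URL
    regex = re.escape(pattern).replace(r"\*", ".*")
    if regex.endswith(r"\$"):
        regex = regex[:-2] + "$"
    return re.compile(regex)

pdf_rule = pattern_to_regex("/*.pdf$")
print(bool(pdf_rule.match("/reports/q3.pdf")))      # True
print(bool(pdf_rule.match("/reports/q3.html")))     # False

prefix_rule = pattern_to_regex("/user")
print(bool(prefix_rule.match("/user123/profile")))  # True: prefix match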

Mistake #7: No Documentation or Comments

Found in: 34% of implementations

Files with no comments are hard to maintain and update.

The Problem

User-agent: *
Allow: /blog/
Disallow: /x/
Disallow: /y/
Disallow: /z/

What are /x/, /y/, and /z/? Why are they blocked? Future maintainers (or even you in 6 months) won't know.

The Fix

# llms.txt - AI Training Policy
# Last updated: 2025-10-07
# Contact: ai-policy@example.com

# Allow public content
User-agent: *
Allow: /blog/
# Block internal tools
Disallow: /admin/  # Admin dashboard
Disallow: /staging/  # Staging environment
Disallow: /test/  # Test pages

Mistake #8: Forgetting About Subdomains

Found in: 9% of implementations

Each subdomain needs its own llms.txt file.

The Problem

Sites with multiple subdomains often only implement llms.txt on the main domain:

  • example.com/llms.txt - Has policy
  • blog.example.com/llms.txt - Missing!
  • docs.example.com/llms.txt - Missing!
  • api.example.com/llms.txt - Missing!

The Fix

Create llms.txt files for each subdomain with appropriate policies for that subdomain's content.
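One way to audit this is to loop over every host you serve and check for the file, along the same lines as the Mistake #1 check (the host list below is a placeholder):

import urllib.error
import urllib.request

# Replace with your own subdomains
HOSTS = ["example.com", "blog.example.com", "docs.example.com", "api.example.com"]

for host in HOSTS:
    url = f"https://{host}/llms.txt"
    request = urllib.request.Request(url, method="HEAD")  # headers only
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            print(f"{url}: HTTP {response.status}")
    except urllib.error.HTTPError as error:
        print(f"{url}: HTTP {error.code} (no policy here)")
    except urllib.error.URLError as error:
        print(f"{url}: unreachable ({error.reason})")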

Mistake #9: Not Testing the Implementation

Found in: 27% of implementations

Many files have syntax errors that could have been caught with basic testing.

Common Validation Errors

  • Typos in directive names (Dissallow instead of Disallow)
  • Missing colons after directives
  • Invalid user-agent names
  • Incorrect line breaks or encoding

The Fix

✅ Always Validate

Use our free validation tool before deploying:

Validate Your llms.txt →
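If you'd also like a quick local sanity check, here's a minimal lint sketch; it only covers the basic errors listed above (unknown directive names and missing colons), not the full format:

KNOWN_FIELDS = {"user-agent", "allow", "disallow", "crawl-delay"}

def lint_llms_txt(text):
    # Report unknown directive names and lines missing a colon
    problems = []
    for number, raw_line in enumerate(text.splitlines(), start=1):
        line = raw_line.split("#", 1)[0].strip()  # ignore comments and blanks
        if not line:
            continue
        if ":" not in line:
            problems.append(f"line {number}: missing ':' in {line!r}")
            continue
        field = line.split(":", 1)[0].strip().lower()
        if field not in KNOWN_FIELDS:
            problems.append(f"line {number}: unknown directive {field!r}")
    return problems

sample = "User-agent: *\nDissallow: /admin/\nAllow /blog/\n"
for problem in lint_llms_txt(sample):
    print(problem)
# line 2: unknown directive 'dissallow'
# line 3: missing ':' in 'Allow /blog/'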

Mistake #10: Never Updating the Policy

Found in: 41% of implementations

The AI landscape changes rapidly. Set-and-forget doesn't work.

The Problem

We found llms.txt files that:

  • Don't mention new AI crawlers (Google-Extended, Applebot-Extended)
  • Reference deprecated bot names
  • Have outdated contact information
  • Block paths that no longer exist
  • Allow paths that are now sensitive

The Fix

✅ Regular Review Schedule

  • Quarterly reviews - Check for new AI bots and policy changes
  • After site updates - Update when you add/remove content sections
  • Monitor compliance - Track which bots are actually visiting
  • Document changes - Keep a changelog in comments

Bonus: Positive Patterns We Found

Not everything was bad! Here are patterns from the best implementations:

✅ Clear Documentation

Top implementations include detailed comments, contact info, and last-updated dates.

✅ Selective Policies

The best sites allow public content while protecting sensitive areas—not all-or-nothing.

✅ Bot-Specific Rules

Sophisticated implementations have different rules for different AI systems based on their use cases.

✅ Reasonable Crawl Delays

Best practices use 1-5 second delays—enough to prevent overload without being excessive.

Your Action Plan

Audit Your llms.txt File

  1. Check location: Is it at yourdomain.com/llms.txt?
  2. Validate syntax: Use our free validator to catch errors
  3. Review directives: Check for conflicts and missing user-agent declarations
  4. Add documentation: Include comments and contact info
  5. Set a review schedule: Put quarterly audits on your calendar

Get It Right the First Time

Use our free tools to create and validate your llms.txt file.

Conclusion

After analyzing 2,000+ llms.txt files, the pattern is clear: most mistakes are preventable with basic validation and testing.

The good news? Fixing these issues is straightforward. Use the checklist above, validate your implementation, and set a reminder to review quarterly.

Your AI training policy is too important to get wrong. Take 15 minutes today to audit your llms.txt file—your future self will thank you.
