I Analyzed 2,000+ llms.txt Files. Here's What Most Websites Get Wrong.
Surprising insights from analyzing thousands of llms.txt implementations—and how to avoid these common mistakes.
Key Finding
67% of llms.txt files contain at least one critical error that undermines their effectiveness. Here's what we found.
The Research
As the maintainers of LLMS Central, we've validated and analyzed over 2,000 llms.txt files from websites across 15 industries. What we discovered was eye-opening: most websites are making preventable mistakes that reduce the effectiveness of their AI training policies.
This article breaks down the 10 most common mistakes, ranked by frequency, with real examples and fixes.
Mistake #1: Wrong File Location
Found in: 23% of implementations
The single most common mistake—and the most damaging.
The Problem
AI crawlers look for llms.txt at your domain root: https://example.com/llms.txt
We found files in wrong locations like:
❌ /content/llms.txt
❌ /public/llms.txt
❌ /assets/llms.txt
❌ /wp-content/llms.txt
The Fix
✅ Correct Location
Place your llms.txt file at: https://yourdomain.com/llms.txt
Test by visiting the URL directly in your browser. If you get a 404, it's in the wrong place.
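If you'd rather script the check, here's a minimal sketch (assuming Python with the requests package installed; swap the placeholder domain for your own):
# Minimal location check -- replace the placeholder domain with your own.
import requests

url = "https://yourdomain.com/llms.txt"
response = requests.get(url, timeout=10)

if response.status_code == 200:
    print(f"Found llms.txt ({len(response.text)} bytes)")
else:
    print(f"HTTP {response.status_code} - the file is missing or in the wrong place")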
Mistake #2: Conflicting Directives
Found in: 19% of implementations
Contradictory rules confuse AI crawlers and undermine your policy.
The Problem
# ❌ BAD: Conflicting rules
User-agent: *
Allow: /blog/
Disallow: /blog/private/
Allow: /blog/private/public/ # This contradicts the above!
# Another common conflict
User-agent: GPTBot
Allow: /
Disallow: / # Which one applies?
The Fix
# ✅ GOOD: Clear hierarchy
User-agent: *
Allow: /blog/
Disallow: /blog/private/
# Don't add exceptions to exceptions
# Clear single directive
User-agent: GPTBot
Allow: /public/
Disallow: /private/
Mistake #3: Missing User-Agent Declarations
Found in: 16% of implementations
Directives without user-agent declarations are ignored.
The Problem
# ❌ BAD: No user-agent specified
Allow: /blog/
Disallow: /admin/
# AI crawlers don't know who this applies to!
The Fix
# ✅ GOOD: Always specify user-agent
User-agent: *
Allow: /blog/
Disallow: /admin/
Mistake #4: Overly Aggressive Crawl Delays
Found in: 14% of implementations
Excessive delays can cause AI crawlers to give up entirely.
The Problem
# ❌ BAD: Way too aggressive
User-agent: *
Crawl-delay: 60 # 60 seconds is excessive!
User-agent: GPTBot
Crawl-delay: 300 # 5 minutes? Really?
The Fix
# ✅ GOOD: Reasonable delays
User-agent: *
Crawl-delay: 2 # 2 seconds is respectful
User-agent: GPTBot
Crawl-delay: 1 # 1 second for trusted bots
Recommended delays: 1-5 seconds for most sites, 5-10 seconds only if you have severe bandwidth constraints.
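To put those numbers in perspective: at a 60-second delay a crawler can fetch at most 86,400 / 60 = 1,440 pages per day from your site, while a 2-second delay still caps it at 43,200.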
Mistake #5: Blocking Everything (Unnecessarily)
Found in: 12% of implementations
Many sites block all AI training without considering the benefits.
The Problem
We found many sites with blanket blocks, even when they had public educational content that would benefit from AI visibility:
# ❌ BAD: Blocking everything
User-agent: *
Disallow: /
# Even public blog posts and documentation!
The Fix
Use selective policies that protect sensitive content while allowing public content:
# ✅ GOOD: Selective policy
User-agent: *
Allow: /blog/
Allow: /docs/
Allow: /about/
Disallow: /admin/
Disallow: /user/
Disallow: /premium/
Mistake #6: Incorrect Wildcard Usage
Found in: 11% of implementations
Wildcards don't work the way most people think.
The Problem
# ❌ BAD: Incorrect wildcard syntax
User-agent: *
Disallow: /user*/ # Doesn't work as expected
Disallow: /*.pdf # Wrong syntax
The Fix
# ✅ GOOD: Correct wildcard usage
User-agent: *
Disallow: /user # Blocks /user, /users, /user123, etc.
Disallow: /*.pdf$ # Block PDF files (if supported)
Note: Not all AI crawlers support advanced wildcards. Keep patterns simple.
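If it helps to see why the simple form is usually enough, here's a small illustration (a Python sketch of generic prefix matching, not the behavior of any specific crawler):
# Illustration: robots.txt-style rules are typically matched as path prefixes.
disallowed_prefix = "/user"

for path in ["/user", "/users/profile", "/user123/settings", "/blog/post"]:
    verdict = "blocked" if path.startswith(disallowed_prefix) else "allowed"
    print(f"{path}: {verdict}")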
Mistake #7: No Documentation or Comments
Found in: 34% of implementations
Files with no comments are hard to maintain and update.
The Problem
User-agent: *
Allow: /blog/
Disallow: /x/
Disallow: /y/
Disallow: /z/
What are /x/, /y/, and /z/? Why are they blocked? Future maintainers (or even you in 6 months) won't know.
The Fix
# llms.txt - AI Training Policy
# Last updated: 2025-10-07
# Contact: ai-policy@example.com
# Allow public content
User-agent: *
Allow: /blog/
# Block internal tools
Disallow: /admin/ # Admin dashboard
Disallow: /staging/ # Staging environment
Disallow: /test/ # Test pages
Mistake #8: Forgetting About Subdomains
Found in: 9% of implementations
Each subdomain needs its own llms.txt file.
The Problem
Sites with multiple subdomains often only implement llms.txt on the main domain:
- ✅ example.com/llms.txt - Has policy
- ❌ blog.example.com/llms.txt - Missing!
- ❌ docs.example.com/llms.txt - Missing!
- ❌ api.example.com/llms.txt - Missing!
The Fix
Create llms.txt files for each subdomain with appropriate policies for that subdomain's content.
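A quick way to audit coverage is to loop over your subdomains, as in this sketch (Python with the requests package; the subdomain list is an example to adapt to your site):
# Sketch: check each subdomain for an llms.txt file.
import requests

domain = "example.com"
subdomains = ["www", "blog", "docs", "api"]

for sub in subdomains:
    url = f"https://{sub}.{domain}/llms.txt"
    try:
        status = requests.get(url, timeout=10).status_code
    except requests.RequestException:
        status = None
    result = "OK" if status == 200 else f"missing or unreachable (HTTP {status})"
    print(f"{url}: {result}")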
Mistake #9: Not Testing the Implementation
Found in: 27% of implementations
Many files have syntax errors that could have been caught with basic testing.
Common Validation Errors
- Typos in directive names (Dissallow instead of Disallow)
- Missing colons after directives
- Invalid user-agent names
- Incorrect line breaks or encoding
The Fix
Test before you deploy: fetch the file in your browser to confirm it loads as plain text, and run it through a validator (such as the free validator on LLMS Central) to catch typos, missing colons, and malformed directives.
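For an automated check in a build or deploy step, a rough sketch like the following (plain Python, not a full parser; the directive list and file path are assumptions) catches the errors listed above:
# Rough sanity check for an llms.txt file -- not a full parser.
KNOWN_DIRECTIVES = {"user-agent", "allow", "disallow", "crawl-delay"}

def check_llms_txt(path="llms.txt"):
    problems = []
    with open(path, encoding="utf-8") as f:
        for number, line in enumerate(f, start=1):
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # skip blank lines and comments
            if ":" not in line:
                problems.append(f"line {number}: missing colon")
                continue
            directive = line.split(":", 1)[0].strip().lower()
            if directive not in KNOWN_DIRECTIVES:
                problems.append(f"line {number}: unknown directive '{directive}'")
    return problems

for problem in check_llms_txt():
    print(problem)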
Mistake #10: Never Updating the Policy
Found in: 41% of implementations
The AI landscape changes rapidly. Set-and-forget doesn't work.
The Problem
We found llms.txt files that:
- Don't mention new AI crawlers (Google-Extended, Applebot-Extended)
- Reference deprecated bot names
- Have outdated contact information
- Block paths that no longer exist
- Allow paths that are now sensitive
The Fix
✅ Regular Review Schedule
- Quarterly reviews - Check for new AI bots and policy changes
- After site updates - Update when you add/remove content sections
- Monitor compliance - Track which bots are actually visiting (see the sketch after this list)
- Document changes - Keep a changelog in comments
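For the monitoring point, even a crude log scan tells you which AI crawlers are actually hitting your site. Here's a sketch (it assumes your web server records user-agent strings in access.log; the bot list is an example, not exhaustive):
# Sketch: count AI-crawler requests in a web server access log.
from collections import Counter

AI_BOTS = ["GPTBot", "ClaudeBot", "Google-Extended", "Applebot-Extended", "PerplexityBot", "CCBot"]

hits = Counter()
with open("access.log", encoding="utf-8") as log:
    for line in log:
        for bot in AI_BOTS:
            if bot.lower() in line.lower():
                hits[bot] += 1

for bot, count in hits.most_common():
    print(f"{bot}: {count} requests")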
Bonus: Positive Patterns We Found
Not everything was bad! Here are patterns from the best implementations:
✅ Clear Documentation
Top implementations include detailed comments, contact info, and last-updated dates.
✅ Selective Policies
The best sites allow public content while protecting sensitive areas—not all-or-nothing.
✅ Bot-Specific Rules
Sophisticated implementations have different rules for different AI systems based on their use cases.
✅ Reasonable Crawl Delays
Best practices use 1-5 second delays—enough to prevent overload without being excessive.
Your Action Plan
Audit Your llms.txt File
1. Check location: Is it at yourdomain.com/llms.txt?
2. Validate syntax: Use our free validator to catch errors
3. Review directives: Check for conflicts and missing user-agents
4. Add documentation: Include comments and contact info
5. Set a review schedule: Put quarterly audits on your calendar
Get It Right the First Time
Use our free tools to create and validate your llms.txt file.
Conclusion
After analyzing 2,000+ llms.txt files, the pattern is clear: most mistakes are preventable with basic validation and testing.
The good news? Fixing these issues is straightforward. Use the checklist above, validate your implementation, and set a reminder to review quarterly.
Your AI training policy is too important to get wrong. Take 15 minutes today to audit your llms.txt file—your future self will thank you.