I Analyzed 2,000+ llms.txt Files. Here's What Most Websites Get Wrong.
Surprising insights from analyzing thousands of llms.txt implementations—and how to avoid these common mistakes.
Key Finding
67% of llms.txt files contain at least one critical error that undermines their effectiveness. Here's what we found.
The Research
As the maintainers of LLMS Central, we've validated and analyzed over 2,000 llms.txt files from websites across 15 industries. What we discovered was eye-opening: most websites are making preventable mistakes that reduce the effectiveness of their AI training policies.
This article breaks down the 10 most common mistakes, ranked by frequency, with real examples and fixes.
Mistake #1: Wrong File Location
Found in: 23% of implementations
The single most common mistake—and the most damaging.
The Problem
AI crawlers look for llms.txt at your domain root: https://example.com/llms.txt
We found files in wrong locations like:
❌ /content/llms.txt
❌ /public/llms.txt
❌ /assets/llms.txt
❌ /wp-content/llms.txt
The Fix
✅ Correct Location
Place your llms.txt file at: https://yourdomain.com/llms.txt
Test by visiting the URL directly in your browser. If you get a 404, it's in the wrong place.
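If you'd rather script the check, here's a minimal sketch (assuming Python with the requests package installed; swap the placeholder domain for your own):
# Minimal location check -- replace the placeholder domain with your own.
import requests

url = "https://yourdomain.com/llms.txt"
response = requests.get(url, timeout=10)

if response.status_code == 200:
    print(f"Found llms.txt ({len(response.text)} bytes)")
else:
    print(f"HTTP {response.status_code} - the file is missing or in the wrong place")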
Mistake #2: Conflicting Directives
Found in: 19% of implementations
Contradictory rules confuse AI crawlers and undermine your policy.
The Problem
# ❌ BAD: Conflicting rules
User-agent: *
Allow: /blog/
Disallow: /blog/private/
Allow: /blog/private/public/ # This contradicts the above!
# Another common conflict
User-agent: GPTBot
Allow: /
Disallow: / # Which one applies?
The Fix
# ✅ GOOD: Clear hierarchy
User-agent: *
Allow: /blog/
Disallow: /blog/private/
# Don't add exceptions to exceptions
# Clear single directive
User-agent: GPTBot
Allow: /public/
Disallow: /private/
Mistake #3: Missing User-Agent Declarations
Found in: 16% of implementations
Directives without user-agent declarations are ignored.
The Problem
# ❌ BAD: No user-agent specified
Allow: /blog/
Disallow: /admin/
# AI crawlers don't know who this applies to!
The Fix
# ✅ GOOD: Always specify user-agent
User-agent: *
Allow: /blog/
Disallow: /admin/
Mistake #4: Overly Aggressive Crawl Delays
Found in: 14% of implementations
Excessive delays can cause AI crawlers to give up entirely.
The Problem
# ❌ BAD: Way too aggressive
User-agent: *
Crawl-delay: 60 # 60 seconds is excessive!
User-agent: GPTBot
Crawl-delay: 300 # 5 minutes? Really?
The Fix
# ✅ GOOD: Reasonable delays
User-agent: *
Crawl-delay: 2 # 2 seconds is respectful
User-agent: GPTBot
Crawl-delay: 1 # 1 second for trusted bots
Recommended delays: 1-5 seconds for most sites, 5-10 seconds only if you have severe bandwidth constraints.
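To put those numbers in perspective: at a 60-second delay a crawler can fetch at most 86,400 / 60 = 1,440 pages per day from your site, while a 2-second delay still caps it at 43,200.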
Mistake #5: Blocking Everything (Unnecessarily)
Found in: 12% of implementations
Many sites block all AI training without considering the benefits.
The Problem
We found many sites with blanket blocks, even when they had public educational content that would benefit from AI visibility:
# ❌ BAD: Blocking everything
User-agent: *
Disallow: /
# Even public blog posts and documentation!
The Fix
Use selective policies that protect sensitive content while allowing public content:
# ✅ GOOD: Selective policy
User-agent: *
Allow: /blog/
Allow: /docs/
Allow: /about/
Disallow: /admin/
Disallow: /user/
Disallow: /premium/
Mistake #6: Incorrect Wildcard Usage
Found in: 11% of implementations
Wildcards don't work the way most people think.
The Problem
# ❌ BAD: Incorrect wildcard syntax
User-agent: *
Disallow: /user*/ # Doesn't work as expected
Disallow: /*.pdf # Wrong syntax
The Fix
# ✅ GOOD: Correct wildcard usage
User-agent: *
Disallow: /user # Blocks /user, /users, /user123, etc.
Disallow: /*.pdf$ # Block PDF files (if supported)
Note: Not all AI crawlers support advanced wildcards. Keep patterns simple.
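If it helps to see why the simple form is usually enough, here's a small illustration (a Python sketch of generic prefix matching, not the behavior of any specific crawler):
# Illustration: robots.txt-style rules are typically matched as path prefixes.
disallowed_prefix = "/user"

for path in ["/user", "/users/profile", "/user123/settings", "/blog/post"]:
    verdict = "blocked" if path.startswith(disallowed_prefix) else "allowed"
    print(f"{path}: {verdict}")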
Mistake #7: No Documentation or Comments
Found in: 34% of implementations
Files with no comments are hard to maintain and update.
The Problem
User-agent: *
Allow: /blog/
Disallow: /x/
Disallow: /y/
Disallow: /z/
What are /x/, /y/, and /z/? Why are they blocked? Future maintainers (or even you in 6 months) won't know.
The Fix
# llms.txt - AI Training Policy
# Last updated: 2025-10-07
# Contact: ai-policy@example.com
# Allow public content
User-agent: *
Allow: /blog/
# Block internal tools
Disallow: /admin/ # Admin dashboard
Disallow: /staging/ # Staging environment
Disallow: /test/ # Test pages
Mistake #8: Forgetting About Subdomains
Found in: 9% of implementations
Each subdomain needs its own llms.txt file.
The Problem
Sites with multiple subdomains often only implement llms.txt on the main domain:
- ✅ example.com/llms.txt - Has policy
- ❌ blog.example.com/llms.txt - Missing!
- ❌ docs.example.com/llms.txt - Missing!
- ❌ api.example.com/llms.txt - Missing!
The Fix
Create llms.txt files for each subdomain with appropriate policies for that subdomain's content.
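A quick way to audit coverage is to loop over your subdomains, as in this sketch (Python with the requests package; the subdomain list is an example to adapt to your site):
# Sketch: check each subdomain for an llms.txt file.
import requests

domain = "example.com"
subdomains = ["www", "blog", "docs", "api"]

for sub in subdomains:
    url = f"https://{sub}.{domain}/llms.txt"
    try:
        status = requests.get(url, timeout=10).status_code
    except requests.RequestException:
        status = None
    result = "OK" if status == 200 else f"missing or unreachable (HTTP {status})"
    print(f"{url}: {result}")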
Mistake #9: Not Testing the Implementation
Found in: 27% of implementations
Many files have syntax errors that could have been caught with basic testing.
Common Validation Errors
- Typos in directive names (Dissallow instead of Disallow)
- Missing colons after directives
- Invalid user-agent names
- Incorrect line breaks or encoding
The Fix
Test before you deploy: fetch the file in your browser to confirm it loads as plain text, and run it through a validator (such as the free validator on LLMS Central) to catch typos, missing colons, and malformed directives.
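For an automated check in a build or deploy step, a rough sketch like the following (plain Python, not a full parser; the directive list and file path are assumptions) catches the errors listed above:
# Rough sanity check for an llms.txt file -- not a full parser.
KNOWN_DIRECTIVES = {"user-agent", "allow", "disallow", "crawl-delay"}

def check_llms_txt(path="llms.txt"):
    problems = []
    with open(path, encoding="utf-8") as f:
        for number, line in enumerate(f, start=1):
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # skip blank lines and comments
            if ":" not in line:
                problems.append(f"line {number}: missing colon")
                continue
            directive = line.split(":", 1)[0].strip().lower()
            if directive not in KNOWN_DIRECTIVES:
                problems.append(f"line {number}: unknown directive '{directive}'")
    return problems

for problem in check_llms_txt():
    print(problem)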
Mistake #10: Never Updating the Policy
Found in: 41% of implementations
The AI landscape changes rapidly. Set-and-forget doesn't work.
The Problem
We found llms.txt files that:
- Don't mention new AI crawlers (Google-Extended, Applebot-Extended)
- Reference deprecated bot names
- Have outdated contact information
- Block paths that no longer exist
- Allow paths that are now sensitive
The Fix
✅ Regular Review Schedule
- Quarterly reviews - Check for new AI bots and policy changes
- After site updates - Update when you add/remove content sections
- Monitor compliance - Track which bots are actually visiting (see the sketch after this list)
- Document changes - Keep a changelog in comments
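For the monitoring point, even a crude log scan tells you which AI crawlers are actually hitting your site. Here's a sketch (it assumes your web server records user-agent strings in access.log; the bot list is an example, not exhaustive):
# Sketch: count AI-crawler requests in a web server access log.
from collections import Counter

AI_BOTS = ["GPTBot", "ClaudeBot", "Google-Extended", "Applebot-Extended", "PerplexityBot", "CCBot"]

hits = Counter()
with open("access.log", encoding="utf-8") as log:
    for line in log:
        for bot in AI_BOTS:
            if bot.lower() in line.lower():
                hits[bot] += 1

for bot, count in hits.most_common():
    print(f"{bot}: {count} requests")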
Bonus: Positive Patterns We Found
Not everything was bad! Here are patterns from the best implementations:
✅ Clear Documentation
Top implementations include detailed comments, contact info, and last-updated dates.
✅ Selective Policies
The best sites allow public content while protecting sensitive areas—not all-or-nothing.
✅ Bot-Specific Rules
Sophisticated implementations have different rules for different AI systems based on their use cases.
✅ Reasonable Crawl Delays
Best practices use 1-5 second delays—enough to prevent overload without being excessive.
Your Action Plan
Audit Your llms.txt File
1. Check location: Is it at yourdomain.com/llms.txt?
2. Validate syntax: Use our free validator to catch errors
3. Review directives: Check for conflicts and missing user-agents
4. Add documentation: Include comments and contact info
5. Set a review schedule: Put quarterly audits on your calendar
Get It Right the First Time
Use our free tools to create and validate your llms.txt file.
Conclusion
After analyzing 2,000+ llms.txt files, the pattern is clear: most mistakes are preventable with basic validation and testing.
The good news? Fixing these issues is straightforward. Use the checklist above, validate your implementation, and set a reminder to review quarterly.
Your AI training policy is too important to get wrong. Take 15 minutes today to audit your llms.txt file—your future self will thank you.