Robots.txt Generator
Create, customize, and validate your robots.txt with AI crawler management
Note: Google ignores the Crawl-delay directive; Bing and Yandex support it.
What is Robots.txt?
A robots.txt file is a plain text file placed at your website’s root (e.g., https://example.com/robots.txt) that tells crawlers which pages they can or cannot access. It follows the Robots Exclusion Protocol, defined in RFC 9309.
Every major search engine respects robots.txt. When Googlebot, Bingbot, or any compliant crawler arrives, it checks /robots.txt first. Robots.txt controls crawling (fetching pages) — not indexing (appearing in search results). For indexing control, use noindex meta tags.
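As a quick illustration, a minimal robots.txt might look like the sketch below (the paths and domain are placeholders):

```txt
# Apply to all crawlers
User-agent: *
# Keep them out of the admin area
Disallow: /admin/

# Sitemap must be an absolute URL
Sitemap: https://example.com/sitemap.xml
```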
Crawl Control
Specify which paths crawlers can and cannot access on your site.
AI Bot Management
Block AI training crawlers while allowing AI search bots for visibility.
Sitemap Discovery
Point crawlers to your XML sitemap for better content discovery.
AI Crawlers: Training vs. Search
In 2026, the most critical robots.txt decision is managing AI crawlers. There are two distinct categories:
AI Training Crawlers
These bots scrape content to build datasets for large language models. Blocking them prevents your content from being used for model training but has no effect on search visibility. Nearly 21% of top websites now reference GPTBot in their robots.txt.
AI Search Crawlers
These bots fetch pages on-demand for AI-powered search results (ChatGPT browsing, Perplexity, Google AI Overviews). Allowing them means your content can appear as a cited source in AI search, driving traffic to your site.
| Bot | Owner | Type | Recommendation |
|---|---|---|---|
| GPTBot | OpenAI | Training | Block if protecting content |
| Google-Extended | Google | Training | Block to opt out of Gemini training |
| ClaudeBot | Anthropic | Training | Block if protecting content |
| CCBot | Common Crawl | Training | Block to reduce AI dataset inclusion |
| ChatGPT-User | OpenAI | Search | Allow for AI search visibility |
| PerplexityBot | Perplexity | Search | Allow for AI search citations |
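Putting the table above into practice, a robots.txt that opts out of AI training while staying visible in AI search might look like this sketch (the user-agent tokens come from the table; the default rules at the end are placeholders):

```txt
# Block AI training crawlers
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

# Explicitly allow AI search crawlers
User-agent: ChatGPT-User
Allow: /

User-agent: PerplexityBot
Allow: /

# Default rules for all other crawlers
User-agent: *
Disallow: /admin/
```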
Directives Reference
| Directive | Purpose | Example |
|---|---|---|
| User-agent | Target a specific crawler | User-agent: Googlebot |
| Disallow | Block a path from crawling | Disallow: /admin/ |
| Allow | Override a broader Disallow | Allow: /admin/ajax.php |
| Sitemap | Point to your XML sitemap | Sitemap: https://example.com/sitemap.xml |
| Crawl-delay | Seconds between requests | Crawl-delay: 10 |
| * (wildcard) | Match any sequence of characters | Disallow: /*.pdf$ |
| $ (end anchor) | Match the end of a URL | Disallow: /search$ |
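A sketch combining the pattern-matching directives (the paths are made-up examples):

```txt
User-agent: *
# Block every PDF anywhere on the site: * matches any string, $ anchors the end
Disallow: /*.pdf$
# Block exactly /search; without the $, /search-tips/ would also be blocked
Disallow: /search$
# Block any URL containing a session parameter
Disallow: /*?sessionid=
```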
Best Practices
Do:
- Test your robots.txt before deploying
- Use absolute URLs for Sitemap directives
- Include a Sitemap reference for better discovery
- Review quarterly — new bots appear regularly
- Use Allow to override broader Disallow rules
- Block AI training bots if protecting content

Don't:
- Use robots.txt for security (the file is publicly readable)
- Block CSS/JS files (search engines need them to render pages)
- Expect robots.txt to remove already-indexed pages (use noindex instead)
- Use Disallow: / unless you actually want to block the entire site
- Forget the trailing slash on directory paths
- Assume all bots respect robots.txt (malicious crawlers ignore it)
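The "test before deploying" practice can be scripted. Below is a minimal sketch using Python's standard-library urllib.robotparser; the rules and URLs are made-up examples. Note that Python's parser applies rules in file order rather than by longest match, so the more specific Allow line is listed first here.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules to verify before deploying
rules = """\
User-agent: *
Allow: /admin/ajax.php
Disallow: /admin/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# Check representative URLs against the draft rules
for url in ("https://example.com/admin/settings",
            "https://example.com/admin/ajax.php",
            "https://example.com/blog/post"):
    verdict = "allowed" if rp.can_fetch("*", url) else "blocked"
    print(url, "->", verdict)
```

Running checks like this in CI catches an accidental site-wide Disallow before it reaches production.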
Common Mistakes
| Mistake | Impact | Fix |
|---|---|---|
| Blocking CSS/JS files | Search engines can’t render pages correctly | Allow /wp-content/, /assets/ |
| Using robots.txt for noindex | Pages may still appear in SERPs via backlinks | Use <meta name="robots" content="noindex"> |
| Relative sitemap URLs | Crawlers can’t find your sitemap | Use full URL: https://example.com/sitemap.xml |
| Blocking the entire site accidentally | Complete de-indexing over time | Use specific paths instead of Disallow: / |
| Not managing AI crawlers | Content used for AI training without consent | Explicitly block unwanted AI bots by user-agent |
| Forgetting case sensitivity | Rules may not match intended paths | Match the exact case of your URL paths |
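Expanding on the noindex fix above: for the directive to work, the page must stay crawlable so engines can actually see it. Two standard forms:

```txt
# Option 1: meta tag in the page's HTML <head>
<meta name="robots" content="noindex">

# Option 2: X-Robots-Tag HTTP response header
# (works for PDFs and other non-HTML files)
X-Robots-Tag: noindex
```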
Frequently Asked Questions
Does robots.txt affect search rankings?
Not directly. Robots.txt controls which pages Googlebot can crawl, but it doesn't influence how pages rank once indexed. However, a misconfigured robots.txt can prevent important pages from being crawled and indexed at all, which effectively removes them from search results.
Can I block AI crawlers individually?
Yes. Each AI company uses specific user-agent tokens: for example, GPTBot for OpenAI's training crawler, ClaudeBot for Anthropic, and Google-Extended for Google's Gemini training. Block them individually with their user-agent names. Note: blocking GPTBot blocks training only — the separate ChatGPT-User agent handles AI-powered search.
Where should the robots.txt file be placed?
Always at the root of your domain: https://example.com/robots.txt. It must be accessible at this exact URL. Subdomains need their own robots.txt files — https://blog.example.com/robots.txt is separate from the main domain's file.
How quickly do robots.txt changes take effect?
Google caches robots.txt for up to 24 hours, so changes typically take effect within a day of updating. You can request a re-crawl through Google Search Console for faster processing. Bing and other engines may take longer.
Should I block SEO tool crawlers like AhrefsBot and SemrushBot?
It depends. SEO tool crawlers (AhrefsBot, SemrushBot, etc.) index your backlinks and keywords. Blocking them hides your data from competitors but also prevents you from analyzing your own site in these tools. Most sites leave them at default settings.
How often should I review my robots.txt?
At least quarterly. New AI crawlers emerge regularly, site structure changes over time, and outdated rules can harm SEO. After any major site restructure, migration, or launch, verify that your robots.txt still reflects current requirements.