Robots.txt Guide: Control How Search Engines Crawl Your Site
The robots.txt file is a powerful tool for controlling how search engines crawl your website. This guide will teach you everything you need to know about robots.txt syntax, directives, and best practices.
What is Robots.txt?
Robots.txt is a text file that tells search engine crawlers which URLs they can or cannot request from your site. It's part of the Robots Exclusion Protocol (REP, standardized as RFC 9309) and must be placed in the root directory of your website (e.g., https://example.com/robots.txt), since crawlers only look for it there.
While robots.txt doesn't directly impact rankings, it helps you manage crawl budget, prevent crawling of sensitive pages, and guide search engines to your most important content.
Basic Syntax and Structure
A robots.txt file consists of one or more groups of directives. Each group specifies a user-agent (search engine bot) and rules for that bot:
```
User-agent: [bot-name]
Disallow: [path]
Allow: [path]

User-agent: [another-bot]
Disallow: [path]
```
Key Directives
- User-agent: Specifies which bot the rules apply to. Use * for all bots.
- Disallow: Blocks the bot from accessing the specified path.
- Allow: Permits access to the specified path even inside a disallowed section. For Google and Bing, when Allow and Disallow both match a URL, the longest (most specific) rule wins.
- Sitemap: Points to your XML sitemap location.
- Crawl-delay: Sets a delay between requests (ignored by Google; honored by some other crawlers).
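You can check how these directives evaluate against real URLs with Python's standard-library `urllib.robotparser`. One caveat: this parser applies rules in file order and does not support `*`/`$` wildcards, unlike Google's longest-match semantics, so in this sketch the Allow rule is deliberately listed before the Disallow rule so both interpretations agree. The bot name and paths are hypothetical.

```python
from urllib import robotparser

# Hypothetical rules: Allow listed before Disallow so Python's
# first-match evaluation agrees with Google's longest-match rule.
rules = """\
User-agent: *
Allow: /admin/help
Disallow: /admin/
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("ExampleBot", "/admin/help"))    # True  (Allow matches)
print(rp.can_fetch("ExampleBot", "/admin/secret"))  # False (Disallow matches)
print(rp.can_fetch("ExampleBot", "/blog/post"))     # True  (no rule matches)
```

Since no group names `ExampleBot` specifically, the `User-agent: *` group applies to it.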
Common Robots.txt Examples
Allow All Crawlers
```
User-agent: *
Allow: /

Sitemap: https://example.com/sitemap.xml
```
Block Specific Directory
```
User-agent: *
Disallow: /admin/
Disallow: /private/
Allow: /
```
Block Specific File Types
```
User-agent: *
Disallow: /*.pdf$
Disallow: /*.zip$
Allow: /
```
Block Specific Bot
```
User-agent: BadBot
Disallow: /

User-agent: *
Allow: /
```
Crawl Budget Optimization
Crawl budget is the number of pages search engines will crawl on your site within a given timeframe. Robots.txt helps optimize this by:
- Blocking low-value pages (admin, search results, filters)
- Preventing crawling of duplicate content
- Blocking auto-generated pages with no SEO value
- Guiding crawlers to important content via sitemap
Pages to Block
- /search/ - Internal search results
- /filter/ - Filter/sort parameter pages
- /admin/ - Administrative pages
- /cart/ - Shopping cart pages
- /checkout/ - Checkout pages
- /thank-you/ - Confirmation pages
- /login/ - Authentication pages
- /*?sort= - Sort parameter URLs
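Put together, the paths above might be blocked like this. This is a sketch, not a drop-in file: the paths assume the URL structure listed above, and wildcard patterns such as `/*?sort=` are understood by Google and Bing but not by every crawler.

```
User-agent: *
Disallow: /search/
Disallow: /filter/
Disallow: /admin/
Disallow: /cart/
Disallow: /checkout/
Disallow: /thank-you/
Disallow: /login/
Disallow: /*?sort=

Sitemap: https://example.com/sitemap.xml
```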
Common Mistakes to Avoid
- Blocking CSS and JS: Search engines need these to render pages properly
- Blocking important pages: Check that you're not blocking content you want indexed
- Using robots.txt for security: It only asks crawlers not to fetch URLs; the URLs themselves remain publicly accessible, and the robots.txt file advertises their locations to anyone who reads it. Use proper authentication
- Blocking indexed pages: If pages are already indexed, use noindex instead, and don't also block them in robots.txt, since crawlers can't see a noindex tag on a page they're forbidden to fetch
- Multiple User-agent groups: A bot obeys only the most specific group that matches it, so rules under User-agent: * are ignored by any bot that has its own named group; repeat shared rules in each group
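A simple pre-deploy sanity check can catch the first two mistakes automatically. The sketch below, again using the standard-library `urllib.robotparser` (which handles prefix rules but not `*`/`$` wildcards), flags any must-crawl path that a proposed robots.txt would block; the rules, agent name, and paths are hypothetical.

```python
from urllib import robotparser

def blocked_paths(robots_lines, agent, paths):
    """Return the subset of paths the given agent may not crawl."""
    rp = robotparser.RobotFileParser()
    rp.parse(robots_lines)
    return [p for p in paths if not rp.can_fetch(agent, p)]

# Hypothetical robots.txt that accidentally blocks site assets.
robots = """\
User-agent: *
Disallow: /assets/
Disallow: /admin/
""".splitlines()

# Paths that must stay crawlable for pages to render correctly.
critical = ["/", "/assets/css/main.css", "/assets/js/app.js", "/products/"]

print(blocked_paths(robots, "Googlebot", critical))
# ['/assets/css/main.css', '/assets/js/app.js']
```

Running a check like this in CI before each deploy turns "don't block CSS and JS" from a reminder into an enforced rule.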
Testing Your Robots.txt
Always test your robots.txt file before deploying to avoid accidentally blocking important content:
- Google Search Console: Use the robots.txt report (the standalone robots.txt Tester has been retired)
- Bing Webmaster Tools: Similar testing functionality
- Online validators: Various free tools available
- URL Inspection tool: Test how Google crawls specific URLs (this replaced the old Fetch as Google feature)
Robots.txt vs Meta Robots
Understand when to use robots.txt versus meta robots tags:
| Robots.txt | Meta Robots (noindex) |
|---|---|
| Prevents crawling | Prevents indexing |
| Server-side control | Page-level control |
| Best for crawl budget | Best for removing from index |
| Can still appear in results | Won't appear in results |
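Unlike robots.txt, the noindex directive in the right-hand column is set per page, either as a meta tag in the HTML head or as an HTTP response header (the latter is useful for non-HTML files such as PDFs):

```html
<!-- In the page's <head> -->
<meta name="robots" content="noindex">
```

Or, as a response header:

```
X-Robots-Tag: noindex
```

Either way, the page must remain crawlable in robots.txt for search engines to see the directive.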
Best Practices Checklist
- Place robots.txt in the root directory
- Use lowercase filename (robots.txt)
- Include your sitemap URL
- Test before deploying changes
- Don't block CSS, JS, or images
- Use Allow to override Disallow when needed
- Keep the file simple and clean
- Monitor Google Search Console for errors
- Update when site structure changes
- Document your decisions for future reference