Robots.txt Guide: Control How Search Engines Crawl Your Site
The robots.txt file is a powerful tool for controlling how search engines crawl your website. This guide will teach you everything you need to know about robots.txt syntax, directives, and best practices.
What is Robots.txt?
Robots.txt is a text file that tells search engine crawlers which URLs they can or cannot request from your site. It's part of the Robots Exclusion Protocol (REP, standardized as RFC 9309) and must be placed in the root directory of your website (e.g., https://example.com/robots.txt), since crawlers only look for it there.
While robots.txt doesn't directly impact rankings, it helps you manage crawl budget, prevent crawling of sensitive pages, and guide search engines to your most important content.
Basic Syntax and Structure
A robots.txt file consists of one or more groups of directives. Each group specifies a user-agent (search engine bot) and rules for that bot:
```
User-agent: [bot-name]
Disallow: [path]
Allow: [path]

User-agent: [another-bot]
Disallow: [path]
```
Key Directives
- User-agent: Specifies which bot the rules apply to. Use * for all bots.
- Disallow: Blocks the bot from accessing the specified path.
- Allow: Permits access to the specified path even inside a disallowed section. For Google and Bing, when Allow and Disallow both match a URL, the longest (most specific) rule wins.
- Sitemap: Points to your XML sitemap location.
- Crawl-delay: Sets a delay between requests (ignored by Google; honored by some other crawlers).
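You can check how these directives evaluate against real URLs with Python's standard-library `urllib.robotparser`. One caveat: this parser applies rules in file order and does not support `*`/`$` wildcards, unlike Google's longest-match semantics, so in this sketch the Allow rule is deliberately listed before the Disallow rule so both interpretations agree. The bot name and paths are hypothetical.

```python
from urllib import robotparser

# Hypothetical rules: Allow listed before Disallow so Python's
# first-match evaluation agrees with Google's longest-match rule.
rules = """\
User-agent: *
Allow: /admin/help
Disallow: /admin/
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("ExampleBot", "/admin/help"))    # True  (Allow matches)
print(rp.can_fetch("ExampleBot", "/admin/secret"))  # False (Disallow matches)
print(rp.can_fetch("ExampleBot", "/blog/post"))     # True  (no rule matches)
```

Since no group names `ExampleBot` specifically, the `User-agent: *` group applies to it.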
Common Robots.txt Examples
Allow All Crawlers
```
User-agent: *
Allow: /

Sitemap: https://example.com/sitemap.xml
```
Block Specific Directory
```
User-agent: *
Disallow: /admin/
Disallow: /private/
Allow: /
```
Block Specific File Types
```
User-agent: *
Disallow: /*.pdf$
Disallow: /*.zip$
Allow: /
```
Block Specific Bot
```
User-agent: BadBot
Disallow: /

User-agent: *
Allow: /
```
Crawl Budget Optimization
Crawl budget is the number of pages search engines will crawl on your site within a given timeframe. Robots.txt helps optimize this by:
- Blocking low-value pages (admin, search results, filters)
- Preventing crawling of duplicate content
- Blocking auto-generated pages with no SEO value
- Guiding crawlers to important content via sitemap
Pages to Block
- /search/ - Internal search results
- /filter/ - Filter/sort parameter pages
- /admin/ - Administrative pages
- /cart/ - Shopping cart pages
- /checkout/ - Checkout pages
- /thank-you/ - Confirmation pages
- /login/ - Authentication pages
- /*?sort= - Sort parameter URLs
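Put together, the paths above might be blocked like this. This is a sketch, not a drop-in file: the paths assume the URL structure listed above, and wildcard patterns such as `/*?sort=` are understood by Google and Bing but not by every crawler.

```
User-agent: *
Disallow: /search/
Disallow: /filter/
Disallow: /admin/
Disallow: /cart/
Disallow: /checkout/
Disallow: /thank-you/
Disallow: /login/
Disallow: /*?sort=

Sitemap: https://example.com/sitemap.xml
```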
Common Mistakes to Avoid
- Blocking CSS and JS: Search engines need these to render pages properly
- Blocking important pages: Check that you're not blocking content you want indexed
- Using robots.txt for security: It only asks crawlers not to fetch URLs; the URLs themselves remain publicly accessible, and the robots.txt file advertises their locations to anyone who reads it. Use proper authentication
- Blocking indexed pages: If pages are already indexed, use noindex instead, and don't also block them in robots.txt, since crawlers can't see a noindex tag on a page they're forbidden to fetch
- Multiple User-agent groups: A bot obeys only the most specific group that matches it, so rules under User-agent: * are ignored by any bot that has its own named group; repeat shared rules in each group
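A simple pre-deploy sanity check can catch the first two mistakes automatically. The sketch below, again using the standard-library `urllib.robotparser` (which handles prefix rules but not `*`/`$` wildcards), flags any must-crawl path that a proposed robots.txt would block; the rules, agent name, and paths are hypothetical.

```python
from urllib import robotparser

def blocked_paths(robots_lines, agent, paths):
    """Return the subset of paths the given agent may not crawl."""
    rp = robotparser.RobotFileParser()
    rp.parse(robots_lines)
    return [p for p in paths if not rp.can_fetch(agent, p)]

# Hypothetical robots.txt that accidentally blocks site assets.
robots = """\
User-agent: *
Disallow: /assets/
Disallow: /admin/
""".splitlines()

# Paths that must stay crawlable for pages to render correctly.
critical = ["/", "/assets/css/main.css", "/assets/js/app.js", "/products/"]

print(blocked_paths(robots, "Googlebot", critical))
# ['/assets/css/main.css', '/assets/js/app.js']
```

Running a check like this in CI before each deploy turns "don't block CSS and JS" from a reminder into an enforced rule.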
Testing Your Robots.txt
Always test your robots.txt file before deploying to avoid accidentally blocking important content:
- Google Search Console: Use the robots.txt report (the standalone robots.txt Tester has been retired)
- Bing Webmaster Tools: Similar testing functionality
- Online validators: Various free tools available
- URL Inspection tool: Test how Google crawls specific URLs (this replaced the old Fetch as Google feature)
Robots.txt vs Meta Robots
Understand when to use robots.txt versus meta robots tags:
| Robots.txt | Meta Robots (noindex) |
|---|---|
| Prevents crawling | Prevents indexing |
| Server-side control | Page-level control |
| Best for crawl budget | Best for removing from index |
| Can still appear in results | Won't appear in results |
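Unlike robots.txt, the noindex directive in the right-hand column is set per page, either as a meta tag in the HTML head or as an HTTP response header (the latter is useful for non-HTML files such as PDFs):

```html
<!-- In the page's <head> -->
<meta name="robots" content="noindex">
```

Or, as a response header:

```
X-Robots-Tag: noindex
```

Either way, the page must remain crawlable in robots.txt for search engines to see the directive.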
Best Practices Checklist
- Place robots.txt in the root directory
- Use lowercase filename (robots.txt)
- Include your sitemap URL
- Test before deploying changes
- Don't block CSS, JS, or images
- Use Allow to override Disallow when needed
- Keep the file simple and clean
- Monitor Google Search Console for errors
- Update when site structure changes
- Document your decisions for future reference